einavh opened a new pull request, #4861:
URL: https://github.com/apache/hadoop/pull/4861

   
   I'm using an EMR emr-6.5.0 cluster in us-east-1 with EC2 instances. The 
cluster is running a Spark application using PySpark 3.2.1.
   EMR is using Hadoop distribution: Amazon 3.2.1.
   
   My Spark application reads from one bucket in us-west-2 and writes to 
a bucket in us-east-1.
   
   Since I'm processing a large amount of data, I'm paying a lot for the 
cross-region network transfer. To reduce the cost, I created a VPC interface 
endpoint to S3 in us-west-2. Inside the Spark application I use the AWS CLI 
to list the file names in the us-west-2 bucket, and that traffic goes through 
the S3 interface endpoint, but when I use PySpark to read the data it goes 
through the us-east-1 S3 endpoint instead of the us-west-2 endpoint.
   I tried to use per-bucket configuration, but it is ignored even though I 
added it both to the default configuration and to the spark-submit call.
   
   I tried to set the following configuration options, but they are ignored:
     '--conf', "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
     '--conf', "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem",
     '--conf', "spark.hadoop.fs.s3a.bucket.<us-west-2-bucket-name>.endpoint=<us-west-2 endpoint>",
     '--conf', "spark.hadoop.fs.s3a.bucket.<us-west-2-bucket-name>.endpoint.region=us-west-2",
     '--conf', "spark.hadoop.fs.s3a.bucket.<us-east-1-bucket-name>.endpoint=<us-east-1 endpoint>",
     '--conf', "spark.hadoop.fs.s3a.bucket.<us-east-1-bucket-name>.endpoint.region=us-east-1",
     '--conf', "spark.hadoop.fs.s3a.path.style.access=false",
     '--conf', "spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true",
     '--conf', "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true",
     '--conf', "-Dfs.s3a.bucket.<us-east-1-bucket-name>.endpoint=<us-east-1 endpoint>",
     '--conf', "-Dfs.s3a.bucket.<us-west-2-bucket-name>.endpoint=<us-west-2 endpoint>",
     '--conf', "spark.eventLog.enabled=false",
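
   For reference, a minimal sketch of how per-bucket S3A endpoints are 
normally passed through spark-submit. The bucket names (`src-bucket`, 
`dst-bucket`) and the VPC interface endpoint hostname are placeholders, not 
values from this report; only options prefixed with `spark.hadoop.` reach the 
Hadoop configuration, which is why the bare `-Dfs.s3a...` variants above would 
not take effect:

```shell
# Hypothetical invocation; bucket names and the endpoint URL are placeholders.
spark-submit \
  --conf spark.hadoop.fs.s3a.bucket.src-bucket.endpoint="https://<vpce-id>.s3.us-west-2.vpce.amazonaws.com" \
  --conf spark.hadoop.fs.s3a.bucket.src-bucket.endpoint.region=us-west-2 \
  --conf spark.hadoop.fs.s3a.bucket.dst-bucket.endpoint.region=us-east-1 \
  app.py
```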
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
