[ https://issues.apache.org/jira/browse/HADOOP-18448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran resolved HADOOP-18448.
-------------------------------------
    Resolution: Invalid

> s3a endpoint per bucket configuration in pyspark is ignored
> -----------------------------------------------------------
>
>                 Key: HADOOP-18448
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18448
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.1
>            Reporter: Einav Hollander
>            Priority: Major
>
> I'm using an EMR emr-6.5.0 cluster in us-east-1 with EC2 instances. The cluster is
> running a Spark application using PySpark 3.2.1.
> EMR is using Hadoop distribution: Amazon 3.2.1.
> My Spark application reads from one bucket in us-west-2 and writes to a
> bucket in us-east-1.
> Since I'm processing a large amount of data, I'm paying a lot for network
> transfer. To reduce the cost I created a VPC interface endpoint to S3 in
> us-west-2. Inside the Spark application I'm using the AWS CLI to read the
> file names from the us-west-2 bucket, and that works through the S3
> interface endpoint; but when I use PySpark to read the data, it uses the
> us-east-1 S3 endpoint instead of the us-west-2 endpoint.
> I tried to use per-bucket configuration, but it is ignored, even though I
> added it both to the default configuration and to the spark-submit call.
> I tried to set the following configuration options, but they are ignored:
> '--conf', "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
> '--conf', "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem",
> '--conf', "spark.hadoop.fs.s3a.bucket.<us-west-2-bucket-name>.endpoint=<my vpc endpoint>",
> '--conf', "spark.hadoop.fs.s3a.bucket.<us-west-2-bucket-name>.endpoint.region=us-west-2",
> '--conf', "spark.hadoop.fs.s3a.bucket.<us-east-1-bucket-name>.endpoint=<vpc gateway endpoint>",
> '--conf', "spark.hadoop.fs.s3a.bucket.<us-east-1-bucket-name>.endpoint.region=us-east-1",
> '--conf', "spark.hadoop.fs.s3a.path.style.access=false"

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
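For readers following the thread: the per-bucket override pattern the reporter is attempting can be written as a single spark-submit invocation. This is a sketch only; the bucket names, the VPC endpoint hostname, and app.py are placeholders invented for illustration, not values from the report:

```shell
# Sketch of S3A per-bucket endpoint overrides on the spark-submit command line.
# fs.s3a.bucket.<bucket>.<option> overrides fs.s3a.<option> for that bucket only;
# the "spark.hadoop." prefix routes each option into the Hadoop Configuration.
spark-submit \
  --conf "spark.hadoop.fs.s3a.bucket.my-west-bucket.endpoint=https://vpce-0123456789abcdef0.s3.us-west-2.vpce.amazonaws.com" \
  --conf "spark.hadoop.fs.s3a.bucket.my-west-bucket.endpoint.region=us-west-2" \
  --conf "spark.hadoop.fs.s3a.bucket.my-east-bucket.endpoint.region=us-east-1" \
  app.py
```

Note that the resolution of "Invalid" suggests checking whether the Hadoop version in use actually supports the options being set; `fs.s3a.endpoint.region` is documented for the 3.3.x S3A connector, and the report is against 3.2.1.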