[ https://issues.apache.org/jira/browse/SPARK-21374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085645#comment-16085645 ]

Steve Loughran commented on SPARK-21374:
----------------------------------------

This is possibly a sign that your new configuration isn't having its auth 
values picked up, or that they are incorrect (i.e. the properties are wrong). It 
works when caching is enabled because some other codepath has already set the 
filesystem up with the right properties, so when it is reused in the DF, the 
previous params are picked up.

# if using Hadoop 2.7.x JARs, switch to s3a and use s3a in the URLs & settings 
(see the first sketch below). You don't need to set the fs.s3a.impl field 
either; that's done for you.
# if you can upgrade to Hadoop 2.8 binaries, you can use per-bucket 
configuration; this does exactly what you want: it lets you configure different 
auth details for different buckets, without having to play these games (second 
sketch below). See 
[https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Configuring_different_S3_buckets]
 and 
[https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_cloud-data-access/content/s3-per-bucket-configs.html]
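
Roughly, the s3a switch looks like this. This is only a sketch adapted from the 
reporter's example; the bucket name and credential variables are placeholders:

{code:java}
// s3a: set the auth details under the fs.s3a.* keys; fs.s3a.impl is already
// declared in core-default.xml, so it does not need to be set by hand
sc.hadoopConfiguration.set("fs.s3a.access.key", awsAccessKeyId)
sc.hadoopConfiguration.set("fs.s3a.secret.key", awsSecretKey)

// and read with s3a:// URLs instead of s3://
val dfGlob = spark.read.format("csv").load("s3a://bucket/*").count
{code}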


Going to 2.8 binaries (or anything with the feature backported to a 2.7.x 
variant) should solve your problem without you having to worry about what you 
are seeing here.
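
For reference, the per-bucket pattern in 2.8+ looks something like the 
following. Again only a sketch: "bucket-a" and "bucket-b" are made-up bucket 
names and the credential variables are placeholders.

{code:java}
// per-bucket options override the base fs.s3a.* settings for that bucket only,
// so two buckets can use different credentials without disabling the FS cache
sc.hadoopConfiguration.set("fs.s3a.bucket.bucket-a.access.key", bucketAKeyId)
sc.hadoopConfiguration.set("fs.s3a.bucket.bucket-a.secret.key", bucketASecret)
sc.hadoopConfiguration.set("fs.s3a.bucket.bucket-b.access.key", bucketBKeyId)
sc.hadoopConfiguration.set("fs.s3a.bucket.bucket-b.secret.key", bucketBSecret)

val dfA = spark.read.format("csv").load("s3a://bucket-a/*").count
val dfB = spark.read.format("csv").load("s3a://bucket-b/*").count
{code}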

> Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-21374
>                 URL: https://issues.apache.org/jira/browse/SPARK-21374
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.2, 2.1.1
>            Reporter: Andrey Taptunov
>
> *Motivation:*
> In my case I want to disable filesystem cache to be able to change S3's 
> access key and secret key on the fly to read from buckets with different 
> permissions. This works perfectly fine for RDDs but doesn't work for DFs.
> *Example (works for RDD but fails for DataFrame):*
> {code:java}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
> object SimpleApp {
>   def main(args: Array[String]) {
>     val awsAccessKeyId = "something"
>     val awsSecretKey = "something else"
>     val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
>     val sc = new SparkContext(conf)
>     sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
>     sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretKey)
>     sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache",true)
>     sc.hadoopConfiguration.set("fs.s3.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>     sc.hadoopConfiguration.set("fs.s3.buffer.dir","/tmp")
>     val spark = SparkSession.builder().config(conf).getOrCreate()
>     val rddFile = sc.textFile("s3://bucket/file.csv").count // ok
>     val rddGlob = sc.textFile("s3://bucket/*").count // ok
>     val dfFile = spark.read.format("csv").load("s3://bucket/file.csv").count // ok
>
>     val dfGlob = spark.read.format("csv").load("s3://bucket/*").count
>     // IllegalArgumentException. AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively)
>     // of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
>
>     sc.stop()
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
