[ https://issues.apache.org/jira/browse/SPARK-21374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093417#comment-16093417 ]
Andrey Taptunov edited comment on SPARK-21374 at 7/19/17 5:19 PM:
------------------------------------------------------------------

[~ste...@apache.org] Indeed, while working on the PR and debugging the code I see that the code only works by accident, because filesystem caching is turned on by default (a short sketch of that behaviour follows below).

1. Thanks for the advice. I doubt it's related to the type of the filesystem - I only mentioned the filesystem explicitly to show why "awsAccessKeyId", and not "access.key", is used with the "s3" scheme in the example. Sorry for the confusion.

2. Unfortunately, it's not that easy - this example is only a simplified version of what happens in my project. We have no information about which buckets a user will try to access in interactive mode, so I cannot enumerate them all in the configuration.
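To make the caching point concrete, here is a minimal sketch (not Spark's actual code path; the bucket name is a placeholder) of how Hadoop's FileSystem cache hands back a previously credentialed instance. FileSystem.get() caches instances keyed by scheme, authority and user, so a later lookup with a credential-less configuration can still succeed while the cache is enabled:

{code:java}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object FsCacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.s3.awsAccessKeyId", "something")
    conf.set("fs.s3.awsSecretAccessKey", "something else")

    // First lookup creates the FileSystem with the credentials above and
    // stores it in the static cache (keyed by scheme + authority + user).
    val fs1 = FileSystem.get(new URI("s3://bucket/"), conf)

    // A later lookup with a fresh, credential-less configuration returns
    // the cached instance instead of building a new one.
    val fs2 = FileSystem.get(new URI("s3://bucket/"), new Configuration())
    println(fs1 eq fs2) // true while fs.s3.impl.disable.cache is false

    // With "fs.s3.impl.disable.cache" set to true, the second call would
    // construct a brand-new FileSystem from the credential-less
    // configuration and fail - which is what the globbed DataFrame read
    // below runs into.
  }
}
{code}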
> Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-21374
>                 URL: https://issues.apache.org/jira/browse/SPARK-21374
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.2, 2.1.1
>            Reporter: Andrey Taptunov
>
> *Motivation:*
> In my case I want to disable the filesystem cache so that I can change S3's access key and secret key on the fly and read from buckets with different permissions. This works perfectly fine for RDDs but doesn't work for DataFrames.
>
> *Example (works for RDD but fails for DataFrame):*
> {code:java}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
>
> object SimpleApp {
>   def main(args: Array[String]) {
>     val awsAccessKeyId = "something"
>     val awsSecretKey = "something else"
>
>     val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
>     val sc = new SparkContext(conf)
>     sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
>     sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretKey)
>     sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache", true)
>     sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>     sc.hadoopConfiguration.set("fs.s3.buffer.dir", "/tmp")
>
>     val spark = SparkSession.builder().config(conf).getOrCreate()
>
>     val rddFile = sc.textFile("s3://bucket/file.csv").count // ok
>     val rddGlob = sc.textFile("s3://bucket/*").count // ok
>     val dfFile = spark.read.format("csv").load("s3://bucket/file.csv").count // ok
>
>     val dfGlob = spark.read.format("csv").load("s3://bucket/*").count
>     // IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified
>     // as the username or password (respectively) of a s3 URL, or by setting the
>     // fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
>
>     sc.stop()
>   }
> }
> {code}
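Not a fix, but worth noting: the exception message itself names an alternative way to pass the credentials - as the userinfo part of the s3 URL. Below is a hedged sketch of that workaround; the bucket is a placeholder, keys containing characters such as "/" or "+" would need URL-encoding first, and credentials embedded in URLs tend to end up in logs and UIs, so this is illustration only:

{code:java}
import org.apache.spark.sql.SparkSession

object InlineCredsSketch {
  def main(args: Array[String]): Unit = {
    // Placeholders, matching the reproduction above. A real secret key must
    // be URL-encoded before being embedded in the URL.
    val awsAccessKeyId = "something"
    val awsSecretKey = "somethingElse"

    val spark = SparkSession.builder()
      .appName("Inline Creds Sketch")
      .master("local[*]")
      .getOrCreate()

    // The credentials travel inside the URI, so the FileSystem created for
    // this read does not depend on what a cached (or fresh) Hadoop
    // configuration happens to contain. Depending on the Hadoop version,
    // the fs.s3.impl setting from the example above may still be needed.
    val count = spark.read.format("csv")
      .load(s"s3://$awsAccessKeyId:$awsSecretKey@bucket/*")
      .count()
    println(count)

    spark.stop()
  }
}
{code}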