It looks like this is related to the underlying Hadoop configuration. Try deploying the Hadoop configuration with your job via --files and --driver-class-path, or add the keys to the default /etc/hadoop/conf/core-site.xml. If that is not an option (depending on how your Hadoop cluster is set up), then hard-code the values via -Dkey=value to see if it works. The downside is that your credentials are exposed in plaintext on the java command line. You can also define them in spark-defaults.conf with the "spark.executor.extraJavaOptions" property, e.g. for s3n:

    spark.executor.extraJavaOptions "-Dfs.s3n.awsAccessKeyId=XXXXX -Dfs.s3n.awsSecretAccessKey=YYYY"

or for s3:

    spark.executor.extraJavaOptions "-Dfs.s3.awsAccessKeyId=XXXXX -Dfs.s3.awsSecretAccessKey=YYYY"

Hope this works. Or embed them in the s3n path itself, though that is not good security practice either.
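For the --files route, here is a minimal sketch of what the submit command might look like, assuming your keys live in a core-site.xml under /path/to/conf (the paths, class name, and jar are placeholders, not values from this thread):

    # Ship a core-site.xml containing fs.s3n.awsAccessKeyId /
    # fs.s3n.awsSecretAccessKey to the executors, and put the directory
    # holding it on the driver classpath so Hadoop's Configuration finds it.
    spark-submit \
      --files /path/to/conf/core-site.xml \
      --driver-class-path /path/to/conf \
      --class com.example.StreamingApp streaming-app.jar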
spark.executor.extraJavaOptions "-Dfs.s3n.awsAccessKeyId=XXXXX -Dfs.s3n.awsSecretAccessKey=YYYY" s3spark.executor.extraJavaOptions "-Dfs.s3.awsAccessKeyId=XXXXX -Dfs.s3.awsSecretAccessKey=YYYY" Hope this works. Or embed them in the s3n path. Not good security practice though. From: mslimo...@gmail.com Date: Tue, 10 Feb 2015 10:57:47 -0500 Subject: Re: hadoopConfiguration for StreamingContext To: ak...@sigmoidanalytics.com CC: u...@spark.incubator.apache.org Thanks, Akhil. I had high hopes for #2, but tried all and no luck. I was looking at the source and found something interesting. The Stack Trace (below) directs me to FileInputDStream.scala (line 141). This is version 1.1.1, btw. Line 141 has: private def fs: FileSystem = { if (fs_ == null) fs_ = directoryPath.getFileSystem(new Configuration()) fs_ } So it looks to me like it doesn't make any attempt to use a configured HadoopConf. Here is the StackTrace: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively). at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66) at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at org.apache.hadoop.fs.s3native.$Proxy5.initialize(Unknown Source) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187) at org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$fs(FileInputDStream.scala:141) at org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:107) at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:75) ... On Tue, Feb 10, 2015 at 10:28 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote: Try the following: 1. Set the access key and secret key in the sparkContext: ssc.sparkContext.hadoopConfiguration.set("AWS_ACCESS_KEY_ID",yourAccessKey) ssc.sparkContext.hadoopConfiguration.set("AWS_SECRET_ACCESS_KEY",yourSecretKey) 2. Set the access key and secret key in the environment before startingyour application: export AWS_ACCESS_KEY_ID=<your access> export AWS_SECRET_ACCESS_KEY=<your secret> 3. Set the access key and secret key inside the hadoop configurations val hadoopConf=ssc.sparkContext.hadoopConfiguration;hadoopConf.set("fs.s3.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")hadoopConf.set("fs.s3.awsAccessKeyId",yourAccessKey)hadoopConf.set("fs.s3.awsSecretAccessKey",yourSecretKey) 4. 
On Tue, Feb 10, 2015 at 10:28 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

Try the following:

1. Set the access key and secret key in the sparkContext:

    ssc.sparkContext.hadoopConfiguration.set("AWS_ACCESS_KEY_ID", yourAccessKey)
    ssc.sparkContext.hadoopConfiguration.set("AWS_SECRET_ACCESS_KEY", yourSecretKey)

2. Set the access key and secret key in the environment before starting your application:

    export AWS_ACCESS_KEY_ID=<your access>
    export AWS_SECRET_ACCESS_KEY=<your secret>

3. Set the access key and secret key inside the Hadoop configuration:

    val hadoopConf = ssc.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    hadoopConf.set("fs.s3.awsAccessKeyId", yourAccessKey)
    hadoopConf.set("fs.s3.awsSecretAccessKey", yourSecretKey)

4. You can also try embedding the credentials in the path:

    val stream = ssc.textFileStream("s3n://yourAccessKey:yourSecretKey@<yourBucket>/path/")

Thanks
Best Regards

On Tue, Feb 10, 2015 at 8:27 PM, Marc Limotte <mslimo...@gmail.com> wrote:

I see that SparkContext has a hadoopConfiguration() method, which can be used like this sample I found:

    sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XXXXXX");
    sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "XXXXXX");

But StreamingContext doesn't have the same thing. I want to use a StreamingContext with s3n: text file input, but can't find a way to set the AWS credentials. I also tried (with no success):

- adding the properties to conf/spark-defaults.conf
- $HADOOP_HOME/conf/hdfs-site.xml
- ENV variables
- embedding them as user:password in s3n://user:password@... (w/ URL encoding)
- setting the conf as above on a new SparkContext and passing that to the StreamingContext constructor: StreamingContext(sparkContext: SparkContext, batchDuration: Duration)

Can someone point me in the right direction for setting AWS creds (Hadoop conf options) for a StreamingContext?

thanks,
Marc Limotte
Climate Corporation
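One note on the embedded-credentials attempt: AWS secret keys often contain '/' characters, which break the s3n URI unless they are percent-encoded. A hedged sketch of that variant (placeholder keys and bucket, mirroring the s3n://user:password@ form tried above; URLEncoder is used here as an approximation of URI percent-encoding):

    import java.net.URLEncoder

    // Placeholder credentials and bucket -- not values from this thread.
    // Encoding turns '/' in a secret key into %2F so the URI parses.
    val accessKey = URLEncoder.encode("YOUR_ACCESS_KEY", "UTF-8")
    val secretKey = URLEncoder.encode("YOUR_SECRET_KEY", "UTF-8")
    val stream = ssc.textFileStream(s"s3n://$accessKey:$secretKey@your-bucket/path/")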