Hi,

When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop FileSystem implementation for s3:// URLs, and seems to install the necessary S3 credentials properties as well.

Often, though, it's nice during development to run outside of a cluster, even with the "local" Spark master, and I've found that to be more troublesome. I'm curious whether I'm doing this the right way. There are two issues: AWS credentials, and finding a compatible combination of AWS SDK and Hadoop S3 FileSystem dependencies.

*Credentials and Hadoop Configuration*

For credentials, some guides recommend setting the AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID environment variables or putting the corresponding properties in the Hadoop XML config files, but it seems better practice to rely on machine roles and not expose these. What I end up doing in code, when not running on EMR, is creating a DefaultAWSCredentialsProviderChain
<https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html>
and then using it to install the following properties in the Hadoop Configuration:

fs.s3.awsAccessKeyId
fs.s3n.awsAccessKeyId
fs.s3a.awsAccessKeyId
fs.s3.awsSecretAccessKey
fs.s3n.awsSecretAccessKey
fs.s3a.awsSecretAccessKey

I also set the fs.s3.impl and fs.s3n.impl properties to org.apache.hadoop.fs.s3a.S3AFileSystem to force those schemes to use the S3A implementation, since people usually use "s3://" URIs. (There's a rough sketch of this code further down.)

*SDK and File System Dependencies*

Some special combination <https://issues.apache.org/jira/browse/HADOOP-12420> of the Hadoop version, the AWS SDK version, and hadoop-aws is necessary. One working S3A combination with Spark 1.6.1 + Hadoop 2.7.x for me seems to be

--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
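For a local run, pulling those in looks something like this (spark-shell here is just an example; spark-submit takes the same --packages flag, and local[*] is whichever local master you use):

    spark-shell --master "local[*]" \
      --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2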
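And to make the credentials/impl part above concrete, this is roughly what I run at driver startup when I'm not on EMR (Scala against Spark 1.6.x; the helper name and how you decide you're not on EMR are incidental):

    import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
    import org.apache.spark.SparkContext

    // Called once at startup, only when not running on EMR.  "sc" is
    // whatever SparkContext you already have.
    def configureS3ForLocalRuns(sc: SparkContext): Unit = {
      val creds = new DefaultAWSCredentialsProviderChain().getCredentials
      val conf  = sc.hadoopConfiguration

      // Install the same credentials under all three URI schemes.
      for (scheme <- Seq("s3", "s3n", "s3a")) {
        conf.set(s"fs.${scheme}.awsAccessKeyId", creds.getAWSAccessKeyId)
        conf.set(s"fs.${scheme}.awsSecretAccessKey", creds.getAWSSecretKey)
      }

      // Route s3:// and s3n:// URIs through the S3A implementation.
      conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
      conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    }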
Is this generally what people do? Is there a better way?

I realize this isn't entirely a Spark-specific problem, but as so many people seem to be using S3 with Spark, I imagine this community's faced the problem a lot.

Thanks!

- Everett