Hi Everett,

I always do my initial data exploration and all of our product development in my local dev env. I typically select a small data set and copy it to my local machine.

My main() has an optional command line argument, --runLocal. Normally I load data from either hdfs:/// or s3n://; if the arg is set, I read from file:/// instead. Sometimes I also use a CLI arg, --dataFileURL.
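In rough terms the pattern looks something like this (just a sketch in Scala; the argument parsing is simplified, and the bucket and paths are placeholders rather than anything real):

import org.apache.spark.{SparkConf, SparkContext}

object SampleJob {
  def main(args: Array[String]): Unit = {
    // --runLocal switches the job from cluster storage to a local copy
    val runLocal = args.contains("--runLocal")

    // Optional explicit override, e.g. --dataFileURL=file:///tmp/sample.csv
    val dataFileURL = args
      .find(_.startsWith("--dataFileURL="))
      .map(_.stripPrefix("--dataFileURL="))

    val inputPath = dataFileURL.getOrElse {
      if (runLocal) "file:///home/me/data/sample.csv"   // small local copy
      else "s3n://my-bucket/path/to/full/data"          // or hdfs:///...
    }

    val sc = new SparkContext(new SparkConf().setAppName("sample-job"))
    val lines = sc.textFile(inputPath)
    println(s"read ${lines.count()} lines from $inputPath")
    sc.stop()
  }
}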
So in your case, I would log into my data cluster, use "aws s3 cp" to copy the data into the cluster, and then use "scp" to copy the data from the data center back to my local env.

Andy

From: Everett Anderson <ever...@nuna.com.INVALID>
Date: Tuesday, July 19, 2016 at 2:30 PM
To: "user @spark" <user@spark.apache.org>
Subject: Role-based S3 access outside of EMR

> Hi,
>
> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
> FileSystem implementation for s3:// URLs and seems to install the necessary
> S3 credentials properties, as well.
>
> Often, it's nice during development to run outside of a cluster even with
> the "local" Spark master, though, which I've found to be more troublesome.
> I'm curious if I'm doing this the right way.
>
> There are two issues -- AWS credentials and finding the right combination of
> compatible AWS SDK and Hadoop S3 FileSystem dependencies.
>
> Credentials and Hadoop Configuration
>
> For credentials, some guides recommend setting AWS_SECRET_ACCESS_KEY and
> AWS_ACCESS_KEY_ID environment variables or putting the corresponding
> properties in Hadoop XML config files, but it seems better practice to rely
> on machine roles and not expose these.
>
> What I end up doing is, in code, when not running on EMR, creating a
> DefaultAWSCredentialsProviderChain
> <https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html>
> and then installing the following properties in the Hadoop Configuration
> using it:
>
> fs.s3.awsAccessKeyId
> fs.s3n.awsAccessKeyId
> fs.s3a.awsAccessKeyId
> fs.s3.awsSecretAccessKey
> fs.s3n.awsSecretAccessKey
> fs.s3a.awsSecretAccessKey
>
> I also set the fs.s3.impl and fs.s3n.impl properties to
> org.apache.hadoop.fs.s3a.S3AFileSystem to force them to use the S3A
> implementation since people usually use "s3://" URIs.
>
> SDK and File System Dependencies
>
> Some special combination <https://issues.apache.org/jira/browse/HADOOP-12420>
> of the Hadoop version, AWS SDK version, and hadoop-aws is necessary.
>
> One working S3A combination with Spark 1.6.1 + Hadoop 2.7.x for me seems to
> be with
>
> --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
>
> Is this generally what people do? Is there a better way?
>
> I realize this isn't entirely a Spark-specific problem, but as so many people
> seem to be using S3 with Spark, I imagine this community's faced the problem
> a lot.
>
> Thanks!
>
> - Everett
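(For anyone finding this thread later: a minimal sketch of the credential setup Everett describes above, assuming the Java AWS SDK v1 and Hadoop 2.7.x per his --packages line. The object and method names here are made up for illustration, not an established API.)

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import org.apache.hadoop.conf.Configuration

object S3CredentialsSetup {
  // Pull keys from the default provider chain (env vars, ~/.aws/credentials,
  // instance profile, ...) and copy them into the Hadoop conf, then point
  // s3:// and s3n:// at the S3A implementation.
  def installS3Credentials(hadoopConf: Configuration): Unit = {
    val creds = new DefaultAWSCredentialsProviderChain().getCredentials

    for (scheme <- Seq("fs.s3", "fs.s3n", "fs.s3a")) {
      hadoopConf.set(s"$scheme.awsAccessKeyId", creds.getAWSAccessKeyId)
      hadoopConf.set(s"$scheme.awsSecretAccessKey", creds.getAWSSecretKey)
    }

    hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  }
}

You would call it as S3CredentialsSetup.installS3Credentials(sc.hadoopConfiguration) before reading any s3:// paths, having launched with --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2 as in the quoted message.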