Hi Everett

I always do my initial data exploration and all our product development in
my local dev env. I typically select a small data set and copy it to my
local machine.

My main() has an optional command line argument '--runLocal'. Normally I
load data from either hdfs:/// or s3n://. If the arg is set I read from
file:///.

Sometimes I use a CLI arg '--dataFileURL'.
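
A stripped-down sketch of the idea (the paths are placeholders and the arg
parsing is simplified):

    import org.apache.spark.{SparkConf, SparkContext}

    object Explore {
      def main(args: Array[String]): Unit = {
        // --runLocal switches the input from the cluster file system to the
        // small local copy of the data; --dataFileURL could override the path
        // entirely.
        val runLocal = args.contains("--runLocal")
        val inputPath =
          if (runLocal) "file:///Users/me/sample-data/"  // placeholder local copy
          else "hdfs:///data/full/"                      // or an s3n:// URL

        val conf = new SparkConf().setAppName("explore")
        if (runLocal) conf.setMaster("local[*]")  // cluster runs get the master from spark-submit
        val sc = new SparkContext(conf)

        val lines = sc.textFile(inputPath)
        println(s"Read ${lines.count()} records from $inputPath")
        sc.stop()
      }
    }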

So in your case I would log into my data cluster and use "aws s3 cp" to copy
the data into my cluster, and then use "scp" to copy the data from the data
center back to my local env.

Andy

From:  Everett Anderson <ever...@nuna.com.INVALID>
Date:  Tuesday, July 19, 2016 at 2:30 PM
To:  "user @spark" <user@spark.apache.org>
Subject:  Role-based S3 access outside of EMR

> Hi,
> 
> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
> FileSystem implementation for s3:// URLs and seems to install the necessary S3
> credentials properties, as well.
> 
> Often, it's nice during development to run outside of a cluster even with the
> "local" Spark master, though, which I've found to be more troublesome. I'm
> curious if I'm doing this the right way.
> 
> There are two issues -- AWS credentials and finding the right combination of
> compatible AWS SDK and Hadoop S3 FileSystem dependencies.
> 
> Credentials and Hadoop Configuration
> 
> For credentials, some guides recommend setting AWS_SECRET_ACCESS_KEY and
> AWS_ACCESS_KEY_ID environment variables or putting the corresponding
> properties in Hadoop XML config files, but it seems better practice to rely on
> machine roles and not expose these.
> 
> What I end up doing is, in code, when not running on EMR, creating a
> DefaultAWSCredentialsProviderChain
> <https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html>
> and then installing the following properties in the Hadoop Configuration
> using it:
> 
> fs.s3.awsAccessKeyId
> fs.s3n.awsAccessKeyId
> fs.s3a.awsAccessKeyId
> fs.s3.awsSecretAccessKey
> fs.s3n.awsSecretAccessKey
> fs.s3a.awsSecretAccessKey
> 
> I also set the fs.s3.impl and fs.s3n.impl properties to
> org.apache.hadoop.fs.s3a.S3AFileSystem to force them to use the S3A
> implementation since people usually use "s3://" URIs.
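> 
> For concreteness, a rough sketch of that setup (assuming a plain
> SparkContext and the getCredentials API from the 1.7.x SDK):
> 
>     import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>     import org.apache.spark.{SparkConf, SparkContext}
> 
>     val sc = new SparkContext(
>       new SparkConf().setAppName("s3-local-dev").setMaster("local[*]"))
>     val hadoopConf = sc.hadoopConfiguration
> 
>     // Resolve credentials from the machine role / environment instead of
>     // hard-coding them, then hand them to the s3, s3n, and s3a connectors.
>     val creds = new DefaultAWSCredentialsProviderChain().getCredentials
>     for (scheme <- Seq("s3", "s3n", "s3a")) {
>       hadoopConf.set(s"fs.${scheme}.awsAccessKeyId", creds.getAWSAccessKeyId)
>       hadoopConf.set(s"fs.${scheme}.awsSecretAccessKey", creds.getAWSSecretKey)
>     }
> 
>     // Route s3:// and s3n:// URIs through the S3A implementation.
>     hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
>     hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")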
> 
> SDK and File System Dependencies
> 
> Some special combination <https://issues.apache.org/jira/browse/HADOOP-12420>
> of the Hadoop version, AWS SDK version, and hadoop-aws is necessary.
> 
> One working S3A combination with Spark 1.6.1 + Hadoop 2.7.x for me seems to be
> with
> 
> --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
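> 
> A rough sbt equivalent, if you'd rather pin those coordinates in a build
> file than pass --packages:
> 
>     libraryDependencies ++= Seq(
>       "org.apache.spark"  %% "spark-core"   % "1.6.1" % "provided",
>       "com.amazonaws"     %  "aws-java-sdk" % "1.7.4",
>       "org.apache.hadoop" %  "hadoop-aws"   % "2.7.2"
>     )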
> 
> Is this generally what people do? Is there a better way?
> 
> I realize this isn't entirely a Spark-specific problem, but as so many people
> seem to be using S3 with Spark, I imagine this community's faced the problem a
> lot.
> 
> Thanks!
> 
> - Everett
> 

