Hi Teng,

This is quite surprising news to me, that people cannot use EMR in production because it is not open sourced; I suspect that even Werner is not aware of such a problem. Is EMRFS open sourced? I am also curious to know what HA stands for here.
Regards,
Gourav

On Thu, Jul 21, 2016 at 8:37 AM, Teng Qiu <teng...@gmail.com> wrote:

> there are several reasons that AWS users do (can) not use EMR, one
> point for us is that security compliance problem, EMR is totally not
> open sourced, we can not use it in production system. second is that
> EMR do not support HA yet.
>
> but to the original question from @Everett :
>
> -> Credentials and Hadoop Configuration
>
> as you said, best practice should be "rely on machine roles", they
> called IAM roles.
>
> we are using EMRFS impl for accessing s3, it supports IAM role-based
> access control well. you can take a look here:
> https://github.com/zalando/spark/tree/branch-1.6-zalando
>
> or simply use our docker image (Dockerfile on github:
> https://github.com/zalando/spark-appliance/tree/master/Dockerfile)
>
> docker run -d --net=host \
>   -e START_MASTER="true" \
>   -e START_WORKER="true" \
>   -e START_WEBAPP="true" \
>   -e START_NOTEBOOK="true" \
>   registry.opensource.zalan.do/bi/spark:1.6.2-6
>
>
> -> SDK and File System Dependencies
>
> as mentioned above, using EMRFS libs solved this problem:
>
> http://docs.aws.amazon.com//ElasticMapReduce/latest/ReleaseGuide/emr-fs.html
>
>
> 2016-07-21 8:37 GMT+02:00 Gourav Sengupta <gourav.sengu...@gmail.com>:
> > But that would mean you would be accessing data over internet increasing
> > data read latency, data transmission failures. Why are you not using EMR?
> >
> > Regards,
> > Gourav
> >
> > On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson
> > <ever...@nuna.com.invalid> wrote:
> >>
> >> Thanks, Andy.
> >>
> >> I am indeed often doing something similar, now -- copying data locally
> >> rather than dealing with the S3 impl selection and AWS credentials issues.
> >> It'd be nice if it worked a little easier out of the box, though!
> >>
> >>
> >> On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson
> >> <a...@santacruzintegration.com> wrote:
> >>>
> >>> Hi Everett
> >>>
> >>> I always do my initial data exploration and all our product development
> >>> in my local dev env. I typically select a small data set and copy it to my
> >>> local machine
> >>>
> >>> My main() has an optional command line argument ‘--runLocal’ Normally I
> >>> load data from either hdfs:/// or S3n:// . If the arg is set I read from
> >>> file:///
> >>>
> >>> Sometime I use a CLI arg ‘--dataFileURL’
> >>>
> >>> So in your case I would log into my data cluster and use “AWS s3 cp" to
> >>> copy the data into my cluster and then use “SCP” to copy the data from the
> >>> data center back to my local env.
> >>>
> >>> Andy
> >>>
> >>> From: Everett Anderson <ever...@nuna.com.INVALID>
> >>> Date: Tuesday, July 19, 2016 at 2:30 PM
> >>> To: "user @spark" <user@spark.apache.org>
> >>> Subject: Role-based S3 access outside of EMR
> >>>
> >>> Hi,
> >>>
> >>> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
> >>> FileSystem implementation for s3:// URLs and seems to install the necessary
> >>> S3 credentials properties, as well.
> >>>
> >>> Often, it's nice during development to run outside of a cluster even with
> >>> the "local" Spark master, though, which I've found to be more troublesome.
> >>> I'm curious if I'm doing this the right way.
> >>>
> >>> There are two issues -- AWS credentials and finding the right combination
> >>> of compatible AWS SDK and Hadoop S3 FileSystem dependencies.
> >>>
> >>> Credentials and Hadoop Configuration
> >>>
> >>> For credentials, some guides recommend setting AWS_SECRET_ACCESS_KEY and
> >>> AWS_ACCESS_KEY_ID environment variables or putting the corresponding
> >>> properties in Hadoop XML config files, but it seems better practice to rely
> >>> on machine roles and not expose these.
> >>>
> >>> What I end up doing is, in code, when not running on EMR, creating a
> >>> DefaultAWSCredentialsProviderChain and then installing the following
> >>> properties in the Hadoop Configuration using it:
> >>>
> >>> fs.s3.awsAccessKeyId
> >>> fs.s3n.awsAccessKeyId
> >>> fs.s3a.awsAccessKeyId
> >>> fs.s3.awsSecretAccessKey
> >>> fs.s3n.awsSecretAccessKey
> >>> fs.s3a.awsSecretAccessKey
> >>>
> >>> I also set the fs.s3.impl and fs.s3n.impl properties to
> >>> org.apache.hadoop.fs.s3a.S3AFileSystem to force them to use the S3A
> >>> implementation since people usually use "s3://" URIs.
> >>>
> >>> SDK and File System Dependencies
> >>>
> >>> Some special combination of the Hadoop version, AWS SDK version, and
> >>> hadoop-aws is necessary.
> >>>
> >>> One working S3A combination with Spark 1.6.1 + Hadoop 2.7.x for me seems
> >>> to be with
> >>>
> >>> --packages
> >>> com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
> >>>
> >>> Is this generally what people do? Is there a better way?
> >>>
> >>> I realize this isn't entirely a Spark-specific problem, but as so many
> >>> people seem to be using S3 with Spark, I imagine this community's faced the
> >>> problem a lot.
> >>>
> >>> Thanks!
> >>>
> >>> - Everett
> >>>
> >>
> >
>
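For readers landing on this thread later, here is a minimal sketch of the credential-installation approach Everett describes in the quoted message, assuming Spark 1.6 with Hadoop 2.7.x and the aws-java-sdk / hadoop-aws artifacts on the classpath. The property names and the S3A override are the ones listed above; the helper name and the bucket path are only illustrative:

    import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative helper: when not on EMR, resolve credentials via the default
    // provider chain (env vars, ~/.aws/credentials, instance profile, ...) and
    // install them into the Hadoop configuration, then force s3:// and s3n://
    // URIs onto the S3A implementation.
    def configureS3(sc: SparkContext): Unit = {
      val creds = new DefaultAWSCredentialsProviderChain().getCredentials
      val hadoopConf = sc.hadoopConfiguration

      // The six property names listed in the thread, for all three schemes.
      Seq("fs.s3", "fs.s3n", "fs.s3a").foreach { prefix =>
        hadoopConf.set(prefix + ".awsAccessKeyId", creds.getAWSAccessKeyId)
        hadoopConf.set(prefix + ".awsSecretAccessKey", creds.getAWSSecretKey)
      }

      hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
      hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    }

    val sc = new SparkContext(new SparkConf().setAppName("s3-local-dev"))
    configureS3(sc)
    val lines = sc.textFile("s3://some-bucket/some-prefix/")  // hypothetical path

A less code-heavy variant is to pass the same keys to spark-submit with the "spark.hadoop." prefix (e.g. spark.hadoop.fs.s3a.awsAccessKeyId), since Spark copies spark.hadoop.* properties into the Hadoop Configuration it creates.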
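And on the role-based side that Teng raises: when the JVM runs on an EC2 instance that carries an IAM instance profile, no keys need to be configured at all. A minimal sketch, assuming Hadoop 2.7.x's S3A implementation (whose credential chain, as far as I know, falls back to instance-profile credentials when no keys are set) and an illustrative bucket name:

    import org.apache.spark.{SparkConf, SparkContext}

    // No access keys are configured anywhere: on an instance with an IAM role,
    // the S3A credential chain resolves temporary credentials from the EC2
    // instance metadata service.
    val sc = new SparkContext(new SparkConf().setAppName("s3-role-based"))
    val count = sc.textFile("s3a://some-bucket/some-prefix/").count()  // hypothetical path
    println(s"read $count lines via the instance role")
    sc.stop()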