If you have lots of small files, distcp should handle that well -- it's designed to distribute the transfer of files across the nodes in your cluster. Conductor looks interesting if you're trying to distribute the transfer of single, large file(s)... right?

-- Chris Miller

On Wed, Mar 16, 2016 at 4:43 AM, Andy Davidson <a...@santacruzintegration.com> wrote:

> Hi Frank
>
> We have thousands of small files. Each file is between 6K and maybe 100K.
>
> Conductor looks interesting
>
> Andy
>
> From: Frank Austin Nothaft <fnoth...@berkeley.edu>
> Date: Tuesday, March 15, 2016 at 11:59 AM
> To: Andrew Davidson <a...@santacruzintegration.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: newbie HDFS S3 best practices
>
> Hard to say with #1 without knowing your application's characteristics;
> for #2, we use conductor <https://github.com/BD2KGenomics/conductor> with
> IAM roles and .boto/.aws/credentials files.
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
>
> On Mar 15, 2016, at 11:45 AM, Andy Davidson <a...@santacruzintegration.com> wrote:
>
> We use the spark-ec2 script to create AWS clusters as needed (we do not
> use AWS EMR).
>
> 1. Will we get better performance if we copy data to HDFS before we run,
>    instead of reading directly from S3?
>
> 2. What is a good way to move results from HDFS to S3?
>
> It seems like there are many ways to bulk copy to S3. Many of them require
> that we explicitly embed the AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@ pair
> in the URL. This seems like a bad idea?
>
> What would you recommend?
>
> Thanks
>
> Andy
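As a rough sketch of the distcp route discussed above (bucket, paths, and map count below are hypothetical placeholders, not anything from the thread), an HDFS-to-S3 copy over the s3a connector boils down to one command, shown here built as an argv list:

```python
# Sketch: build a `hadoop distcp` invocation that copies Spark results
# from HDFS to S3 over the s3a connector. Bucket, paths, and the map
# count are placeholders for illustration only.
def distcp_command(src, dest, num_maps=20):
    """Return the argv for a distributed HDFS -> S3 copy.

    -m caps the number of parallel map tasks distcp launches; with
    thousands of small files, the files are spread across those tasks.
    Credentials come from the Hadoop configuration or an IAM role,
    never from the URL itself.
    """
    return ["hadoop", "distcp", "-m", str(num_maps), src, dest]

cmd = distcp_command("hdfs:///user/spark/results",
                     "s3a://my-results-bucket/results")
# subprocess.run(cmd, check=True)  # would launch the copy on the cluster
```

Because distcp runs as a MapReduce job, the per-file overhead is amortized across the cluster, which is why it copes with many small files better than a single-machine copy would.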
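On Andy's worry about embedding AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@ in the URL: Frank's suggestion of .boto/.aws/credentials files keeps keys out of URLs (and shell history). A minimal sketch of reading the standard credentials file and turning it into s3a settings might look like this -- the profile name and bucket are assumptions, and with IAM roles on EC2 no keys are needed at all:

```python
# Sketch: read AWS keys from the standard ~/.aws/credentials file and
# map them to Hadoop s3a properties, so they never appear in an s3a:// URL.
# The "default" profile name is an assumption; adjust to your setup.
import configparser
import os

def s3a_conf_from_credentials(path="~/.aws/credentials", profile="default"):
    """Return Hadoop s3a settings parsed from an AWS credentials file."""
    parser = configparser.ConfigParser()
    parser.read(os.path.expanduser(path))
    section = parser[profile]
    return {
        "fs.s3a.access.key": section["aws_access_key_id"],
        "fs.s3a.secret.key": section["aws_secret_access_key"],
    }

# These settings could then be handed to Spark as spark.hadoop.* properties,
# e.g.  --conf spark.hadoop.fs.s3a.access.key=...  on spark-submit,
# rather than baking the secret into every S3 path.
```

On instances launched with an IAM role attached, the s3a connector can pick up temporary credentials from the instance metadata, so even this file is unnecessary.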