Hi Frank

We have thousands of small files. Each file is between 6 KB and maybe 100 KB.
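
A minimal pyspark sketch of one way to read objects this small without Spark
scheduling a task per file (the bucket and path names are made-up
placeholders, not anything from this thread):

    # Sketch only -- bucket and path names are placeholders.
    from pyspark import SparkContext

    sc = SparkContext(appName="consolidate-small-files")

    # wholeTextFiles returns (path, contents) pairs and packs many small
    # objects into each partition instead of one partition per file.
    pairs = sc.wholeTextFiles("s3n://my-bucket/input/", minPartitions=64)

    # Keep only the contents and stage a consolidated copy in HDFS, so
    # later jobs read a few large files instead of thousands of tiny ones.
    pairs.values().saveAsTextFile("hdfs:///staging/consolidated")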

Conductor looks interesting.

Andy

From:  Frank Austin Nothaft <fnoth...@berkeley.edu>
Date:  Tuesday, March 15, 2016 at 11:59 AM
To:  Andrew Davidson <a...@santacruzintegration.com>
Cc:  "user @spark" <user@spark.apache.org>
Subject:  Re: newbie HDFS S3 best practices

> Hard to say with #1 without knowing your application's characteristics; for
> #2, we use conductor <https://github.com/BD2KGenomics/conductor> with IAM
> roles and .boto/.aws/credentials files.
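
A minimal sketch of one way to keep keys out of the URL, in the spirit of
the suggestion above; reading from environment variables here stands in for
however the keys are actually stored (e.g., a credentials file), and the app
name is a placeholder. With IAM roles on the instances, no explicit
configuration is needed at all:

    # Sketch only: pass S3 credentials through the Spark/Hadoop
    # configuration rather than embedding them in the s3n:// URL.
    import os
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("s3-credentials-demo")
            # spark.hadoop.* settings are forwarded to the Hadoop config.
            .set("spark.hadoop.fs.s3n.awsAccessKeyId",
                 os.environ["AWS_ACCESS_KEY_ID"])
            .set("spark.hadoop.fs.s3n.awsSecretAccessKey",
                 os.environ["AWS_SECRET_ACCESS_KEY"]))
    sc = SparkContext(conf=conf)

    # URLs now stay free of secrets.
    rdd = sc.textFile("s3n://my-bucket/input/")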
> 
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
> 
>> On Mar 15, 2016, at 11:45 AM, Andy Davidson <a...@santacruzintegration.com>
>> wrote:
>> 
>> We use the spark-ec2 script to create AWS clusters as needed (we do not use
>> AWS EMR).
>> 1. Will we get better performance if we copy the data to HDFS before we run,
>> instead of reading directly from S3?
>> 2. What is a good way to move results from HDFS to S3?
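
A hedged sketch for question 2: Spark can write a finished result straight
to S3 itself, or a completed HDFS directory can be copied with distcp (all
paths and bucket names below are placeholders):

    # Sketch only -- paths are placeholders. Equivalent shell option:
    #   hadoop distcp hdfs:///output/results s3n://my-bucket/results/
    from pyspark import SparkContext

    sc = SparkContext(appName="hdfs-to-s3-copy")
    results = sc.textFile("hdfs:///output/results")
    results.saveAsTextFile("s3n://my-bucket/results/")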
>> 
>> 
>> It seems like there are many ways to bulk copy to S3. Many of them require us
>> to explicitly embed AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@ in the URL. This
>> seems like a bad idea.
>> 
>> What would you recommend?
>> 
>> Thanks
>> 
>> Andy
>> 
>> 
> 

