Re: newbie HDFS S3 best practices

2016-03-16 Thread Chris Miller
Date: Tuesday, March 15, 2016 at 11:59 AM > To: Andrew Davidson <a...@santacruzintegration.com> > Cc: "user @spark" <user@spark.apache.org> > Subject: Re: newbie HDFS S3 best practices > > Hard to say with #1 without knowing your application’s characteristics;

Re: newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
ser @spark" <user@spark.apache.org> Subject: Re: newbie HDFS S3 best practices > Hard to say with #1 without knowing your application’s characteristics; for > #2, we use conductor <https://github.com/BD2KGenomics/conductor> with IAM > roles, .boto/.aws/credentia

Re: newbie HDFS S3 best practices

2016-03-15 Thread Frank Austin Nothaft
Hard to say with #1 without knowing your application’s characteristics; for #2, we use conductor with IAM roles, .boto/.aws/credentials files. Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 > On Mar 15, 2016, at

newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
We use the spark-ec2 script to create AWS clusters as needed (we do not use AWS EMR). 1. Will we get better performance if we copy data to HDFS before we run, instead of reading directly from S3? 2. What is a good way to move results from HDFS to S3? It seems like there are many ways to bulk copy