Re: S3 vs HDFS

2015-07-12 Thread Steve Loughran
On 11 Jul 2015, at 19:20, Aaron Davidson <ilike...@gmail.com> wrote: Note that if you use multi-part upload, each part becomes 1 block, which allows for multiple concurrent readers. One would typically use a fixed part size that matches Spark's default HDFS

Re: S3 vs HDFS

2015-07-11 Thread Aaron Davidson
Note that if you use multi-part upload, each part becomes 1 block, which allows for multiple concurrent readers. One would typically use a fixed part size that matches Spark's default HDFS block size (64 MB, I think) to ensure the reads are aligned. On Sat, Jul 11, 2015 at 11:14 AM,
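
To make the alignment point concrete, here is a minimal sketch of how an object would be cut into fixed-size parts. The helper name `part_ranges` is hypothetical, and the 64 MB default is only the figure guessed at in the message above, not a confirmed Spark or HDFS setting; pass whatever block size your cluster actually uses.

```python
# Hypothetical helper: compute the byte ranges of fixed-size upload parts.
# The 64 MB default mirrors the block size guessed in the message above;
# it is an assumption, not a verified default.
DEFAULT_PART_SIZE = 64 * 1024 * 1024


def part_ranges(object_size, part_size=DEFAULT_PART_SIZE):
    """Return a list of (start, end) byte offsets, end exclusive, one per part."""
    if object_size < 0:
        raise ValueError("object_size must be non-negative")
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [
        (start, min(start + part_size, object_size))
        for start in range(0, object_size, part_size)
    ]


if __name__ == "__main__":
    # A 150 MB object with 64 MB parts: two full parts and a shorter last one.
    for start, end in part_ranges(150 * 1024 * 1024):
        print(start, end)
```

If every part boundary falls on a multiple of the block size the reader uses, each reader's range maps onto whole parts instead of straddling two of them.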

S3 vs HDFS

2015-07-09 Thread Brandon White
Are there any significant performance differences between reading text files from S3 and from HDFS?

Re: S3 vs HDFS

2015-07-09 Thread Sujee Maniyam
Latency is much higher for S3 (if that matters), and with HDFS you get data locality, which will boost your app's performance. I did some light experimenting on this; see my presentation for some benchmark numbers, starting at slide 34: http://www.slideshare.net/sujee/hadoop-to-sparkv2 cheers

Re: S3 vs HDFS

2015-07-09 Thread Daniel Darabos
I recommend testing it for yourself. Even if you have no application, you can just run the spark-ec2 script, log in, run spark-shell, and try reading files from an S3 bucket and from hdfs://<master IP>:9000/. (This is the ephemeral HDFS cluster, which uses SSD.) I just tested our application this
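
One way to run that comparison is to time the same read several times against each store. The sketch below is a generic timing helper, not anything from this thread: `time_read` is a hypothetical name, and the bucket and file paths in the usage comment are placeholders.

```python
import time


def time_read(read_fn, repeats=3):
    """Call read_fn `repeats` times and return the elapsed seconds of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn()
        timings.append(time.perf_counter() - start)
    return timings


# Possible usage from a pyspark shell, where `sc` is the SparkContext.
# The paths are placeholders; substitute your own bucket and master address.
#   s3_times = time_read(lambda: sc.textFile("s3n://my-bucket/data.txt").count())
#   hdfs_times = time_read(lambda: sc.textFile("hdfs://<master IP>:9000/data.txt").count())
```

Repeating the read helps separate first-access latency from steady-state throughput, since the first run often includes connection setup.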