On 11 Jul 2015, at 19:20, Aaron Davidson <ilike...@gmail.com> wrote:
Note that if you use multi-part upload, each part becomes one block, which
allows for multiple concurrent readers. One would typically use a fixed part
size that aligns with Spark's default HDFS block size (64 MB, I think) to
ensure the reads are aligned.
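For concreteness, here's a minimal spark-shell sketch of that alignment,
assuming the s3a connector is on the classpath (the bucket name and output
path are hypothetical):

// Set the S3 multipart part size to match a 64 MB HDFS block size,
// so each uploaded part maps to one block and can be read concurrently.
val blockSize = 64L * 1024 * 1024
sc.hadoopConfiguration.set("fs.s3a.multipart.size", blockSize.toString)
sc.hadoopConfiguration.set("fs.s3a.multipart.threshold", blockSize.toString)

// Writes larger than the threshold will now go out as 64 MB parts.
sc.parallelize(1 to 1000000).map(_.toString).saveAsTextFile("s3a://my-bucket/aligned-output")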
On Sat, Jul 11, 2015 at 11:14 AM,
Are there any significant performance differences between reading text
files from S3 and HDFS?
Latency is much higher for S3 (if that matters).
And with HDFS you'd get data locality, which will boost your app's performance.
I did some light experimenting on this.
See my presentation here for some benchmark numbers, etc.:
http://www.slideshare.net/sujee/hadoop-to-sparkv2
starting at slide 34.
cheers
I recommend testing it for yourself. Even if you have no application, you
can just run the spark-ec2 script, log in, run spark-shell, and try reading
files from an S3 bucket and from hdfs://<master-IP>:9000/. (This is the
ephemeral HDFS cluster, which uses SSD.)
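If you want rough numbers from that test, a quick comparison in spark-shell
might look like this (the paths and the timing helper are hypothetical;
replace <master-IP> with your cluster's master address):

// Crude wall-clock timer for a single Spark action.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

// Same data on two storage backends; count() forces a full scan.
time("S3 read")   { sc.textFile("s3a://my-bucket/data/*.txt").count() }
time("HDFS read") { sc.textFile("hdfs://<master-IP>:9000/data/*.txt").count() }

Note this measures end-to-end scan time only; run each a few times, since
the first read can be dominated by connection setup and caching effects.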
I just tested our application this