seek() is very expensive on S3, even for short forward seeks. If your code does a lot of seeking, it will kill performance. (Forward seeks are better in s3a, which as of Hadoop 2.7 is now something safe to use, and in the S3 client that Amazon includes in EMR, but they are still sluggish.)
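One way to avoid seeks when you need several byte ranges from the same object is to sort the ranges and read through the gaps in a single forward pass. The sketch below illustrates the idea on an in-memory stream. `CountingReader`, `read_ranges_with_seeks` and `read_ranges_forward` are hypothetical names made up for this example, not part of any S3 or Hadoop client. It assumes the ranges do not overlap and the gaps are small enough that reading and discarding them is cheaper than a seek.

```python
import io


class CountingReader:
    """Hypothetical wrapper that counts seek() calls on a binary stream."""

    def __init__(self, raw):
        self.raw = raw
        self.seeks = 0

    def seek(self, pos):
        self.seeks += 1
        return self.raw.seek(pos)

    def read(self, n):
        return self.raw.read(n)


def read_ranges_with_seeks(stream, ranges):
    """Naive version: one seek per (offset, length) range, in caller order."""
    out = {}
    for offset, length in ranges:
        stream.seek(offset)
        out[(offset, length)] = stream.read(length)
    return out


def read_ranges_forward(stream, ranges):
    """Single forward pass: sort by offset, read and discard gaps, never seek.

    Assumes the stream is positioned at offset 0 and ranges do not overlap.
    """
    out = {}
    pos = 0
    for offset, length in sorted(ranges):
        if offset < pos:
            raise ValueError("overlapping ranges are not handled in this sketch")
        if offset > pos:
            skipped = stream.read(offset - pos)  # discard the gap instead of seeking
            pos += len(skipped)
        data = stream.read(length)
        out[(offset, length)] = data
        pos += len(data)
    return out


if __name__ == "__main__":
    payload = bytes(range(256)) * 4
    wanted = [(500, 10), (0, 4), (100, 20)]

    naive = CountingReader(io.BytesIO(payload))
    read_ranges_with_seeks(naive, wanted)

    forward = CountingReader(io.BytesIO(payload))
    read_ranges_forward(forward, wanted)

    print("seeks (naive):", naive.seeks)
    print("seeks (forward pass):", forward.seeks)
```

Whether this pays off depends on how large the gaps are. For widely scattered ranges, reading the gaps can cost more than the seeks it avoids.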
The other killers are:
- anything involving renaming files or directories
- copy operations
- listing lots of files

Finally, S3 is HDD-backed, with 1 file == 1 block. In HDFS, several processes can each read a different replica of the same block of a file (three, at the default replication factor), giving up to 3x the bandwidth. Disk bandwidth from an S3 object is shared by all readers, so the more readers, the worse the performance.

On 9 Jul 2015, at 14:31, Daniel Darabos <daniel.dara...@lynxanalytics.com> wrote:

I recommend testing it for yourself. Even if you have no application, you can just run the spark-ec2 script, log in, run spark-shell and try reading files from an S3 bucket and from hdfs://<master IP>:9000/. (This is the ephemeral HDFS cluster, which uses SSD.)

I just tested our application this way yesterday and found the SSD-based HDFS to outperform S3 by a factor of 2. I don't know the cause. It may be locality, like Akhil suggests, or SSD vs HDD (assuming S3 is HDD-backed). Or the HDFS client library and protocol are just better than the S3 versions (which are HTTP-based and use some 6-year-old libraries).

On Thu, Jul 9, 2015 at 9:54 AM, Sujee Maniyam <su...@sujee.net> wrote:

Latency is much higher for S3 (if that matters), and with HDFS you'd get data locality that will boost your app performance.

I did some light experimenting on this. See my presentation here for some benchmark numbers, etc.:
http://www.slideshare.net/sujee/hadoop-to-sparkv2
(from slide 34)

cheers
Sujee Maniyam (http://sujee.net | http://www.linkedin.com/in/sujeemaniyam)
teaching Spark (http://elephantscale.com/training/spark-for-developers/?utm_source=mailinglist&utm_medium=email&utm_campaign=signature)

On Wed, Jul 8, 2015 at 11:35 PM, Brandon White <bwwintheho...@gmail.com> wrote:

Are there any significant performance differences between reading text files from S3 and HDFS?
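For anyone who wants to try the "test it for yourself" suggestion from the thread, here is a rough sketch of a timing helper. `time_read` is a hypothetical name made up for this example. The commented-out calls assume a pyspark shell, where `sc` is the SparkContext the shell provides, and the bucket and path placeholders have to be filled in with your own data.

```python
import time


def time_read(label, read_fn, repeats=3):
    """Call read_fn `repeats` times; return (list of elapsed seconds, last result)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = read_fn()
        timings.append(time.perf_counter() - start)
    formatted = ", ".join(f"{t:.2f}s" for t in timings)
    print(f"{label}: {formatted} (result={result})")
    return timings, result


# Assumed usage inside a pyspark shell (placeholders are not real paths):
# time_read("s3", lambda: sc.textFile("s3n://<bucket>/<path>").count())
# time_read("hdfs", lambda: sc.textFile("hdfs://<master IP>:9000/<path>").count())
```

All timings are kept rather than just the best one, because repeat runs may be served from caches and the first (cold) read is often the number you care about.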