I recommend testing it for yourself. Even if you have no application, you can just run the spark-ec2 script, log in, run spark-shell and try reading files from an S3 bucket and from hdfs://<master IP>:9000/. (This is the ephemeral HDFS cluster, which uses SSD.)
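For a rough comparison, something like the following could be pasted into the pyspark shell. This is only a sketch: `time_action` is a hypothetical helper (not part of Spark), the `s3n://` scheme and both paths are placeholders to fill in, and I'm assuming `count()` is a fair stand-in for "read the whole file". The helper takes the best of several runs to smooth out noise.

```python
import time

def time_action(label, action, runs=3):
    """Run `action` several times; print and return the best wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.time()
        action()  # for Spark, this should trigger a full read, e.g. rdd.count()
        timings.append(time.time() - start)
    best = min(timings)
    print("{}: best of {} runs = {:.2f}s".format(label, runs, best))
    return best

# Inside the pyspark shell, where `sc` is predefined (paths are placeholders):
# time_action("s3",   lambda: sc.textFile("s3n://<bucket>/<path>").count())
# time_action("hdfs", lambda: sc.textFile("hdfs://<master IP>:9000/<path>").count())
```

Use the same file for both reads (copy it into the ephemeral HDFS first), otherwise the numbers are not comparable.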
I just tested our application this way yesterday and found the SSD-based HDFS to outperform S3 by a factor of 2. I don't know the cause. It may be locality, as Akhil suggests, or SSD vs. HDD (assuming S3 is HDD-backed). Or the HDFS client library and protocol may simply be better than the S3 versions (which are HTTP-based and use some 6-year-old libraries).

On Thu, Jul 9, 2015 at 9:54 AM, Sujee Maniyam <su...@sujee.net> wrote:

> latency is much bigger for S3 (if that matters)
> And with HDFS you'd get data-locality that will boost your app performance.
>
> I did some light experimenting on this.
> see my presentation here for some benchmark numbers ..etc
> http://www.slideshare.net/sujee/hadoop-to-sparkv2
> from slide# 34
>
> cheers
> Sujee Maniyam (http://sujee.net | http://www.linkedin.com/in/sujeemaniyam )
> teaching Spark
> <http://elephantscale.com/training/spark-for-developers/?utm_source=mailinglist&utm_medium=email&utm_campaign=signature>
>
> On Wed, Jul 8, 2015 at 11:35 PM, Brandon White <bwwintheho...@gmail.com> wrote:
>
>> Are there any significant performance differences between reading text
>> files from S3 and hdfs?