Note that if you use multi-part upload, each part becomes one block, which allows for multiple concurrent readers. One would typically use a fixed part size that matches the block size Spark picks up from its Hadoop configuration (64 MB, I think) so that reads are aligned.
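
To make the alignment idea concrete, here is a minimal sketch of how one might plan the byte range of each part before uploading. It makes no S3 calls. The function name `plan_parts` is hypothetical, and the 64 MiB default is only the figure guessed at above, so check the block size your own cluster is configured with. The two limits in the constants (5 MiB minimum for every part but the last, at most 10,000 parts per upload) are the documented S3 multipart limits.

```python
# Sketch: plan fixed-size multipart-upload parts aligned to a block size.
# Assumption: 64 MiB default, taken from the guess in the message above.
MIB = 1024 * 1024
S3_MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last
S3_MAX_PARTS = 10_000        # S3 limit on parts per multipart upload


def plan_parts(object_size, block_size=64 * MIB):
    """Return (part_number, first_byte, last_byte) tuples, one per part.

    Every part except possibly the last is exactly block_size bytes, so
    each part starts on a multiple of block_size.
    """
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if block_size < S3_MIN_PART_SIZE:
        raise ValueError("block_size is below the S3 minimum part size")

    n_parts = -(-object_size // block_size)  # ceiling division
    if n_parts > S3_MAX_PARTS:
        raise ValueError("too many parts; use a larger block_size")

    parts = []
    for i in range(n_parts):
        first = i * block_size
        last = min(first + block_size, object_size) - 1  # inclusive end
        parts.append((i + 1, first, last))  # S3 part numbers start at 1
    return parts
```

For a 200 MiB object this gives three full 64 MiB parts and one 8 MiB tail part. Each tuple can then be handed to whatever upload client you use as the byte range for that part number.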

On Sat, Jul 11, 2015 at 11:14 AM, Steve Loughran <ste...@hortonworks.com> wrote:

> seek() is very, very expensive on s3, even short forward seeks. If your
> code does a lot of, it will kill performance. (forward seeks are better in
> s3a, which with Hadoop 2.3 is now something safe to use, and in the s3
> client that Amazon include in EMR), but its still sluggish.
>
> The other killers are
> -anything involving renaming files or directories
> -copy operations
> -listing lots of files.
>
> Finally, S3 is HDD backed,1 file == 1 block. In HDFS while you can have >3
> processes reading different replicas of the same block of a file, giving
> 3x the bandwidth, disk bandwidth from an s3 object will be shared by all
> readers. The more readers: the worse performance
>
>
> On 9 Jul 2015, at 14:31, Daniel Darabos <daniel.dara...@lynxanalytics.com>
> wrote:
>
> I recommend testing it for yourself. Even if you have no application,
> you can just run the spark-ec2 script, log in, run spark-shell and try
> reading files from an S3 bucket and from hdfs://<master IP>:9000/. (This is
> the ephemeral HDFS cluster, which uses SSD.)
>
> I just tested our application this way yesterday and found the SSD-based
> HDFS to outperform S3 by a factor of 2. I don't know the cause. It may be
> locality like Akhil suggests, or SSD vs HDD (assuming S3 is HDD-backed). Or
> the HDFS client library and protocol are just better than the S3 versions
> (which is HTTP-based and uses some 6-year-old libraries).
>
> On Thu, Jul 9, 2015 at 9:54 AM, Sujee Maniyam <su...@sujee.net> wrote:
>
>> latency is much bigger for S3 (if that matters)
>> And with HDFS you'd get data-locality that will boost your app
>> performance.
>>
>> I did some light experimenting on this.
>> see my presentation here for some benchmark numbers ..etc
>> http://www.slideshare.net/sujee/hadoop-to-sparkv2
>> from slide# 34
>>
>> cheers
>> Sujee Maniyam (http://sujee.net |
>> http://www.linkedin.com/in/sujeemaniyam )
>> teaching Spark
>> <http://elephantscale.com/training/spark-for-developers/?utm_source=mailinglist&utm_medium=email&utm_campaign=signature>
>>
>>
>> On Wed, Jul 8, 2015 at 11:35 PM, Brandon White <bwwintheho...@gmail.com>
>> wrote:
>>
>>> Are there any significant performance differences between reading text
>>> files from S3 and hdfs?