Re: S3 vs HDFS
On 11 Jul 2015, at 19:20, Aaron Davidson <ilike...@gmail.com> wrote:

> Note that if you use multi-part upload, each part becomes 1 block, which allows for multiple concurrent readers. One would typically use fixed-size blocks which align with Spark's default HDFS block size (64 MB, I think) to ensure the reads are aligned.

Aah, I wasn't going to introduce that complication. As Aaron says, if you do multipart uploads to S3, you do get each part into its own block. What we don't have in the S3 REST APIs is a way of determining the partition count, and hence the block size. Instead, the block size reported to Spark is simply the value of a constant set in the configuration.

If you are trying to go multipart for performance:

1. You need a consistent block size across all your datasets.
2. In your configuration, fs.s3n.multipart.uploads.block.size == fs.s3n.block.size

http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
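[Editor's note: a minimal sketch of the alignment rule above, assuming a 64 MB part size to match the HDFS default. If every dataset is uploaded with the same fixed part size and fs.s3n.block.size is set to that same value, the split boundaries Spark derives from the reported block size coincide with the physical part boundaries, so each reader's split maps onto one uploaded part.]

```python
# Toy sketch (not Spark/Hadoop code): why the two config values should match.
# The 64 MB figure is an assumption, chosen to match Spark's default HDFS block size.

BLOCK_SIZE = 64 * 1024 * 1024  # bytes; fs.s3n.multipart.uploads.block.size == fs.s3n.block.size

def split_offsets(file_size: int, block_size: int = BLOCK_SIZE):
    """Start offsets of the read splits derived from the reported block size.
    When uploads use the same fixed part size, these are also the part boundaries,
    so concurrent readers never straddle two parts."""
    return list(range(0, file_size, block_size))

# A 200 MB object uploaded in 64 MB parts: four parts, four aligned splits.
print(split_offsets(200 * 1024 * 1024))
# [0, 67108864, 134217728, 201326592]
```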
Re: S3 vs HDFS
Note that if you use multi-part upload, each part becomes 1 block, which allows for multiple concurrent readers. One would typically use fixed-size blocks which align with Spark's default HDFS block size (64 MB, I think) to ensure the reads are aligned.

On Sat, Jul 11, 2015 at 11:14 AM, Steve Loughran <ste...@hortonworks.com> wrote:

> seek() is very, very expensive on S3, even for short forward seeks. If your code does a lot of it, it will kill performance. (Forward seeks are better in s3a, which with Hadoop 2.7 is now something safe to use, and in the S3 client that Amazon includes in EMR, but it's still sluggish.) The other killers are:
>
> - anything involving renaming files or directories
> - copy operations
> - listing lots of files
>
> Finally, S3 is HDD-backed, and 1 file == 1 block. In HDFS you can have 3 processes reading different replicas of the same block of a file, giving 3x the bandwidth; disk bandwidth from an S3 object is shared by all readers. The more readers, the worse the performance.
>
> On 9 Jul 2015, at 14:31, Daniel Darabos <daniel.dara...@lynxanalytics.com> wrote:
>
>> I recommend testing it for yourself. Even if you have no application, you can just run the spark-ec2 script, log in, run spark-shell, and try reading files from an S3 bucket and from hdfs://<master IP>:9000/. (This is the ephemeral HDFS cluster, which uses SSD.) I just tested our application this way yesterday and found the SSD-based HDFS to outperform S3 by a factor of 2. I don't know the cause. It may be locality, as Akhil suggests, or SSD vs HDD (assuming S3 is HDD-backed). Or the HDFS client library and protocol are just better than the S3 versions (which are HTTP-based and use some 6-year-old libraries).
>>
>> On Thu, Jul 9, 2015 at 9:54 AM, Sujee Maniyam <su...@sujee.net> wrote:
>>
>>> Latency is much bigger for S3 (if that matters). And with HDFS you'd get data locality that will boost your app performance. I did some light experimenting on this.
>>> See my presentation here for some benchmark numbers etc.: http://www.slideshare.net/sujee/hadoop-to-sparkv2 (from slide #34).
>>>
>>> cheers
>>> Sujee Maniyam (http://sujee.net | http://www.linkedin.com/in/sujeemaniyam)
>>> teaching Spark: http://elephantscale.com/training/spark-for-developers/?utm_source=mailinglist&utm_medium=email&utm_campaign=signature
>>>
>>> On Wed, Jul 8, 2015 at 11:35 PM, Brandon White <bwwintheho...@gmail.com> wrote:
>>>
>>>> Are there any significant performance differences between reading text files from S3 and hdfs?