Re: S3 vs HDFS

2015-07-12 Thread Steve Loughran

On 11 Jul 2015, at 19:20, Aaron Davidson ilike...@gmail.com wrote:

Note that if you use multi-part upload, each part becomes 1 block, which allows 
for multiple concurrent readers. One would typically use fixed-size block sizes 
which align with Spark's default HDFS block size (64 MB, I think) to ensure the 
reads are aligned.


aah, I wasn't going to introduce that complication.

As Aaron says, if you do multipart uploads to S3, you do get each part into its 
own block.

What the S3 REST APIs don't give us is a way to determine the part count, and 
hence the block size. Instead, the block size reported to Spark is simply a 
constant set in the configuration.


If you are trying to go multipart for performance:

1. You need to have a consistent block size across all your datasets.
2. In your configuration, set fs.s3n.multipart.uploads.block.size == 
fs.s3n.block.size.
http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
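As a minimal sketch of that configuration from the Spark side (the property names are from the hadoop-aws docs above; the 64 MB value is just an example, pick one size and use it everywhere):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3n-block-size"))

// One block size, used both for the multipart upload part size and for
// the block size that s3n reports back to Spark.
val blockSize = (64L * 1024 * 1024).toString  // 64 MB, example value

sc.hadoopConfiguration.set("fs.s3n.multipart.uploads.enabled", "true")
sc.hadoopConfiguration.set("fs.s3n.multipart.uploads.block.size", blockSize)
sc.hadoopConfiguration.set("fs.s3n.block.size", blockSize)
```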




Re: S3 vs HDFS

2015-07-11 Thread Aaron Davidson
Note that if you use multi-part upload, each part becomes 1 block, which
allows for multiple concurrent readers. One would typically use fixed-size
block sizes which align with Spark's default HDFS block size (64 MB, I
think) to ensure the reads are aligned.

On Sat, Jul 11, 2015 at 11:14 AM, Steve Loughran ste...@hortonworks.com
wrote:

  seek() is very, very expensive on s3, even short forward seeks. If your
 code does a lot of them, it will kill performance. (Forward seeks are better
 in s3a, which with Hadoop 2.7 is now something safe to use, and in the s3
 client that Amazon includes in EMR), but it's still sluggish.
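To make the seek() cost concrete, here is a minimal sketch against the Hadoop FileSystem API; the bucket and object names are made up:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI("s3n://my-bucket/"), new Configuration())
val in = fs.open(new Path("s3n://my-bucket/data/part-00000"))
val buf = new Array[Byte](1024 * 1024)

in.readFully(0L, buf)        // sequential read: a single GET, cheap
in.seek(256L * 1024 * 1024)  // long seek: may close and reopen the HTTP stream
in.read(buf)
in.close()
```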

  The other killers are
  -anything involving renaming files or directories
  -copy operations
  -listing lots of files.

  Finally, S3 is HDD backed, 1 file == 1 block. In HDFS you can have
 3 processes reading different replicas of the same block of a file, giving
 3x the bandwidth; disk bandwidth from an S3 object will be shared by all
 readers. The more readers, the worse the performance.
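A back-of-envelope version of that replication point, with assumed numbers rather than measurements:

```scala
val diskMBps = 100.0                 // assumed single-disk throughput
val readers  = 3
// HDFS: with 3 replicas, each of 3 readers can get a replica to itself.
val hdfsPerReader = diskMBps
// S3: 1 file == 1 block, so all readers share one object's bandwidth.
val s3PerReader = diskMBps / readers
println(s"HDFS: $hdfsPerReader MB/s per reader; S3: $s3PerReader MB/s per reader")
```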


  On 9 Jul 2015, at 14:31, Daniel Darabos daniel.dara...@lynxanalytics.com
 wrote:

  I recommend testing it for yourself. Even if you have no application,
 you can just run the spark-ec2 script, log in, run spark-shell and try
 reading files from an S3 bucket and from hdfs://<master IP>:9000/. (This is
 the ephemeral HDFS cluster, which uses SSD.)

  I just tested our application this way yesterday and found the SSD-based
 HDFS to outperform S3 by a factor of 2. I don't know the cause. It may be
 locality like Akhil suggests, or SSD vs HDD (assuming S3 is HDD-backed). Or
 the HDFS client library and protocol are just better than the S3 versions
 (which is HTTP-based and uses some 6-year-old libraries).

 On Thu, Jul 9, 2015 at 9:54 AM, Sujee Maniyam su...@sujee.net wrote:

 latency is much bigger for S3 (if that matters)
 And with HDFS you'd get data-locality that will boost your app
 performance.

  I did some light experimenting on this.
 see my presentation here for some benchmark numbers etc.
 http://www.slideshare.net/sujee/hadoop-to-sparkv2
  from slide #34

  cheers
  Sujee Maniyam (http://sujee.net |
 http://www.linkedin.com/in/sujeemaniyam )
  teaching Spark
 http://elephantscale.com/training/spark-for-developers/?utm_source=mailinglist&utm_medium=email&utm_campaign=signature


 On Wed, Jul 8, 2015 at 11:35 PM, Brandon White bwwintheho...@gmail.com
 wrote:

 Are there any significant performance differences between reading text
 files from S3 and hdfs?







Re: S3 vs HDFS

2015-07-09 Thread Sujee Maniyam
latency is much bigger for S3 (if that matters)
And with HDFS you'd get data-locality that will boost your app performance.

I did some light experimenting on this.
see my presentation here for some benchmark numbers etc.
http://www.slideshare.net/sujee/hadoop-to-sparkv2
from slide #34

cheers
Sujee Maniyam (http://sujee.net | http://www.linkedin.com/in/sujeemaniyam )
teaching Spark
http://elephantscale.com/training/spark-for-developers/?utm_source=mailinglist&utm_medium=email&utm_campaign=signature


On Wed, Jul 8, 2015 at 11:35 PM, Brandon White bwwintheho...@gmail.com
wrote:

 Are there any significant performance differences between reading text
 files from S3 and hdfs?



Re: S3 vs HDFS

2015-07-09 Thread Daniel Darabos
I recommend testing it for yourself. Even if you have no application, you
can just run the spark-ec2 script, log in, run spark-shell and try reading
files from an S3 bucket and from hdfs://<master IP>:9000/. (This is the
ephemeral HDFS cluster, which uses SSD.)
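A rough way to run that comparison inside spark-shell (the bucket, path, and master IP are placeholders):

```scala
// Crude wall-clock timing; run each a few times to warm caches.
def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(label + ": " + (System.nanoTime() - start) / 1e9 + " s")
  result
}

time("s3n")  { sc.textFile("s3n://my-bucket/my-data/*").count() }
time("hdfs") { sc.textFile("hdfs://<master IP>:9000/my-data/*").count() }
```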

I just tested our application this way yesterday and found the SSD-based
HDFS to outperform S3 by a factor of 2. I don't know the cause. It may be
locality like Akhil suggests, or SSD vs HDD (assuming S3 is HDD-backed). Or
the HDFS client library and protocol are just better than the S3 versions
(which is HTTP-based and uses some 6-year-old libraries).

On Thu, Jul 9, 2015 at 9:54 AM, Sujee Maniyam su...@sujee.net wrote:

 latency is much bigger for S3 (if that matters)
 And with HDFS you'd get data-locality that will boost your app performance.

 I did some light experimenting on this.
 see my presentation here for some benchmark numbers etc.
 http://www.slideshare.net/sujee/hadoop-to-sparkv2
 from slide #34

 cheers
 Sujee Maniyam (http://sujee.net | http://www.linkedin.com/in/sujeemaniyam
 )
 teaching Spark
 http://elephantscale.com/training/spark-for-developers/?utm_source=mailinglist&utm_medium=email&utm_campaign=signature


 On Wed, Jul 8, 2015 at 11:35 PM, Brandon White bwwintheho...@gmail.com
 wrote:

 Are there any significant performance differences between reading text
 files from S3 and hdfs?