On 11 Jul 2015, at 19:20, Aaron Davidson <ilike...@gmail.com> wrote:
> Note that if you use multi-part upload, each part becomes one block, which allows for multiple concurrent readers. One would typically use fixed-size blocks that align with Spark's default HDFS block size (64 MB, I think) to ensure the reads are aligned.

Aah, I wasn't going to introduce that complication. As Aaron says, if you do multipart uploads to S3, each part does go into its own block. What we don't have in the S3 REST APIs is a way to determine the partition count, and hence the block size. Instead, the block size reported to Spark is simply the value of a constant set in the configuration.

If you are trying to go multipart for performance:

1. You need a consistent block size across all your datasets.
2. In your configuration, set fs.s3n.multipart.uploads.block.size == fs.s3n.block.size.

http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
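To make point 2 concrete, here is a sketch of a core-site.xml fragment that keeps the two sizes aligned. The 64 MB value (67108864 bytes) is illustrative, matching the default Spark/HDFS block size mentioned above; any consistent size would do, and enabling multipart uploads is shown on the assumption it isn't already on:

```
<!-- Sketch only: illustrative core-site.xml fragment for s3n multipart reads.
     The 64 MB (67108864-byte) value is an example; pick one size and use it
     consistently across all datasets. -->
<property>
  <name>fs.s3n.multipart.uploads.enabled</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3n.block.size</name>
  <value>67108864</value>
</property>
<property>
  <name>fs.s3n.multipart.uploads.block.size</name>
  <value>67108864</value>
</property>
```

With both values equal, the block size the filesystem reports to Spark matches the size of the parts actually uploaded, so read splits line up with the stored blocks.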