On 11 Jul 2015, at 19:20, Aaron Davidson <ilike...@gmail.com> wrote:
> Note that if you use multi-part upload, each part becomes one block, which allows for multiple concurrent readers. One would typically use fixed-size blocks that align with Spark's default HDFS block size (64 MB, I think) to ensure the reads are aligned.

Aah, I wasn't going to introduce that complication. As Aaron says, if you do multipart uploads to S3, each part does go into its own block. What we don't have in the S3 REST APIs is a way to determine the partition count, and hence the block size. Instead, the block size reported to Spark is simply the value of a constant set in the configuration.

If you are trying to go multipart for performance:

1. You need a consistent block size across all your datasets.
2. In your configuration, set fs.s3n.multipart.uploads.block.size == fs.s3n.block.size.

http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
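To make point 2 concrete, here is a sketch of a core-site.xml fragment that keeps the two sizes aligned. The 64 MB value (67108864 bytes) is illustrative, matching the default Spark/HDFS block size mentioned above; any consistent size would do, and enabling multipart uploads is shown on the assumption it isn't already on:

```
<!-- Sketch only: illustrative core-site.xml fragment for s3n multipart reads.
     The 64 MB (67108864-byte) value is an example; pick one size and use it
     consistently across all datasets. -->
<property>
  <name>fs.s3n.multipart.uploads.enabled</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3n.block.size</name>
  <value>67108864</value>
</property>
<property>
  <name>fs.s3n.multipart.uploads.block.size</name>
  <value>67108864</value>
</property>
```

With both values equal, the block size the filesystem reports to Spark matches the size of the parts actually uploaded, so read splits line up with the stored blocks.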