Sean Mackrory created HADOOP-13868:
--------------------------------------

             Summary: Configure multi-part copies and uploads separately
                 Key: HADOOP-13868
                 URL: https://issues.apache.org/jira/browse/HADOOP-13868
             Project: Hadoop Common
          Issue Type: Bug
            Reporter: Sean Mackrory
            Assignee: Sean Mackrory


I've been looking at a significant performance regression when writing to S3 from Spark 
that appears to have been introduced by HADOOP-12891.

In the Amazon SDK, the default threshold for multi-part copies is 320x the 
threshold for multi-part uploads (and the block size is 20x bigger), so I don't 
think it's wise for us to drive both from a single shared setting.
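
For reference, here's a quick way to see those SDK defaults for yourself. This is 
just a sketch against the 1.x SDK's TransferManagerConfiguration; the exact defaults 
may vary between SDK versions.

{code:java}
import com.amazonaws.services.s3.transfer.TransferManagerConfiguration;

public class PrintTransferDefaults {
  public static void main(String[] args) {
    // A freshly constructed configuration carries the SDK's default values.
    TransferManagerConfiguration tmc = new TransferManagerConfiguration();
    System.out.println("upload threshold = " + tmc.getMultipartUploadThreshold());
    System.out.println("upload part size = " + tmc.getMinimumUploadPartSize());
    System.out.println("copy threshold   = " + tmc.getMultipartCopyThreshold());
    System.out.println("copy part size   = " + tmc.getMultipartCopyPartSize());
    // With the defaults described above, the copy threshold is ~320x the upload
    // threshold and the copy part size is ~20x the upload part size.
  }
}
{code}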

I did some quick tests, and the sweet spot where multi-part copies start being 
faster seems to be around 512 MB. It wasn't as significant, but using 104857600 
bytes (100 MB, Amazon's default) for the block size was also slightly better.
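
Those values can be tried today with the existing shared properties. This is just 
how I'd set them programmatically; the same keys work in core-site.xml (values in 
bytes):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class S3AMultipartTuning {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The shared knobs: since HADOOP-12891 both uploads and copies pick these up.
    conf.setLong("fs.s3a.multipart.threshold", 512L * 1024 * 1024); // ~512 MB
    conf.setLong("fs.s3a.multipart.size", 104857600L);              // 100 MB
    System.out.println("threshold  = " + conf.getLong("fs.s3a.multipart.threshold", 0));
    System.out.println("block size = " + conf.getLong("fs.s3a.multipart.size", 0));
  }
}
{code}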

I propose we do the following two things; they're independent of each other.

(1) Split the configuration. Ideally, I'd like to have 
fs.s3a.multipart.copy.threshold and fs.s3a.multipart.upload.threshold (and 
corresponding properties for the block size). The open question is what to do 
with the existing fs.s3a.multipart.* properties: deprecate them, or keep them as 
a short-hand for configuring both that is overridden by the more specific 
properties? (A sketch of the latter follows after item 2.)

(2) Consider increasing the default values. In my tests, 256 MB seemed to be 
where multipart uploads came into their own, and 512 MB was where multipart 
copies started outperforming a single copy operation. I'd be interested to hear 
what other people have seen.
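
To make (1) concrete, here's a rough sketch of how the fallback could look. The 
fs.s3a.multipart.upload.threshold and fs.s3a.multipart.copy.threshold keys are the 
hypothetical new names proposed above (the block size variants would work the same 
way), and the defaults are just the numbers from (2), not a firm proposal:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class SplitMultipartConfig {
  // Hypothetical defaults taken from the numbers above, in bytes.
  static final long DEFAULT_UPLOAD_THRESHOLD = 256L * 1024 * 1024; // 256 MB
  static final long DEFAULT_COPY_THRESHOLD   = 512L * 1024 * 1024; // 512 MB

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The existing shared property stays as a short-hand for both...
    long shared = conf.getLong("fs.s3a.multipart.threshold", -1);
    // ...but the more specific (hypothetical) properties override it when set.
    long uploadThreshold = conf.getLong("fs.s3a.multipart.upload.threshold",
        shared > 0 ? shared : DEFAULT_UPLOAD_THRESHOLD);
    long copyThreshold = conf.getLong("fs.s3a.multipart.copy.threshold",
        shared > 0 ? shared : DEFAULT_COPY_THRESHOLD);
    System.out.println("upload threshold = " + uploadThreshold);
    System.out.println("copy threshold   = " + copyThreshold);
  }
}
{code}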



