[ https://issues.apache.org/jira/browse/HADOOP-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866128#comment-15866128 ]
Sean Mackrory commented on HADOOP-13868:
----------------------------------------

Just pinging on this - I'd like to resolve it soon.

> New defaults for S3A multi-part configuration
> ---------------------------------------------
>
>                 Key: HADOOP-13868
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13868
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.7.0, 3.0.0-alpha1
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>         Attachments: HADOOP-13868.001.patch, HADOOP-13868.002.patch,
>                      optimizing-multipart-s3a.sh
>
>
> I've been looking at a big performance regression when writing to S3 from
> Spark that appears to have been introduced with HADOOP-12891.
>
> In the Amazon SDK, the default threshold for multi-part copies is 320x the
> threshold for multi-part uploads (and the block size is 20x bigger), so I
> don't think it's necessarily wise for us to have them be the same.
>
> I did some quick tests and it seems to me the sweet spot when multi-part
> copies start being faster is around 512MB. It wasn't as significant, but
> using 104857600 (Amazon's default) for the blocksize was also slightly better.
>
> I propose we do the following, although they're independent decisions:
>
> (1) Split the configuration. Ideally, I'd like to have
> fs.s3a.multipart.copy.threshold and fs.s3a.multipart.upload.threshold (and
> corresponding properties for the block size). But then there's the question
> of what to do with the existing fs.s3a.multipart.* properties. Deprecation?
> Leave it as a short-hand for configuring both (that's overridden by the more
> specific properties?).
>
> (2) Consider increasing the default values. In my tests, 256 MB seemed to be
> where multipart uploads came into their own, and 512 MB was where multipart
> copies started outperforming the alternative. Would be interested to hear
> what other people have seen.
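
For anyone weighing option (1), here is a minimal sketch of the "short-hand overridden by the more specific properties" lookup order, written in Python only for brevity. It is not the patch and not how S3A currently behaves. The per-operation property names are the ones proposed in the description, the shared key stands in for the existing fs.s3a.multipart.* threshold setting, and the fallback defaults are simply the 256 MB and 512 MB figures from the quick tests. The helper name resolve_threshold and the plain dict used as a configuration are hypothetical.

```python
MB = 1024 * 1024

# Assumed defaults, taken from the test figures quoted in the issue
# (256 MB for uploads, 512 MB for copies). Not committed values.
PROPOSED_DEFAULTS = {
    "upload": 256 * MB,
    "copy": 512 * MB,
}

# Existing shared property, treated here as a short-hand for both operations.
SHARED_KEY = "fs.s3a.multipart.threshold"


def resolve_threshold(conf, operation):
    """Return the multipart threshold in bytes for 'upload' or 'copy'.

    Lookup order: operation-specific property, then the shared
    short-hand property, then the assumed default.
    """
    if operation not in PROPOSED_DEFAULTS:
        raise ValueError(f"unknown operation: {operation!r}")
    specific_key = f"fs.s3a.multipart.{operation}.threshold"
    if specific_key in conf:
        return int(conf[specific_key])
    if SHARED_KEY in conf:
        return int(conf[SHARED_KEY])
    return PROPOSED_DEFAULTS[operation]


if __name__ == "__main__":
    # Shared short-hand set, copy threshold overridden separately.
    conf = {
        "fs.s3a.multipart.threshold": str(128 * MB),
        "fs.s3a.multipart.copy.threshold": str(512 * MB),
    }
    print("upload:", resolve_threshold(conf, "upload"))
    print("copy:  ", resolve_threshold(conf, "copy"))
```

With this ordering, existing deployments that only set the shared property keep their current behaviour for both operations, while new deployments can tune copies and uploads independently. The same lookup would apply to the corresponding block-size properties.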