[ https://issues.apache.org/jira/browse/HADOOP-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran resolved HADOOP-16189.
-------------------------------------
    Fix Version/s: 3.3.2
       Resolution: Done

AWS S3 transfer manager does this itself; we can see this from the audit traces.

> S3A copy/rename of large files to be parallelized as a multipart operation
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-16189
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16189
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Steve Loughran
>            Priority: Major
>             Fix For: 3.3.2
>
>
> AWS docs on
> [copying|https://docs.aws.amazon.com/AmazonS3/latest/dev/CopyingObjectsUsingAPIs.html]
> * file < 5GB: can be copied as a single operation
> * file > 5GB: you MUST use the multipart API.
> But even for files < 5GB, that's a really slow operation. And if HADOOP-16188
> is to be believed, there's not enough retrying.
> Even if the transfer manager does switch to multipart copies at some size,
> just as we do our writes in 32-64 MB blocks, we can do the same for file
> copy. Something like
> {code}
> l = len(src)
> if l < fs.s3a.block.size:
>   single copy
> else:
>   split file by blocks, initiate the upload, then execute each block copy as
>   an operation in the S3A thread pool; once all done: complete the operation.
> {code}
> + do retries on individual blocks copied, so a failure of one to copy doesn't
> force a retry of the whole upload.
> This is potentially more complex than it sounds, as
> * there's the need to track the ongoing copy operational state
> * handle failures (abort, etc)
> * use the if-modified/version headers to fail fast if the source file changes
> partway through the copy
> * if len(file)/fs.s3a.block.size > max-block-count, use a bigger block
> size
> * Maybe need to fall back to the classic operation
> Overall, what sounds simple could get complex fast, or at least become a bigger
> piece of code.
> Needs to have some PoC of speedup before attempting
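
For illustration only, the block-splitting and per-block retry idea in the issue description could be sketched roughly as below. This is not the S3A or AWS transfer manager code. The 64 MB default merely stands in for fs.s3a.block.size, MAX_PARTS reflects the 10,000-part limit S3 documents for multipart uploads (the issue only says "max-block-count"), and copy_part is a hypothetical callable that would copy one byte range of the source; no real SDK call is shown.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARTS = 10_000                     # S3's documented multipart part-count limit
DEFAULT_BLOCK_SIZE = 64 * 1024 * 1024  # assumed stand-in for fs.s3a.block.size


def plan_copy(length, block_size=DEFAULT_BLOCK_SIZE, max_parts=MAX_PARTS):
    """Return inclusive (first_byte, last_byte) ranges, or [] for a single copy."""
    if length < block_size:
        return []  # small file: one plain copy request is enough
    # Integer ceiling division: how many parts the current block size needs.
    if -(-length // block_size) > max_parts:
        # Too many parts: use a bigger block so the count fits the limit.
        block_size = -(-length // max_parts)
    return [(start, min(start + block_size, length) - 1)
            for start in range(0, length, block_size)]


def copy_with_retries(copy_part, ranges, attempts=3, workers=4):
    """Run copy_part(first, last) for each range in a thread pool.

    copy_part is a hypothetical callable supplied by the caller. Each range
    is retried on its own, so one failed block does not restart the others.
    """
    def run(byte_range):
        for attempt in range(1, attempts + 1):
            try:
                return copy_part(*byte_range)
            except IOError:
                if attempt == attempts:
                    raise  # give up; the caller should abort the upload

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps results in range order and re-raises the first failure.
        return list(pool.map(run, ranges))
```

With these definitions, plan_copy(10, 4) yields [(0, 3), (4, 7), (8, 9)], while plan_copy(3, 4) yields an empty list, meaning a single copy. A real implementation would still need the rest of the list above: initiating and completing (or aborting) the upload, and checking version/if-modified headers so a source change fails fast.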