[jira] [Resolved] (HADOOP-16189) S3A copy/rename of large files to be parallelized as a multipart operation

Steve Loughran (Jira) Mon, 02 Aug 2021 08:21:22 -0700


     [ https://issues.apache.org/jira/browse/HADOOP-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Steve Loughran resolved HADOOP-16189.
-------------------------------------
    Fix Version/s: 3.3.2
       Resolution: Done

The AWS S3 transfer manager already does this itself; the audit traces show it.

> S3A copy/rename of large files to be parallelized as a multipart operation
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-16189
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16189
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Steve Loughran
>            Priority: Major
>             Fix For: 3.3.2
>
>
> AWS docs on 
> [copying|https://docs.aws.amazon.com/AmazonS3/latest/dev/CopyingObjectsUsingAPIs.html]
> * file < 5GB, can do this as a single operation
> * file > 5GB you MUST use multipart API.
> But even for files < 5GB, that's a really slow operation. And if HADOOP-16188 
> is to be believed, there's not enough retrying.
> Even if the transfer manager does switch to multipart copies at some size, 
> just as we do our writes in 32-64 MB blocks, we can do the same for file 
> copy. Something like:
> {code}
> length = len(src)
> if length < fs.s3a.block.size:
>    # small file: one copy request is enough
>    single copy
> else:
>    # large file: multipart copy, one block per task
>    split file by blocks, initiate the upload, then execute each block copy as
>    an operation in the S3A thread pool; once all done: complete the operation.
> {code}
> Plus: retry individual block copies, so that the failure of one block 
> doesn't force a retry of the whole upload.
> This is potentially more complex than it sounds, as 
> * there's the need to track the ongoing copy operational state
> * handle failures (abort, etc)
> * use the if-modified/version headers to fail fast if the source file changes 
> partway through copy
> * if the len(file)/fs.s3a.block.size >  max-block-count, use a bigger block 
> size
> * Maybe need to fall back to the classic operation
> Overall, what sounds simple could get complex fast, or at least become a 
> bigger piece of code. It needs a PoC demonstrating the speedup before 
> attempting.
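
The block-splitting and per-block retry idea in the description above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the S3A or AWS transfer manager implementation: `copy_part` and `copy_whole` are hypothetical callbacks standing in for the real S3 requests, the default `max_parts` of 10,000 is assumed from S3's documented limit on parts per multipart upload (the "max-block-count" of the description), and byte ranges are inclusive, as in an HTTP range. Source-change checks and aborting the upload on failure are left out.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_copy_parts(length, block_size, max_parts=10_000):
    """Return (block_size, ranges); ranges are inclusive (first, last) byte pairs.

    If the file would need more than max_parts blocks, the block size is
    grown so that the part count fits.
    """
    if length <= 0:
        return block_size, []
    part_count = -(-length // block_size)      # integer ceiling division
    if part_count > max_parts:
        block_size = -(-length // max_parts)   # use a bigger block size
    ranges = [(start, min(start + block_size, length) - 1)
              for start in range(0, length, block_size)]
    return block_size, ranges


def copy_with_retries(copy_part, part_number, byte_range, attempts=3):
    """Run the hypothetical copy_part callback, retrying only this one part."""
    for attempt in range(1, attempts + 1):
        try:
            return copy_part(part_number, byte_range)
        except IOError:
            if attempt == attempts:
                raise  # out of attempts: the caller should abort the upload


def parallel_copy(length, block_size, copy_part, copy_whole, workers=4):
    """Single copy for small files, otherwise one pooled task per block."""
    if length < block_size:
        return [copy_whole()]
    _, ranges = plan_copy_parts(length, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(copy_with_retries, copy_part, number, rng)
                   for number, rng in enumerate(ranges, start=1)]
        # result() re-raises the failure of any part that exhausted its retries
        return [future.result() for future in futures]
```

In a real client the per-part results would be the part ETags needed to complete the multipart upload, and any exception escaping `parallel_copy` would trigger an abort of that upload.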



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

