[ 
https://issues.apache.org/jira/browse/HADOOP-16792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009769#comment-17009769
 ] 

Steve Loughran commented on HADOOP-16792:
-----------------------------------------

I think this may be part of the long-neglected HADOOP-15603, "S3A to support 
configuring various AWS S3 client extended options".

That JIRA discusses the further options we need to cover, and notes that the 
other connections we make to AWS services also need to be configurable.

You're the first person to put their hand up and start doing this, so it 
would be best to take up that original JIRA and do this more broadly.

Regarding the PR: great to see you adding tests. Do look at the S3A testing 
documentation and be aware that we really are that strict *with everyone*: 
if you don't declare the S3 endpoint you tested against, there is no review.

I'm worried about how those timeouts affect write and copy operations. 
Copying a 10GB file, for example, can take a long time. We do not want the 
operation to time out: the repeated attempts will time out as well, and 
eventually the operation itself will fail. We already see this with distcp 
timing out. That means the testing will need to include the -Dscale tests, 
especially ITestS3AHugeFilesDiskBlocks with 
"fs.s3a.scale.test.huge.filesize" set to something big like "2G".

And: this will need documentation. Secret options hidden in the source code 
aren't that useful downstream.
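
To make the timeout concern above concrete, here is a minimal sketch in plain 
Java with no AWS SDK involved. The class name, method names and durations are 
all made up for illustration; this is not S3A code and not how the SDK 
implements its timeouts. It only shows the two cases under discussion: a 
straggling request that benefits from a timeout plus retry, and a long 
operation that a too-short timeout kills on every attempt.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration only: none of these names exist in Hadoop or the AWS SDK. */
public class RequestTimeoutSketch {

    /** Stand-in for a remote call that takes the same time on every attempt (e.g. a big copy). */
    static Callable<Void> simulatedOp(long millis) {
        return () -> {
            Thread.sleep(millis);
            return null;
        };
    }

    /** Stand-in for a call whose first attempt straggles and whose retries are quick. */
    static Callable<Void> straggler(long slowMillis, long fastMillis) {
        AtomicInteger calls = new AtomicInteger();
        return () -> {
            Thread.sleep(calls.getAndIncrement() == 0 ? slowMillis : fastMillis);
            return null;
        };
    }

    /**
     * Runs the task up to maxAttempts times, cancelling any attempt that
     * exceeds timeoutMillis. Returns the 1-based number of the attempt that
     * succeeded, or -1 if every attempt timed out.
     */
    static int runWithTimeout(Callable<Void> task, long timeoutMillis, int maxAttempts)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<Void> result = pool.submit(task);
                try {
                    result.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    return attempt;
                } catch (TimeoutException e) {
                    // Give up on this attempt and interrupt it before retrying.
                    result.cancel(true);
                }
            }
            return -1;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Straggling first request, quick retry: the timeout helps.
        System.out.println("straggler: " + runWithTimeout(straggler(300, 10), 100, 3));
        // Operation that always outlasts the timeout: every retry is wasted.
        System.out.println("long copy: " + runWithTimeout(simulatedOp(300), 100, 3));
    }
}
```

The second case is the one the scale tests are there to catch: if the 
timeout is shorter than the time a legitimate large upload or copy needs, 
retrying cannot help, and the caller only finds out after all attempts are 
used up.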

> Let s3 clients configure request timeout
> ----------------------------------------
>
>                 Key: HADOOP-16792
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16792
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.3.0
>            Reporter: Mustafa Iman
>            Priority: Major
>
> S3 does not guarantee latency. Every once in a while a request may straggle 
> and drive latency up for the greater procedure. In these cases, simply 
> timing-out the individual request is beneficial so that the client 
> application can retry. The retry tends to complete faster than the original 
> straggling request most of the time. Others experienced this issue too: 
> [https://arxiv.org/pdf/1911.11727.pdf] .
> S3 configuration already provides a timeout facility via 
> `ClientConfiguration#setTimeout`. Exposing this configuration is beneficial 
> for latency-sensitive applications. S3 client configuration is shared with 
> the DynamoDB client, which is also affected by unreliable worst-case latency.
>  
>  



