[ 
https://issues.apache.org/jira/browse/HADOOP-16792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022944#comment-17022944
 ] 

Steve Loughran commented on HADOOP-16792:
-----------------------------------------

committed to trunk

h2. Mustafa's scale test report from the GitHub PR

I ran ITestS3AHugeFilesDiskBlocks#test_010_CreateHugeFile with several 
combinations of file size, partition size, and request timeout.

The first experiments used the default file size and partition size for huge 
files. For the first run I set the request timeout to 1 ms. The test file 
system failed to initialize, because the verifyBuckets call at startup timed 
out repeatedly. That call is retried inside the AWS SDK up to 
`com.amazonaws.ClientConfiguration#maxErrorRetry` times; this value is 
configurable from the Hadoop side via the property `fs.s3a.attempts.maximum`. 
All of these retries are opaque to Hadoop. At the end of the retry cycle, the 
AWS SDK returns the failure to Hadoop's Invoker, which then evaluates whether 
to retry the operation according to its configured retry policies. I saw that 
the verifyBuckets call was not retried at the Invoker level.
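A minimal sketch of how those two layers are wired on the client-construction 
side, assuming the property name `fs.s3a.connection.request.timeout` for the 
new setting and a default of 20 for `fs.s3a.attempts.maximum`; this is not the 
actual S3AFileSystem code:

{code}
import com.amazonaws.ClientConfiguration;
import org.apache.hadoop.conf.Configuration;

public class RequestTimeoutSketch {
  // Sketch only, not the real S3AFileSystem wiring. The property name
  // fs.s3a.connection.request.timeout and the default of 20 are assumptions.
  static ClientConfiguration buildAwsConf(Configuration conf) {
    ClientConfiguration awsConf = new ClientConfiguration();
    // SDK-internal retry count; these retries are opaque to Hadoop's Invoker.
    awsConf.setMaxErrorRetry(conf.getInt("fs.s3a.attempts.maximum", 20));
    // Per-request timeout in milliseconds; 1 ms reproduces the
    // verifyBuckets initialization failure described above.
    awsConf.setRequestTimeout(
        conf.getInt("fs.s3a.connection.request.timeout", 1));
    return awsConf;
  }
}
{code}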

In a follow-up experiment, I set the request timeout to 200 ms, which is long 
enough for the verifyBuckets call to succeed but short enough that multipart 
uploads fail. In these cases, the AWS SDK again retries the HTTP requests up 
to `maxErrorRetry` times. Once an HTTP request has failed `maxErrorRetry` 
times, the Invoker's retry mechanism kicks in. I observed the Invoker retrying 
these operations up to `fs.s3a.retry.limit` times, conforming to the 
configured exponential back-off retry policy. After all these 
`fs.s3a.retry.limit` * `maxErrorRetry` attempts, the Invoker bubbles up an 
AWSClientIOException to the user, as shown below:

{code}
org.apache.hadoop.fs.s3a.AWSClientIOException: upload part on 
tests3ascale/disk/hugefile: com.amazonaws.SdkClientException: Unable to execute 
HTTP request: Request did not complete before the request timeout 
configuration.: Unable to execute HTTP request: Request did not complete before 
the request timeout configuration.
        at 
org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:205)
        at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:112)
        at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$4(Invoker.java:315)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:407)
        at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:311)
{code}
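To make the retry arithmetic concrete, a worked example using the defaults 
that, as far as I can tell, apply (20 for `fs.s3a.attempts.maximum`, 7 for 
`fs.s3a.retry.limit`); treat both numbers as assumptions:

{code}
// Illustrative arithmetic only. Each Invoker-level attempt triggers a
// full SDK-internal retry cycle, so the two counts multiply.
int maxErrorRetry = 20;  // fs.s3a.attempts.maximum -> ClientConfiguration#maxErrorRetry
int retryLimit = 7;      // fs.s3a.retry.limit, applied at the Invoker level
int worstCaseAttempts = maxErrorRetry * retryLimit;  // up to 140 HTTP attempts
// Only after all of these fail does the AWSClientIOException above
// surface to the caller.
{code}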

Later, I ran the test with a 256M file size and a 32M partition size, with the 
request timeout set to 5s. My goal was to induce a few retries through the 
short request timeout while still letting the upload complete thanks to those 
retries. That worked: some part uploads timed out, but they were retried and 
the overall upload finished successfully. The test still failed, because it 
also expects `TRANSFER_PART_FAILED_EVENT` to be 0, which is obviously not the 
case here: some transfers failed and were then retried. I checked S3 and 
verified that the file was there. I also verified that the temporary partition 
files were cleared from my local drive.
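For anyone reproducing that run, a sketch of the configuration it corresponds 
to. The `fs.s3a.scale.test.huge.*` names are the standard scale-test 
properties; the request-timeout property name and its time-suffix syntax are 
again assumptions:

{code}
import org.apache.hadoop.conf.Configuration;

// Sketch of the 256M file / 32M partition / 5s timeout run described above.
Configuration conf = new Configuration();
conf.set("fs.s3a.scale.test.huge.filesize", "256M");
conf.set("fs.s3a.scale.test.huge.partitionsize", "32M");
// Assumed name and format for the new per-request timeout property.
conf.set("fs.s3a.connection.request.timeout", "5s");
{code}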

When I ran the same experiment with an 8GB file and 128M partitions but a 
small request timeout, the test failed because the uploads could not complete.

I also ran a soak test with 8GB files and a large request timeout. This passed 
as expected, because the timeout value was high enough to let the uploads 
complete.


> Let s3 clients configure request timeout
> ----------------------------------------
>
>                 Key: HADOOP-16792
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16792
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.3.0
>            Reporter: Mustafa Iman
>            Assignee: Mustafa Iman
>            Priority: Major
>
> S3 does not guarantee latency. Every once in a while a request may straggle 
> and drive up latency for the larger operation. In these cases, simply 
> timing out the individual request is beneficial so that the client 
> application can retry. The retry tends to complete faster than the original 
> straggling request most of the time. Others have experienced this issue too: 
> [https://arxiv.org/pdf/1911.11727.pdf] .
> The S3 client configuration already provides a timeout facility via 
> `ClientConfiguration#setRequestTimeout`. Exposing this configuration is 
> beneficial for latency-sensitive applications. The S3 client configuration 
> is shared with the DynamoDB client, which is also affected by unreliable 
> worst-case latency.
>  
>  


