[ https://issues.apache.org/jira/browse/HADOOP-13761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16376983#comment-16376983 ]
Steve Loughran commented on HADOOP-13761:
-----------------------------------------

-1

I'd committed this locally and was doing the cherry-pick to branch-3.1, and got a test timeout in {{ITestS3AFailureHandling.testReadFileChanged}} on that branch:

{code}
java.lang.Exception: test timed out after 600000 milliseconds
  at java.lang.Thread.sleep(Native Method)
  at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:344)
  at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:256)
  at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:231)
  at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:181)
  at org.apache.hadoop.fs.s3a.S3AInputStream.lambda$lazySeek$1(S3AInputStream.java:327)
  at org.apache.hadoop.fs.s3a.S3AInputStream$$Lambda$23/570183744.execute(Unknown Source)
  at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$2(Invoker.java:190)
  at org.apache.hadoop.fs.s3a.Invoker$$Lambda$24/1791082625.execute(Unknown Source)
  at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
  at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:260)
  at org.apache.hadoop.fs.s3a.Invoker$$Lambda$13/1380113967.execute(Unknown Source)
  at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:317)
  at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:256)
  at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:188)
  at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:210)
  at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:320)
  at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:423)
  at org.apache.hadoop.fs.FSInputStream.read(FSInputStream.java:75)
  at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:92)
  at org.apache.hadoop.fs.s3a.ITestS3AFailureHandling.testReadFileChanged(ITestS3AFailureHandling.java:94)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
  at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
  at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
  at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
  at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
  at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
  at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
  at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{code}

This is the bit where read failures are being retried: are EOF exceptions being over-retried?

Switching back to my dev terminal and trunk, I managed to get a 400 on all those failure tests, which makes me think S3 Ireland may have started playing up: switching to London fixes it.

Anyway, assuming there is a problem with S3 in a region, is this recovery code going to keep trying for too long? That is: are we overdoing it with retry on retry? lazySeek does a retry with the chosen retryInvoker, and reopen does its own retry too. With retry on retry, things take so long to fail in a read that tests time out.

I think what needs to be done is either to not have that double retry, or to have the outer retry policy only handle FNFEs, and even then, only on S3Guard.
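To make the retry-on-retry amplification concrete, here is a rough, standalone sketch. This is *not* the actual Invoker/S3AInputStream code; the class, retry counts and sleeps are all invented for illustration. The second helper shows the idea of restricting the outer policy to FNFEs:

{code}
// Illustrative only: shows how nesting two retry loops multiplies the time
// to fail, and how an FNFE-only outer policy avoids re-retrying errors the
// inner layer has already retried. Not the real S3A retry code.
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryOnRetrySketch {

  /** Simple bounded retry: up to {@code attempts} tries with a fixed sleep between them. */
  static <T> T retry(String name, int attempts, long sleepMillis,
      Callable<T> operation) throws Exception {
    Exception last = null;
    for (int i = 1; i <= attempts; i++) {
      try {
        return operation.call();
      } catch (Exception e) {
        last = e;
        System.out.printf("%s: attempt %d failed: %s%n", name, i, e);
        if (i < attempts) {
          Thread.sleep(sleepMillis);
        }
      }
    }
    throw last;
  }

  /**
   * Outer retry restricted to FileNotFoundException: anything else (e.g. an
   * IOException the inner layer has already retried) is rethrown immediately,
   * so failures surface quickly instead of being retried a second time.
   */
  static <T> T retryOnlyFNFE(int attempts, long sleepMillis,
      Callable<T> operation) throws Exception {
    Exception last = null;
    for (int i = 1; i <= attempts; i++) {
      try {
        return operation.call();
      } catch (FileNotFoundException e) {
        last = e;
        if (i < attempts) {
          Thread.sleep(sleepMillis);
        }
      }
    }
    throw last;
  }

  /** Stand-in for a reopen() that keeps failing, as if the region were having problems. */
  static byte[] reopen() throws IOException {
    throw new IOException("simulated failure from S3");
  }

  public static void main(String[] args) throws Exception {
    long start = System.currentTimeMillis();
    try {
      // Outer retry (as in lazySeek) wrapping an inner retry (as in reopen):
      // up to 5 x 5 = 25 reopen attempts, with sleeps between them, before
      // the read finally fails -- long enough for a test to hit its timeout.
      retry("lazySeek", 5, 1000,
          () -> retry("reopen", 5, 1000, RetryOnRetrySketch::reopen));
    } catch (Exception expected) {
      System.out.printf("nested retries: failed after %d ms%n",
          System.currentTimeMillis() - start);
    }

    start = System.currentTimeMillis();
    try {
      // Same inner retry, but the outer layer only retries FNFEs, so the
      // IOException from reopen() is rethrown after one inner cycle.
      retryOnlyFNFE(5, 1000,
          () -> retry("reopen", 5, 1000, RetryOnRetrySketch::reopen));
    } catch (Exception expected) {
      System.out.printf("FNFE-only outer retry: failed after %d ms%n",
          System.currentTimeMillis() - start);
    }
  }
}
{code}

With those made-up numbers the nested version burns through up to 25 failing reopen attempts plus sleeps before the read gives up, while the FNFE-restricted outer layer gives up after one inner cycle; if the real policies also back off between attempts, the multiplication is worse still.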
> S3Guard: implement retries for DDB failures and throttling; translate exceptions
> ---------------------------------------------------------------------------------
>
>                 Key: HADOOP-13761
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13761
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.0.0-beta1
>            Reporter: Aaron Fabbri
>            Assignee: Aaron Fabbri
>            Priority: Blocker
>         Attachments: HADOOP-13761-004-to-005.patch, HADOOP-13761-005-to-006-approx.diff.txt, HADOOP-13761-005.patch, HADOOP-13761-006.patch, HADOOP-13761-007.patch, HADOOP-13761-008.patch, HADOOP-13761-009.patch, HADOOP-13761-010.patch, HADOOP-13761-010.patch, HADOOP-13761-011.patch, HADOOP-13761.001.patch, HADOOP-13761.002.patch, HADOOP-13761.003.patch, HADOOP-13761.004.patch
>
>
> Following the S3AFileSystem integration patch in HADOOP-13651, we need to add retry logic.
>
> In HADOOP-13651, I added TODO comments in most of the places retry loops are needed, including:
> - open(path). If MetadataStore reflects recent create/move of file path, but we fail to read it from S3, retry.
> - delete(path). If deleteObject() on S3 fails, but MetadataStore shows the file exists, retry.
> - rename(src,dest). If source path is not visible in S3 yet, retry.
> - listFiles(). Skip for now. Not currently implemented in S3Guard. I will create a separate JIRA for this as it will likely require interface changes (i.e. prefix or subtree scan).
>
> We may miss some cases initially and we should do failure injection testing to make sure we're covered. Failure injection tests can be a separate JIRA to make this easier to review.
>
> We also need basic configuration parameters around retry policy. There should be a way to specify maximum retry duration, as some applications would prefer to receive an error eventually, than waiting indefinitely. We should also be keeping statistics when inconsistency is detected and we enter a retry loop.
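(Appended for illustration of the "maximum retry duration" and statistics points in the description above: a minimal sketch of a duration-bounded retry loop. The class name, config key and counter below are invented for this sketch and are not the real S3A/S3Guard retry API.)

{code}
// Purely illustrative: a retry loop bounded by a total time budget, so callers
// get an error eventually instead of waiting indefinitely. The class name,
// config key and statistic are invented, not the real S3A/S3Guard API.
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicLong;

public class BoundedRetrySketch {

  // Hypothetical config key for the total retry time budget.
  public static final String RETRY_MAX_DURATION_MS =
      "fs.s3a.sketch.retry.max.duration.ms";

  // Hypothetical statistic: how often a retry loop was entered after an
  // inconsistency was detected.
  static final AtomicLong RETRY_LOOPS_ENTERED = new AtomicLong();

  static <T> T retryUpTo(long maxDurationMillis, long sleepMillis,
      Callable<T> operation) throws Exception {
    long deadline = System.currentTimeMillis() + maxDurationMillis;
    int attempts = 0;
    while (true) {
      attempts++;
      try {
        return operation.call();
      } catch (IOException e) {
        if (attempts == 1) {
          // Count the loop once, when inconsistency is first detected.
          RETRY_LOOPS_ENTERED.incrementAndGet();
        }
        // Give up once the time budget is spent.
        if (System.currentTimeMillis() + sleepMillis > deadline) {
          throw new IOException("still failing after " + attempts
              + " attempts within " + maxDurationMillis + " ms", e);
        }
        Thread.sleep(sleepMillis);
      }
    }
  }
}
{code}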