[ https://issues.apache.org/jira/browse/HADOOP-13761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383164#comment-16383164 ]
Aaron Fabbri commented on HADOOP-13761:
---------------------------------------

v12 patch attached, which eliminates nested retries. Changes from v11:
- reopen(): RetryTranslated -> OnceTranslated. Use once() instead of retry(); all callers (lazySeek() and the callers of onReadFailure()) already have their own higher-level retries (see the sketch below).
- onReadFailure(): RetryTranslated -> OnceTranslated, because reopen() no longer retries internally.
- Move the lazySeek() in read() above the retried lambda; this eliminates a nested retry and matches what read(b, off, len) does.
- Remove the reopen() call in read(b, off, len) after the wrapped stream throws EOF. *This was existing code I changed*: I don't understand why you'd reopen() on EOF, [~ste...@apache.org]?

Tested in us-west-2. I was able to reproduce the test timeout in ITestS3AFailureHandling, which is now passing for me. Running on little sleep, but I wanted to get a patch posted in time for European Friday code review.
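To make the retry layering concrete, here is a minimal standalone sketch (MiniInvoker, its fixed maxAttempts, and the attempt counting are invented for illustration; this is not the real org.apache.hadoop.fs.s3a.Invoker API) of why the inner call should be once() when the caller already retries: nesting retry() inside retry() multiplies the attempt counts, while retry() around once() keeps a single retry loop.

{noformat}
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the S3A Invoker, used only to count attempts.
public class RetryLayeringSketch {

  interface Operation<T> {
    T execute() throws IOException;
  }

  static class MiniInvoker {
    private final int maxAttempts;
    MiniInvoker(int maxAttempts) { this.maxAttempts = maxAttempts; }

    // retry(): loop until success or the attempt budget is exhausted.
    <T> T retry(String action, Operation<T> op) throws IOException {
      IOException last = null;
      for (int i = 0; i < maxAttempts; i++) {
        try {
          return op.execute();
        } catch (IOException e) {
          last = e;            // a real invoker would sleep/back off here
        }
      }
      throw last;
    }

    // once(): run the operation a single time; failures propagate to the caller.
    <T> T once(String action, Operation<T> op) throws IOException {
      return op.execute();
    }
  }

  public static void main(String[] args) {
    MiniInvoker invoker = new MiniInvoker(3);
    AtomicInteger getObjectCalls = new AtomicInteger();

    // Stand-in for client.getObject(): always fails, so every layer retries fully.
    Operation<String> flakyGet = () -> {
      getObjectCalls.incrementAndGet();
      throw new IOException("simulated S3 failure");
    };

    // v11 shape: retry() inside retry() -> 3 x 3 = 9 simulated GETs.
    try {
      invoker.retry("read", () -> invoker.retry("reopen", flakyGet));
    } catch (IOException expected) {
      System.out.println("retry-inside-retry attempts: " + getObjectCalls.get());
    }

    // v12 shape: retry() around once() -> 3 simulated GETs.
    getObjectCalls.set(0);
    try {
      invoker.retry("read", () -> invoker.once("reopen", flakyGet));
    } catch (IOException expected) {
      System.out.println("retry-around-once attempts: " + getObjectCalls.get());
    }
  }
}
{noformat}

With maxAttempts = 3, the nested version issues 9 simulated GETs before surfacing the error, while the once()-based version issues 3, which is the behaviour the v12 patch is after.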
Here is the delta from v11:

{noformat}
diff -ur ./src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java /Users/fabbri/Code/hadoop/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
--- ./src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java	2018-03-01 20:34:33.000000000 -0800
+++ /Users/fabbri/Code/hadoop/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java	2018-03-01 20:09:05.000000000 -0800
@@ -154,7 +154,7 @@
    * @param length length requested
    * @throws IOException on any failure to open the object
    */
-  @Retries.RetryTranslated
+  @Retries.OnceTranslated
   private synchronized void reopen(String reason, long targetPos, long length)
       throws IOException {
@@ -178,7 +178,7 @@
     }
     String text = String.format("Failed to %s %s at %d",
         (opencount == 0 ? "open" : "re-open"), uri, targetPos);
-    S3Object object = context.invoker.retry(text, uri, true,
+    S3Object object = context.getReadInvoker().once(text, uri,
         () -> client.getObject(request));
     wrappedStream = object.getObjectContent();
     contentRangeStart = targetPos;
@@ -349,6 +349,12 @@
       return -1;
     }
 
+    try {
+      lazySeek(nextReadPos, 1);
+    } catch (EOFException e) {
+      return -1;
+    }
+
     // With S3Guard, the metadatastore gave us metadata for the file in
     // open(), so we use a slightly different retry policy.
     // read() may not be likely to fail, but reopen() does a GET which
@@ -358,7 +364,6 @@
         () -> {
           int b;
           try {
-            lazySeek(nextReadPos, 1);
             b = wrappedStream.read();
           } catch (EOFException e) {
             return -1;
@@ -387,7 +392,7 @@
    * @param length length of data being attempted to read
    * @throws IOException any exception thrown on the re-open attempt.
    */
-  @Retries.RetryTranslated
+  @Retries.OnceTranslated
   private void onReadFailure(IOException ioe, int length) throws IOException {
     LOG.info("Got exception while trying to read from stream {}" +
@@ -439,7 +444,6 @@
       try {
         bytes = wrappedStream.read(buf, off, len);
       } catch (EOFException e) {
-        onReadFailure(e, len);
         // the base implementation swallows EOFs.
         return -1;
       } catch (IOException e) {
{noformat}


> S3Guard: implement retries for DDB failures and throttling; translate exceptions
> ---------------------------------------------------------------------------------
>
>                 Key: HADOOP-13761
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13761
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.0.0-beta1
>            Reporter: Aaron Fabbri
>            Assignee: Aaron Fabbri
>            Priority: Blocker
>         Attachments: HADOOP-13761-004-to-005.patch, HADOOP-13761-005-to-006-approx.diff.txt, HADOOP-13761-005.patch, HADOOP-13761-006.patch, HADOOP-13761-007.patch, HADOOP-13761-008.patch, HADOOP-13761-009.patch, HADOOP-13761-010.patch, HADOOP-13761-010.patch, HADOOP-13761-011.patch, HADOOP-13761-012.patch, HADOOP-13761.001.patch, HADOOP-13761.002.patch, HADOOP-13761.003.patch, HADOOP-13761.004.patch
>
> Following the S3AFileSystem integration patch in HADOOP-13651, we need to add retry logic.
>
> In HADOOP-13651, I added TODO comments in most of the places where retry loops are needed, including:
> - open(path): if the MetadataStore reflects a recent create/move of the file path but we fail to read it from S3, retry.
> - delete(path): if deleteObject() on S3 fails but the MetadataStore shows the file exists, retry.
> - rename(src, dest): if the source path is not yet visible in S3, retry.
> - listFiles(): skip for now; not currently implemented in S3Guard. I will create a separate JIRA for this, as it will likely require interface changes (i.e. prefix or subtree scan).
>
> We may miss some cases initially, and we should do failure-injection testing to make sure we're covered. Failure-injection tests can be a separate JIRA to make this easier to review.
>
> We also need basic configuration parameters around the retry policy. There should be a way to specify a maximum retry duration, as some applications would rather receive an error eventually than wait indefinitely. We should also keep statistics whenever inconsistency is detected and we enter a retry loop.
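The description's request for a retry policy with a maximum duration could look roughly like the following standalone sketch; retryFor(), its parameters, and the example values are invented here for illustration and are not existing Hadoop or S3A APIs or configuration keys.

{noformat}
import java.io.IOException;

// Illustrative duration-bounded retry loop: callers eventually get the error
// instead of waiting indefinitely, and the retry count could feed a statistic.
public class BoundedRetrySketch {

  interface Operation<T> {
    T execute() throws IOException;
  }

  static <T> T retryFor(long maxDurationMs, long intervalMs, Operation<T> op)
      throws IOException {
    long deadline = System.currentTimeMillis() + maxDurationMs;
    int retries = 0;   // would be reported as an "inconsistency retries" statistic
    while (true) {
      try {
        return op.execute();
      } catch (IOException e) {
        if (System.currentTimeMillis() >= deadline) {
          throw new IOException("Giving up after " + retries + " retries", e);
        }
        retries++;
        try {
          Thread.sleep(intervalMs);   // a real policy would likely back off
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted while retrying", ie);
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // Example: tolerate eventual-consistency delays for up to 5 seconds,
    // probing every 250 ms.
    String data = retryFor(5_000, 250, () -> "contents of s3a://bucket/key");
    System.out.println(data);
  }
}
{noformat}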