[ https://issues.apache.org/jira/browse/HADOOP-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868538#comment-17868538 ]
ASF GitHub Bot commented on HADOOP-19221:
-----------------------------------------

shameersss1 commented on code in PR #6938:
URL: https://github.com/apache/hadoop/pull/6938#discussion_r1690798138


##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/AWSStatus500Exception.java:
##########

@@ -22,21 +22,20 @@
 /**
  * A 5xx response came back from a service.
- * The 500 error considered retriable by the AWS SDK, which will have already
+ * <p>
+ * The 500 error is considered retryable by the AWS SDK, which will have already
  * tried it {@code fs.s3a.attempts.maximum} times before reaching s3a
  * code.
- * How it handles other 5xx errors is unknown: S3A FS code will treat them
- * as unrecoverable on the basis that they indicate some third-party store
- * or gateway problem.
+ * <p>
+ * These are rare, but can occur; they are considered retryable.
+ * Note that HADOOP-19221 shows a failure condition where the
+ * SDK itself did not recover on retry from the error.
+ * Mitigation for the specific failure sequence is now in place.
  */
 public class AWSStatus500Exception extends AWSServiceIOException {
   public AWSStatus500Exception(String operation,
       AwsServiceException cause) {
     super(operation, cause);
   }

-  @Override
-  public boolean retryable() {

Review Comment:
   Will this make all 500 responses retryable? I mean, if S3 throws an
   exception such as a 500 Internal Server Error, do we need to retry from
   the S3A client as well?


> S3A: Unable to recover from failure of multipart block upload attempt
> "Status Code: 400; Error Code: RequestTimeout"
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-19221
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19221
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.4.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>
> If a multipart PUT request fails for some reason (e.g. a network error),
> then all subsequent retry attempts fail with a 400 response and error code
> RequestTimeout.
> {code}
> Your socket connection to the server was not read from or written to within
> the timeout period. Idle connections will be closed. (Service: Amazon S3;
> Status Code: 400; Error Code: RequestTimeout; Request ID:; S3 Extended
> Request ID:
> {code}
> The list of suppressed exceptions contains the root cause (the initial
> failure was a 500); all retries failed to upload properly from the source
> input stream {{RequestBody.fromInputStream(fileStream, size)}}.
> Hypothesis: the mark/reset mechanism does not work for input streams. On
> the v1 SDK we would build a multipart block upload request passing in
> (file, offset, length); the way we are doing this now does not recover.
> Probably fixable by providing our own {{ContentStreamProvider}}
> implementations for:
> # file + offset + length
> # bytebuffer
> # byte array
> The SDK does have explicit support for the memory ones, but they copy the
> data blocks first. We don't want that, as it would double the memory
> requirements of active blocks.
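For the file + offset + length case, a minimal sketch of such a
{{ContentStreamProvider}} could look like the following. This is an
illustration of the idea, not the actual PR code: the class name
{{FilePartContentStreamProvider}} and the bounded-stream helper are
hypothetical. Each call to {{newStream()}} re-opens the block file and skips
to the part offset, so an SDK retry always replays the part from its start
instead of resuming a half-consumed stream.

{code}
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;

import software.amazon.awssdk.http.ContentStreamProvider;

/**
 * Hypothetical sketch: serves one part of a block file as a fresh
 * stream on every call, so SDK retries re-read from the part offset.
 */
public class FilePartContentStreamProvider implements ContentStreamProvider {

  private final File file;
  private final long offset;
  private final long length;

  public FilePartContentStreamProvider(File file, long offset, long length) {
    this.file = file;
    this.offset = offset;
    this.length = length;
  }

  @Override
  public InputStream newStream() {
    try {
      InputStream in = Files.newInputStream(file.toPath());
      // Skip to the start of this part; skip() may advance less than
      // requested, so loop until the full offset is consumed.
      long toSkip = offset;
      while (toSkip > 0) {
        long skipped = in.skip(toSkip);
        if (skipped <= 0) {
          in.close();
          throw new IOException("Unable to seek to offset " + offset
              + " in " + file);
        }
        toSkip -= skipped;
      }
      return new BoundedInputStream(in, length);
    } catch (IOException e) {
      // newStream() declares no checked exceptions.
      throw new UncheckedIOException(e);
    }
  }

  /** Minimal wrapper so the part never reads past its declared length. */
  private static final class BoundedInputStream extends InputStream {
    private final InputStream in;
    private long remaining;

    BoundedInputStream(InputStream in, long remaining) {
      this.in = in;
      this.remaining = remaining;
    }

    @Override
    public int read() throws IOException {
      if (remaining <= 0) {
        return -1;
      }
      int b = in.read();
      if (b >= 0) {
        remaining--;
      }
      return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
      if (remaining <= 0) {
        return -1;
      }
      int n = in.read(buf, off, (int) Math.min(len, remaining));
      if (n > 0) {
        remaining -= n;
      }
      return n;
    }

    @Override
    public void close() throws IOException {
      in.close();
    }
  }
}
{code}

The request body would then be built with something like
{{RequestBody.fromContentProvider(provider, size, "application/octet-stream")}}
rather than {{RequestBody.fromInputStream(fileStream, size)}}, so the SDK
pulls a brand-new stream for every attempt and never depends on mark/reset.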