[ https://issues.apache.org/jira/browse/HADOOP-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868984#comment-17868984 ]
ASF GitHub Bot commented on HADOOP-19221:
-----------------------------------------

steveloughran commented on code in PR #6938:
URL: https://github.com/apache/hadoop/pull/6938#discussion_r1693359752


##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ARetryPolicy.java:
##########
@@ -228,15 +228,15 @@ protected Map<Class<? extends Exception>, RetryPolicy> createExceptionMap() {
     // throttled requests are can be retried, always
     policyMap.put(AWSServiceThrottledException.class, throttlePolicy);

-    // Status 5xx error code is an immediate failure
+    // Status 5xx error code has historically been treated as an immediate failure
     // this is sign of a server-side problem, and while
     // rare in AWS S3, it does happen on third party stores.
     // (out of disk space, etc).
     // by the time we get here, the aws sdk will have
-    // already retried.
+    // already retried, if it is configured to retry exceptions.
     // there is specific handling for some 5XX codes (501, 503);
     // this is for everything else
-    policyMap.put(AWSStatus500Exception.class, fail);
+    policyMap.put(AWSStatus500Exception.class, retryAwsClientExceptions);

Review Comment:
   See the full comment below.

   Along with that, I really don't like looking in error strings: way too brittle for production code. Even in tests I like to share the text across production and test classes as constants.

   (yes, I know about org.apache.hadoop.fs.s3a.impl.ErrorTranslation ...doesn't mean I like it)


> S3A: Unable to recover from failure of multipart block upload attempt "Status Code: 400; Error Code: RequestTimeout"
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-19221
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19221
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.4.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>
> If a multipart PUT request fails for some reason (e.g. network error), then
> all subsequent retry attempts fail with a 400 response and error code
> RequestTimeout.
> {code}
> Your socket connection to the server was not read from or written to within
> the timeout period. Idle connections will be closed. (Service: Amazon S3;
> Status Code: 400; Error Code: RequestTimeout; Request ID:; S3 Extended
> Request ID:
> {code}
> The list of suppressed exceptions contains the root cause (the initial
> failure was a 500); all retries failed to upload properly from the source
> input stream {{RequestBody.fromInputStream(fileStream, size)}}.
> Hypothesis: the mark/reset stuff doesn't work for input streams. On the v1
> SDK we would build a multipart block upload request passing in (file, offset,
> length); the way we are now doing this doesn't recover.
> Probably fixable by providing our own {{ContentStreamProvider}}
> implementations for
> # file + offset + length
> # bytebuffer
> # byte array
> The SDK does have explicit support for the memory ones, but they copy the
> data blocks first. We don't want that, as it would double the memory
> requirements of active blocks.
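[Editor's note] A small illustration of the "share the text across production and test classes as constants" point from the review comment above. The class and constant names here are made up for the example; they are not from the actual Hadoop code.

{code}
// Hypothetical sketch: the production class owns the error text as a constant ...
public final class RetryErrorText {

  /** Message text S3 returns on an idle/timed-out socket. */
  public static final String REQUEST_TIMEOUT_MESSAGE =
      "Your socket connection to the server was not read from or written to"
          + " within the timeout period";

  private RetryErrorText() {
  }
}

// ... and the test asserts against the same constant rather than a
// copy-pasted literal, so the two can never drift apart:
//
//   Assertions.assertThat(ex.getMessage())
//       .contains(RetryErrorText.REQUEST_TIMEOUT_MESSAGE);
{code}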
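[Editor's note] For the {{ContentStreamProvider}} idea in the description, here is a minimal sketch of the "file + offset + length" case. It assumes the AWS SDK v2 {{software.amazon.awssdk.http.ContentStreamProvider}} interface and uses commons-io's {{BoundedInputStream}} for convenience; the class name {{FilePartContentStreamProvider}} is hypothetical and this is not the actual Hadoop patch. The key property is that {{newStream()}} re-opens the file on every call, so a retried part upload re-reads the block instead of draining an already-consumed stream.

{code}
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

import org.apache.commons.io.input.BoundedInputStream;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.http.ContentStreamProvider;

/** Hypothetical sketch: serve one block of a file as regenerable request content. */
public class FilePartContentStreamProvider implements ContentStreamProvider {

  private final File file;
  private final long offset;
  private final long length;

  public FilePartContentStreamProvider(File file, long offset, long length) {
    this.file = file;
    this.offset = offset;
    this.length = length;
  }

  @Override
  public InputStream newStream() {
    // Called by the SDK for the initial attempt and again on every retry,
    // so each attempt gets a fresh stream positioned at the block start.
    try {
      FileInputStream in = new FileInputStream(file);
      long remaining = offset;
      while (remaining > 0) {
        long skipped = in.skip(remaining);
        if (skipped <= 0) {
          in.close();
          throw new IOException("Unable to seek to offset " + offset + " in " + file);
        }
        remaining -= skipped;
      }
      // Bound the stream so the SDK reads exactly this block and no more.
      return new BufferedInputStream(new BoundedInputStream(in, length));
    } catch (IOException e) {
      // newStream() does not declare IOException, so wrap it.
      throw new UncheckedIOException(e);
    }
  }

  /** A RequestBody whose content can be regenerated for every attempt. */
  public RequestBody asRequestBody() {
    return RequestBody.fromContentProvider(this, length, "application/octet-stream");
  }
}
{code}

The bytebuffer and byte array variants would follow the same pattern, returning a fresh stream over the existing block on each call, so nothing is copied and the memory footprint of active blocks stays the same.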