Steve Loughran created HADOOP-19734:
---------------------------------------
Summary: S3A: retry on MPU completion failure "One or more of the
specified parts could not be found"
Key: HADOOP-19734
URL: https://issues.apache.org/jira/browse/HADOOP-19734
Project: Hadoop Common
Issue Type: Sub-task
Components: fs/s3
Affects Versions: 3.4.2
Environment: aws s3 london
Reporter: Steve Loughran
Experienced a transient failure in a test run of
https://github.com/apache/hadoop/pull/7882 : all MPU complete POSTs failed
because the request or parts were not found. The tests started succeeding
60-90s later *and* a "hadoop s3guard uploads" call listed the outstanding
uploads of the failing tests.
Hypothesis: a transient failure meant the server receiving the POST calls to
complete the uploads was mistakenly reporting no upload IDs.
Outcome: all active write operations failed, without any retry attempts. This
can lose data and fail jobs, even though the store may recover.
Proposed: multipart uploads, especially those from the block output stream,
should retry on this error, treating it as a connectivity issue.
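
A minimal sketch of what the proposed retry could look like, in plain Java with
no Hadoop or AWS SDK dependencies. Everything here is hypothetical and for
illustration only: the class name MpuCompleteRetry, the isRetryable() classifier,
and the fixed-delay policy are assumptions, not the S3A implementation. Matching
on the message text is a simplification; a real fix would more likely inspect
the service error code and reuse the existing S3A retry policy.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: retry an operation when it fails with the "parts not found" error. */
final class MpuCompleteRetry {

    /** Error text quoted in the issue summary. */
    static final String PARTS_NOT_FOUND =
        "One or more of the specified parts could not be found";

    private MpuCompleteRetry() {
    }

    /** Assumption: any exception carrying the quoted text is treated as transient. */
    static boolean isRetryable(Exception e) {
        String message = e.getMessage();
        return message != null && message.contains(PARTS_NOT_FOUND);
    }

    /**
     * Invoke the operation up to maxAttempts times, sleeping delayMillis between
     * attempts. Non-retryable failures and the final failure are rethrown unchanged.
     */
    static <T> T retry(Callable<T> operation, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                if (!isRetryable(e) || attempt == maxAttempts) {
                    throw e;
                }
                Thread.sleep(delayMillis);
            }
        }
        // The loop always returns or throws; this satisfies the compiler.
        throw new IllegalStateException("unreachable");
    }
}
```

The caller would wrap the completion request in the Callable. Given the 60-90s
recovery time observed above, the attempt count and delay would need to span at
least that window for the retry to help.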