Steve Loughran created HADOOP-19734:
---------------------------------------

             Summary: S3A: retry on MPU completion failure "One or more of the 
specified parts could not be found"
                 Key: HADOOP-19734
                 URL: https://issues.apache.org/jira/browse/HADOOP-19734
             Project: Hadoop Common
          Issue Type: Sub-task
          Components: fs/s3
    Affects Versions: 3.4.2
         Environment: aws s3 london
            Reporter: Steve Loughran



Experienced a transient failure in a test run of 
https://github.com/apache/hadoop/pull/7882 : all MPU completion POSTs failed 
because the request or parts were not found. The tests started succeeding 
60-90s later *and* a "hadoop s3guard uploads" call listed the outstanding 
uploads of the failing tests.

Hypothesis: a transient failure meant the server receiving the POST calls to 
complete the uploads was mistakenly reporting that the upload IDs did not exist.

Outcome: all active write operations failed, without any retry attempts. This 
can lose data and fail jobs, even though the store may recover.

Proposed: multipart uploads, especially those from the block output stream, 
should retry on this error, treating it as a connectivity issue.
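
A minimal sketch of the proposed behaviour, in plain Java and not based on the 
actual S3A code. The class MpuCompletionRetry, the methods isRetryable and 
completeWithRetry, and the attempt/delay parameters are all hypothetical names 
for illustration; matching on the message text is an assumption, as a real 
implementation would more likely inspect the service's error code.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: retry an MPU completion call on the "parts not found" error. */
final class MpuCompletionRetry {

  /** Error text quoted in the issue summary. */
  static final String PARTS_NOT_FOUND =
      "One or more of the specified parts could not be found";

  private MpuCompletionRetry() {
  }

  /** Assumption: the failure is recognised by its message text. */
  static boolean isRetryable(Exception e) {
    String message = e.getMessage();
    return message != null && message.contains(PARTS_NOT_FOUND);
  }

  /**
   * Invoke the completion call, retrying only on the "parts not found" error.
   * Any other exception is rethrown immediately; once the attempts are
   * exhausted, the last retryable failure is rethrown.
   */
  static <T> T completeWithRetry(Callable<T> completeCall,
      int maxAttempts, long delayMillis) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return completeCall.call();
      } catch (Exception e) {
        if (!isRetryable(e)) {
          throw e;
        }
        last = e;
        if (attempt < maxAttempts) {
          // Fixed delay between attempts; a backoff policy would also work.
          Thread.sleep(delayMillis);
        }
      }
    }
    throw last;
  }
}
```

Since the tests recovered after 60-90s, the attempt count and delay would need 
to span at least that window, for example 10 attempts at 15 seconds each; 
those numbers are illustrative only.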


