Eric Payne created TEZ-4400:
-------------------------------

             Summary: Tez takes a long time to recover from shuffle data not found errors
                 Key: TEZ-4400
                 URL: https://issues.apache.org/jira/browse/TEZ-4400
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Eric Payne


Recently, many nodes had their shuffle data wiped during an NM upgrade, and 
many Tez jobs took far too long to recover. Recovery from this condition 
should be quick. The NM returns an error code indicating that the shuffle data 
was not found, and that alone is sufficient evidence that no amount of 
retrying is likely to fix the problem. As soon as the NM reports shuffle data 
as not found, the task should report the not-found error to the AM, and the AM 
should treat even a single not-found error as sufficient cause to re-run the 
upstream task.
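
The proposed behavior could be sketched roughly as follows. This is only an 
illustration of the decision rule, not Tez code: the class name, the enum, the 
method, and the retry limit are all hypothetical and do not correspond to any 
existing Tez or YARN API.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the AM-side decision described in this issue.
 * None of these names are real Tez or YARN APIs.
 */
final class ShuffleFailurePolicy {

    /** Assumed classification of a fetch failure reported by a downstream task. */
    enum FetchError { DATA_NOT_FOUND, CONNECTION_FAILED }

    // Assumed limit: how many transient failures are tolerated per upstream task.
    private final int transientFailureLimit;
    private final Map<String, Integer> transientFailures = new HashMap<>();

    ShuffleFailurePolicy(int transientFailureLimit) {
        this.transientFailureLimit = transientFailureLimit;
    }

    /** Returns true when the upstream task should be re-run. */
    boolean shouldRerunUpstream(String upstreamTaskId, FetchError error) {
        if (error == FetchError.DATA_NOT_FOUND) {
            // The data is gone, so retrying cannot help: re-run on the first report.
            return true;
        }
        // Other errors may be transient, so count them and re-run only at the limit.
        int count = transientFailures.merge(upstreamTaskId, 1, Integer::sum);
        return count >= transientFailureLimit;
    }
}
```

With an assumed limit of 3, a single DATA_NOT_FOUND report triggers a re-run 
immediately, while CONNECTION_FAILED reports trigger one only on the third 
occurrence for the same upstream task.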



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
