Eric Payne created TEZ-4400:
-------------------------------
Summary: Tez takes a long time to recover from shuffle data not
found errors
Key: TEZ-4400
URL: https://issues.apache.org/jira/browse/TEZ-4400
Project: Apache Tez
Issue Type: Bug
Reporter: Eric Payne
Recently, many nodes had their shuffle data wiped during an NM upgrade, and
many of the Tez jobs took far too long to recover. Recovery from this
condition should be quick. The NM returns an error code indicating that the
shuffle data was not found, and that alone is sufficient evidence that no
amount of retrying is likely to fix the issue. As soon as the NM reports
shuffle data as not found, the task should report the not-found error to the
AM, and the AM should treat even a single not-found error as sufficient cause
to re-run the upstream task.
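
The proposed behavior could be sketched roughly as follows. This is only an
illustration of the decision rule, not Tez's actual code: the names
FetchError, Action, FetchFailurePolicy, and maxTransientFailures are all
hypothetical, and the threshold for other errors is an assumption standing in
for whatever retry accounting the AM currently does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical error categories a fetcher might report to the AM.
enum FetchError { NOT_FOUND, CONNECTION_FAILED, READ_TIMEOUT }

// Hypothetical decisions the AM can take for an upstream task attempt.
enum Action { RETRY_FETCH, RERUN_UPSTREAM_TASK }

// Sketch of the proposed rule: a single NOT_FOUND is decisive, while
// other (possibly transient) errors still accumulate toward a threshold.
final class FetchFailurePolicy {
    private final int maxTransientFailures;
    private final Map<String, Integer> transientFailures = new HashMap<>();

    FetchFailurePolicy(int maxTransientFailures) {
        this.maxTransientFailures = maxTransientFailures;
    }

    Action onFetchFailure(String upstreamAttemptId, FetchError error) {
        // The NM has stated the data is gone, so retrying cannot help.
        if (error == FetchError.NOT_FOUND) {
            return Action.RERUN_UPSTREAM_TASK;
        }
        // Other errors may be transient: count them per upstream attempt.
        int count = transientFailures.merge(upstreamAttemptId, 1, Integer::sum);
        return count >= maxTransientFailures
                ? Action.RERUN_UPSTREAM_TASK
                : Action.RETRY_FETCH;
    }
}
```

With a threshold of 3, a connection failure for a given upstream attempt
would be retried twice before the attempt is re-run, whereas the first
NOT_FOUND for any attempt triggers the re-run immediately.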
--
This message was sent by Atlassian Jira
(v8.20.1#820001)