[
https://issues.apache.org/jira/browse/TEZ-4400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ayush Saxena resolved TEZ-4400.
-------------------------------
Resolution: Duplicate
> Tez takes a long time to recover from shuffle data not found errors
> -------------------------------------------------------------------
>
> Key: TEZ-4400
> URL: https://issues.apache.org/jira/browse/TEZ-4400
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Eric Payne
> Priority: Minor
>
> Recently, many nodes had their shuffle data wiped during an NM upgrade, and
> many of the Tez jobs took far too long to recover. Recovery from this
> condition should be quick. The NM returns an error code indicating that the
> shuffle data was not found, and that alone is sufficient evidence that
> retrying is unlikely to fix the issue. As soon as the NM reports shuffle data
> as not found, the task should report the not-found error to the AM, and the
> AM should treat even a single not-found error as sufficient cause to re-run
> the upstream task.
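
The behavior requested above could be sketched roughly as follows. This is only an illustration of the proposed decision rule, not Tez code: the class ShuffleFailurePolicy, its enums, and the maxTransientFailures threshold are all hypothetical names invented for this example, and the handling of errors other than "not found" is an assumption added for contrast.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names are Tez or YARN APIs.
final class ShuffleFailurePolicy {
    // Assumed error categories a fetcher might see when pulling shuffle data.
    enum FetchError { DATA_NOT_FOUND, CONNECTION_FAILED, READ_TIMEOUT }

    // What the AM decides to do after a reported fetch failure.
    enum Action { RETRY_FETCH, RERUN_UPSTREAM_TASK }

    private final int maxTransientFailures;
    // Count of transient failures seen per upstream task attempt.
    private final Map<String, Integer> transientFailures = new HashMap<>();

    ShuffleFailurePolicy(int maxTransientFailures) {
        this.maxTransientFailures = maxTransientFailures;
    }

    Action onFetchFailure(String upstreamAttemptId, FetchError error) {
        // A "not found" answer means the data is gone, so retrying cannot
        // help: a single report is enough to re-run the upstream task.
        if (error == FetchError.DATA_NOT_FOUND) {
            return Action.RERUN_UPSTREAM_TASK;
        }
        // Other errors may be transient, so tolerate a few before giving up.
        int failures = transientFailures.merge(upstreamAttemptId, 1, Integer::sum);
        return failures >= maxTransientFailures
                ? Action.RERUN_UPSTREAM_TASK
                : Action.RETRY_FETCH;
    }
}
```

With a threshold of 3, a single DATA_NOT_FOUND report triggers a re-run immediately, while a CONNECTION_FAILED report for the same attempt would first be retried.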