[ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645741#comment-13645741 ]
Chris Riccomini commented on YARN-614:
--------------------------------------

I've taken an initial stab at this. Looking for feedback.

I added an ignoredFailures variable to RMAppImpl, which keeps a count of AM failures that should be ignored when deciding whether to retry the AM with a new app attempt. Right now, the failures that are ignored are DISK_FAILURE and ABORTED.

Since the ignoredFailures variable is completely derivable from the app attempt state, I simply start ignoredFailures at 0 and increment it whenever a failure happens that should be ignored. When recover() is called on an app, we recover all attempts (and all of their justFinishedContainers), and then update the ignoredFailures variable accordingly.

Potential areas for improvement:

1. Switch RMAppAttemptImpl to have a map of ContainerId to ContainerStatus, so we can do an O(1) lookup instead of traversing the justFinishedContainers list every time we want to look for the master container's status.
2. Add tests.
3. Add a shouldIgnoreFailure method in RMAppImpl, and move the DISK_FAILURE and ABORTED checks there.

Any other thoughts?

> Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-614
>                 URL: https://issues.apache.org/jira/browse/YARN-614
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>         Attachments: YARN-614-0.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be retried unnecessarily. The only reason YARN should retry an attempt is when the hardware fails or YARN has an error. NM failing, lost NM and NM disk errors are the hardware errors that come to mind.
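
For readers without the patch at hand, the ignoredFailures bookkeeping described in the comment could look roughly like the sketch below. This is not the code in YARN-614-0.patch. The class IgnoredFailureSketch, the FailureCause enum, the App class, the maxAttempts parameter, and the retry rule (failures minus ignored failures must stay below maxAttempts) are all assumptions made for illustration. Only the names ignoredFailures, shouldIgnoreFailure, recover, DISK_FAILURE and ABORTED come from the comment.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

class IgnoredFailureSketch {
    // Hypothetical failure causes. The names mirror the comment,
    // not any actual YARN constants.
    enum FailureCause { DISK_FAILURE, ABORTED, USER_ERROR }

    // Failures that should not count against the app's attempt budget.
    static final Set<FailureCause> IGNORED =
            EnumSet.of(FailureCause.DISK_FAILURE, FailureCause.ABORTED);

    // Item 3 in the comment: one place that holds the ignore checks.
    static boolean shouldIgnoreFailure(FailureCause cause) {
        return IGNORED.contains(cause);
    }

    // Hypothetical stand-in for the app-level state (not RMAppImpl itself).
    static class App {
        private final int maxAttempts; // assumed retry budget
        private final List<FailureCause> failedAttempts = new ArrayList<>();
        private int ignoredFailures = 0;

        App(int maxAttempts) {
            this.maxAttempts = maxAttempts;
        }

        // Called when an attempt fails: record it and bump the counter
        // if the cause is one that should be ignored.
        void onAttemptFailed(FailureCause cause) {
            failedAttempts.add(cause);
            if (shouldIgnoreFailure(cause)) {
                ignoredFailures++;
            }
        }

        // Assumed rule: only non-ignored failures use up the budget.
        boolean shouldRetry() {
            return failedAttempts.size() - ignoredFailures < maxAttempts;
        }

        // The counter is derivable from attempt state, so recovery
        // rebuilds it from the recovered attempts instead of persisting it.
        void recover(List<FailureCause> recoveredFailures) {
            failedAttempts.clear();
            failedAttempts.addAll(recoveredFailures);
            ignoredFailures = 0;
            for (FailureCause cause : failedAttempts) {
                if (shouldIgnoreFailure(cause)) {
                    ignoredFailures++;
                }
            }
        }

        int getIgnoredFailures() {
            return ignoredFailures;
        }
    }
}
```

Under these assumptions, an App created with maxAttempts of 1 would still retry after a DISK_FAILURE but would stop after the first USER_ERROR. Rebuilding the counter inside recover() keeps it out of the persisted state, which matches the observation in the comment that it is derivable from the attempts.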