[ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648604#comment-13648604 ]
Chris Riccomini commented on YARN-614:
--------------------------------------

I reverted the List->Map change, getIgnoredFailures, and the changes to MockAsm and MockRMApp, per your request. I also moved all of the "get container status" logic into the updateFailureCount method, per your request.

Agreed, let's punt on recovery. I removed all of the recovery code. Now, ignoredFailures is only used in AttemptFailedTransition.

I believe aborted does imply lost. Per the docs:

{code}
/**
 * Containers killed by the framework, either due to being released by
 * the application or being 'lost' due to node failures etc.
 */
public static final int ABORTED = -100;
{code}

I poked around a bit in the code, and it seems that ABORTED is used as a container status when:

1. The node is lost.
2. The scheduler kills off (preempts) a container due to capacity limits.
3. The AM releases one of its containers.

Only 1 and 2 can happen to the AM's own container, since only YARN itself can release the AM. Neither of these is the AM's fault, so I think it's fine to increment ignoredFailures when we see ABORTED. If you want to split the failure types, I think another Jira is the best place to do that.

TODO:

1. We need to check whether the "justFinished" containers always have an entry for the master container, especially in the case where the node is lost because it went down.
2. Tests.

> Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
> --------------------------------------------------------------------------------------------------
>
>         Key: YARN-614
>         URL: https://issues.apache.org/jira/browse/YARN-614
>     Project: Hadoop YARN
>  Issue Type: Improvement
>    Reporter: Bikas Saha
> Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be
> retried unnecessarily. The only reason YARN should retry an attempt is when
> the hardware fails or YARN has an error.
> NM failing, lost NM and NM disk errors are the hardware errors that come to mind.
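
For readers without the patch at hand, the ABORTED handling described in the comment can be sketched roughly as follows. This is a minimal, standalone illustration, not the code in the YARN-614 patches: FinishedContainer and AttemptFailureCounter are hypothetical stand-ins, canRetry is an assumed retry rule, and only the ABORTED value (-100) and the names updateFailureCount, ignoredFailures, and justFinished come from the discussion. The behaviour when the master container has no entry is an assumption, since that is exactly the open question in TODO 1.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a finished container's status; not the YARN class. */
final class FinishedContainer {
    final String containerId;
    final int exitStatus;

    FinishedContainer(String containerId, int exitStatus) {
        this.containerId = containerId;
        this.exitStatus = exitStatus;
    }
}

/** Hypothetical sketch of counting attempt failures while ignoring ABORTED exits. */
final class AttemptFailureCounter {
    /** Same value as the ABORTED constant quoted in the comment. */
    static final int ABORTED = -100;

    private int failures;        // failures that count against the retry limit
    private int ignoredFailures; // failures attributed to YARN or hardware

    /**
     * Looks up the master container among the just-finished containers and
     * bumps the matching counter. Returns false when no entry is found
     * (assumption: the caller decides what to do in that case; see TODO 1).
     */
    boolean updateFailureCount(String masterContainerId,
                               List<FinishedContainer> justFinished) {
        for (FinishedContainer c : justFinished) {
            if (c.containerId.equals(masterContainerId)) {
                if (c.exitStatus == ABORTED) {
                    ignoredFailures++; // node lost or preempted: not the AM's fault
                } else {
                    failures++;        // any other exit counts as a real failure
                }
                return true;
            }
        }
        return false;
    }

    /** Assumed retry rule: only non-ignored failures use up retries. */
    boolean canRetry(int maxFailures) {
        return failures < maxFailures;
    }

    int getFailures() { return failures; }

    int getIgnoredFailures() { return ignoredFailures; }

    public static void main(String[] args) {
        AttemptFailureCounter counter = new AttemptFailureCounter();
        counter.updateFailureCount("am_1",
            Arrays.asList(new FinishedContainer("am_1", ABORTED)));
        System.out.println("failures=" + counter.getFailures()
            + " ignored=" + counter.getIgnoredFailures()
            + " canRetry=" + counter.canRetry(1));
    }
}
```

Keeping the two counters separate means an attempt lost to a node failure or preemption does not use up the application's retry budget, which matters if the default number of retries is lowered to 1 as the issue title proposes.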