[ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042910#comment-14042910 ]
Steve Loughran commented on YARN-614:
-------------------------------------

I like this, but need to note one thing: our AM has a "suicide <delay>" IPC method which we use for testing AM failure - we tell the AM to kill itself and then YARN brings it up somewhere else. It's essential that -somehow- I can replicate this behavior on live clusters.

Is there a way to do it here? Perhaps an exit code from the AM that says "please restart". That would also allow live AMs to trigger a restart if they actually felt they were in a bad way.

> Retry attempts automatically for hardware failures or YARN issues and set
> default app retries to 1
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-614
>                 URL: https://issues.apache.org/jira/browse/YARN-614
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>            Assignee: Xuan Gong
>             Fix For: 2.5.0
>
>         Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch,
> YARN-614-3.patch, YARN-614-4.patch, YARN-614-5.patch, YARN-614-6.patch,
> YARN-614.7.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be
> retried unnecessarily. The only reason YARN should retry an attempt is when
> the hardware fails or YARN has an error. NM failing, lost NM and NM disk
> errors are the hardware errors that come to mind.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
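
For illustration only, the retry rule in the issue description, together with the "please restart" exit code suggested in the comment, might be sketched roughly as below. This is not code from any of the attached patches: the class AttemptRetrySketch, the FailureCause enum and all of its values are hypothetical names, and treating an AM-requested restart as exempt from the retry limit is an assumption about how the suggestion could be handled, not something the issue confirms.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of the retry rule discussed in YARN-614; not actual YARN code. */
public class AttemptRetrySketch {

    /** Made-up failure causes; these are NOT YARN's real exit-status constants. */
    enum FailureCause {
        USER_ERROR,           // application bug, bad configuration, and so on
        NM_FAILED,            // the NodeManager itself failed
        NM_LOST,              // the NodeManager was lost
        NM_DISK_ERROR,        // NodeManager disk errors
        AM_REQUESTED_RESTART  // assumed "please restart" exit code from the AM
    }

    /** Causes that are assumed not to count against the application's attempt limit. */
    private static final Set<FailureCause> EXEMPT = EnumSet.of(
            FailureCause.NM_FAILED,
            FailureCause.NM_LOST,
            FailureCause.NM_DISK_ERROR,
            FailureCause.AM_REQUESTED_RESTART);

    private final int maxAttempts;
    private int countedFailures = 0;

    AttemptRetrySketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Only failures outside the exempt set use up the retry budget. */
    static boolean countsTowardLimit(FailureCause cause) {
        return !EXEMPT.contains(cause);
    }

    /** Records a failed attempt and reports whether another attempt should start. */
    boolean onAttemptFailed(FailureCause cause) {
        if (countsTowardLimit(cause)) {
            countedFailures++;
        }
        return countedFailures < maxAttempts;
    }

    public static void main(String[] args) {
        // With the proposed default of 1, a user error ends the application,
        // while hardware failures and an AM-requested restart still get retried.
        AttemptRetrySketch policy = new AttemptRetrySketch(1);
        System.out.println(policy.onAttemptFailed(FailureCause.NM_LOST));              // true
        System.out.println(policy.onAttemptFailed(FailureCause.AM_REQUESTED_RESTART)); // true
        System.out.println(policy.onAttemptFailed(FailureCause.USER_ERROR));           // false
    }
}
```

Under these assumptions, a test that tells the AM to kill itself would still see a new attempt only if the AM exits with the restart code; an ordinary crash would be treated like any other user error.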