[ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645741#comment-13645741 ]

Chris Riccomini commented on YARN-614:
--------------------------------------

I've taken an initial stab at this and am looking for feedback. I added an 
ignoredFailures counter to RMAppImpl that tracks the AM failures that should 
not count when deciding whether to retry the AM with a new app attempt. 
Currently, the ignored failures are DISK_FAILURE and ABORTED. Because 
ignoredFailures is completely derivable from the app attempt state, it starts 
at 0 and is incremented whenever an ignorable failure occurs. When recover() 
is called on an app, we recover all attempts (and all of their 
justFinishedContainers), and then update ignoredFailures accordingly.
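
To make the counting concrete, here is a minimal, self-contained sketch of the 
idea. It is not the code in the patch: ExitReason, AppRetryTracker, and all of 
their methods are hypothetical stand-ins for RMAppImpl and the real exit 
statuses, and the retry rule (failed attempts minus ignored failures, compared 
against the max) is an assumption about how the counter would be used.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the reason an AM container exited (not the YARN API).
enum ExitReason { DISK_FAILURE, ABORTED, USER_ERROR }

// Simplified, hypothetical model of an app that tracks failed AM attempts.
class AppRetryTracker {
    // Failures that should not count against the retry limit.
    private static final Set<ExitReason> IGNORED =
        EnumSet.of(ExitReason.DISK_FAILURE, ExitReason.ABORTED);

    private final int maxAttempts;
    private final List<ExitReason> failedAttempts = new ArrayList<>();
    private int ignoredFailures = 0; // starts at 0, derivable from the attempts

    AppRetryTracker(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    // Record a failed attempt; bump the counter if the failure is ignorable.
    void attemptFailed(ExitReason reason) {
        failedAttempts.add(reason);
        if (IGNORED.contains(reason)) {
            ignoredFailures++;
        }
    }

    // Assumed rule: only non-ignored failures count toward the limit.
    boolean shouldRetry() {
        return failedAttempts.size() - ignoredFailures < maxAttempts;
    }

    // On recovery, replay the recovered attempts to rebuild the counter.
    static AppRetryTracker recover(int maxAttempts, List<ExitReason> recovered) {
        AppRetryTracker tracker = new AppRetryTracker(maxAttempts);
        for (ExitReason reason : recovered) {
            tracker.attemptFailed(reason);
        }
        return tracker;
    }

    int getIgnoredFailures() {
        return ignoredFailures;
    }
}
```

Since the counter is rebuilt by replaying recovered attempts, nothing extra 
needs to be persisted in this sketch, which mirrors the reasoning above.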

Potential areas for improvement:

1. Switch RMAppAttemptImpl to have a map of ContainerId to ContainerStatus, so 
we can do an O(1) lookup instead of traversing the justFinishedContainers list 
every time we want to look for the master container's status.
2. Add tests.
3. Add a shouldIgnoreFailure method in RMAppImpl, and move the DISK_FAILURE 
and ABORTED checks there.
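
As a rough illustration of items 1 and 3 together, the sketch below keeps 
finished containers in a map keyed by container ID and puts the ignore check 
in one method. Everything here is hypothetical: SketchExit, AttemptSketch, and 
AppSketch are simplified stand-ins, container IDs are plain strings rather 
than ContainerId, and the exit reason replaces a full ContainerStatus.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical exit reasons for illustration only (not the YARN API).
enum SketchExit { SUCCESS, DISK_FAILURE, ABORTED, USER_ERROR }

// Hypothetical stand-in for RMAppAttemptImpl with a map instead of a list.
class AttemptSketch {
    private final String masterContainerId;
    // Container ID -> exit reason, giving an expected O(1) lookup.
    private final Map<String, SketchExit> finishedContainers = new HashMap<>();

    AttemptSketch(String masterContainerId) {
        this.masterContainerId = masterContainerId;
    }

    void containerFinished(String containerId, SketchExit exit) {
        finishedContainers.put(containerId, exit);
    }

    // Returns null if the master container has not finished yet.
    SketchExit masterExit() {
        return finishedContainers.get(masterContainerId);
    }
}

// Hypothetical stand-in for RMAppImpl holding the single ignore check.
class AppSketch {
    static boolean shouldIgnoreFailure(AttemptSketch attempt) {
        SketchExit exit = attempt.masterExit();
        return exit == SketchExit.DISK_FAILURE || exit == SketchExit.ABORTED;
    }
}
```

One thing this does not model: if justFinishedContainers is consumed or 
cleared elsewhere, a map would need to preserve that behavior too.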

Any other thoughts?
                
> Retry attempts automatically for hardware failures or YARN issues and set 
> default app retries to 1
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-614
>                 URL: https://issues.apache.org/jira/browse/YARN-614
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>         Attachments: YARN-614-0.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be 
> retried unnecessarily. The only reason YARN should retry an attempt is when 
> the hardware fails or YARN has an error. NM failing, lost NM and NM disk 
> errors are the hardware errors that come to mind.

