[ 
https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648604#comment-13648604
 ] 

Chris Riccomini commented on YARN-614:
--------------------------------------

Per your request, I reverted the List->Map change, getIgnoredFailures, and the 
changes to MockAsm and MockRMApp. I also moved all of the "get container 
status" logic into the updateFailureCount method.

Agreed, let's punt on recovery. I removed all of the recovery code, so 
ignoredFailures is now only used in AttemptFailedTransition.

I believe aborted does imply lost. Per the docs:

{code}
  /**
   * Containers killed by the framework, either due to being released by
   * the application or being 'lost' due to node failures etc.
   */
  public static final int ABORTED = -100;
{code}

I poked around a bit in the code, and it seems that ABORTED is used as a 
container status when:

1. The node is lost.
2. The scheduler kills off (preempts) a container due to capacity limits.
3. The AM releases one of its containers.

Only cases 1 and 2 can happen to an AM, since only YARN itself can release the 
AM. Neither of these is the AM's fault, so I think it's fine to increment 
ignoredFailures when we see ABORTED. If you want to split the failure types, I 
think a separate Jira is the best place to do that.
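
To make the counting rule concrete, here is a minimal, self-contained Java 
sketch. It is not the patch itself. The names updateFailureCount, 
ignoredFailures and ABORTED come from the discussion above; the class 
FailureCountSketch, the fields maxRetries and countedFailures, and the method 
shouldRetry are hypothetical and exist only for illustration.

```java
/**
 * Illustrative sketch only; not the YARN-614 patch.
 * Class name, maxRetries, countedFailures and shouldRetry() are hypothetical.
 */
class FailureCountSketch {
    /** Same value as the ABORTED constant quoted above. */
    public static final int ABORTED = -100;

    private final int maxRetries;     // hypothetical retry limit
    private int countedFailures = 0;  // failures charged to the application
    private int ignoredFailures = 0;  // failures not charged to the application

    public FailureCountSketch(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** Record one failed attempt, given the AM container's exit status. */
    public void updateFailureCount(int amContainerExitStatus) {
        if (amContainerExitStatus == ABORTED) {
            // Lost node or preemption: not the AM's fault, so do not count it.
            ignoredFailures++;
        } else {
            countedFailures++;
        }
    }

    /** Retry while the counted (non-ignored) failures are under the limit. */
    public boolean shouldRetry() {
        return countedFailures < maxRetries;
    }

    public int getIgnoredFailures() {
        return ignoredFailures;
    }

    public int getCountedFailures() {
        return countedFailures;
    }

    public static void main(String[] args) {
        FailureCountSketch app = new FailureCountSketch(1);
        app.updateFailureCount(ABORTED); // e.g. the node was lost
        System.out.println("retry after ABORTED: " + app.shouldRetry());
        app.updateFailureCount(1);       // e.g. the AM exited with an error
        System.out.println("retry after AM error: " + app.shouldRetry());
    }
}
```

With a limit of 1, an ABORTED attempt leaves the retry budget untouched, while 
a single non-ABORTED failure exhausts it. Splitting the failure types later 
would mean replacing the single ABORTED check with finer-grained cases.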

TODO:

1. We need to check whether the "justFinished" containers always include an 
entry for the master container, especially in the case where the node is lost 
because it went down.
2. Tests.
                
> Retry attempts automatically for hardware failures or YARN issues and set 
> default app retries to 1
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-614
>                 URL: https://issues.apache.org/jira/browse/YARN-614
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>         Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be 
> retried unnecessarily. The only reason YARN should retry an attempt is when 
> the hardware fails or YARN has an error. NM failing, lost NM and NM disk 
> errors are the hardware errors that come to mind.
