[jira] [Commented] (YARN-614) Separate AM failures from hardware failure or YARN error and do not count them to AM retry count

Xuan Gong (JIRA) Wed, 25 Jun 2014 16:45:07 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044184#comment-14044184
 ]


Xuan Gong commented on YARN-614:
--------------------------------

Ignore three types of failure with the following ContainerExitStatus:
* DISK_FAILURE
* ABORTED 
* KILL_BY_RESOURCEMANAGER

For ABORTED/KILL_BY_RESOURCEMANAGER:
* when an NM reconnects to the RM, is deactivated (DeactivateNode), or becomes 
unhealthy, all containers on that node are stopped with the ABORTED exit status
* or on CONTAINER_EXPIRED
* or on dropContainerReservation in RMContainerPreemptEvent
* or for all containers which are still alive or Reserved when the 
ApplicationAttempt is done
* or for all containers which are in the release list when the AM makes the 
allocate call
* or for all containers which are over-reserved when the scheduler processes 
the nodeUpdate
* on NMResync
* for unknown containers
* for unknown applications

Most of these scenarios will not happen to the ApplicationMaster container. But 
for the cases which might, I think that we can skip those failures and not 
count them toward the AM retry count.
Please correct me if I missed something.
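
To make the proposal concrete, here is a minimal sketch of the counting rule. 
It is not the code in the patch. The ExitStatus enum, the OTHER_FAILURE value 
and the method names are hypothetical stand-ins that mirror the names used in 
this comment, not necessarily the real YARN constants.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class AmRetrySketch {
    // Hypothetical stand-in for the exit statuses named in this comment.
    // OTHER_FAILURE represents any failure attributed to the AM itself.
    public enum ExitStatus { DISK_FAILURE, ABORTED, KILL_BY_RESOURCEMANAGER, OTHER_FAILURE }

    // Exit statuses treated as hardware/YARN faults rather than AM faults.
    private static final Set<ExitStatus> NOT_AM_FAULT = EnumSet.of(
        ExitStatus.DISK_FAILURE,
        ExitStatus.ABORTED,
        ExitStatus.KILL_BY_RESOURCEMANAGER);

    // True if a failed attempt with this exit status should count toward
    // the AM retry limit.
    public static boolean countsTowardRetryLimit(ExitStatus status) {
        return !NOT_AM_FAULT.contains(status);
    }

    // Number of failed attempts that count toward the AM retry limit.
    public static int countedFailures(List<ExitStatus> failedAttempts) {
        int counted = 0;
        for (ExitStatus status : failedAttempts) {
            if (countsTowardRetryLimit(status)) {
                counted++;
            }
        }
        return counted;
    }

    public static void main(String[] args) {
        List<ExitStatus> failedAttempts = Arrays.asList(
            ExitStatus.ABORTED,
            ExitStatus.OTHER_FAILURE,
            ExitStatus.DISK_FAILURE,
            ExitStatus.OTHER_FAILURE);
        // Only the two OTHER_FAILURE attempts are counted, so this prints 2.
        System.out.println(countedFailures(failedAttempts));
    }
}
```

With this rule, an application whose AM container was lost twice to node 
problems and failed twice on its own would have used two of its allowed 
attempts, not four.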

I also created a new patch, which is not much different from the previous one, 
except that I did not move the test case. I appended the new test cases to make 
them easier to review. Those test cases contain some duplicate code, which I 
will remove after we have finished the code review.

> Separate AM failures from hardware failure or YARN error and do not count 
> them to AM retry count
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-614
>                 URL: https://issues.apache.org/jira/browse/YARN-614
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>            Assignee: Xuan Gong
>             Fix For: 2.5.0
>
>         Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch, 
> YARN-614-3.patch, YARN-614-4.patch, YARN-614-5.patch, YARN-614-6.patch, 
> YARN-614.7.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be 
> retried unnecessarily. The only reason YARN should retry an attempt is when 
> the hardware fails or YARN has an error. NM failing, lost NM and NM disk 
> errors are the hardware errors that come to mind.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
