[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071113#comment-15071113 ]

Jun Gong commented on YARN-3998:
--------------------------------

Sorry for the late reply. I have just attached a patch for review and will add test cases later.

In the patch, I add *ContainerRetry* to specify a container’s retry strategy when the container fails to run. *ContainerRetry* has 4 fields:
* *ContainerRetryPolicy*: one of three policies: *NEVER_RETRY* (do not retry, whatever the error code is when the container fails to run), *ALWAYS_RETRY* (always retry, whatever the error code is), *RETRY_ON_SPECIFIC_ERROR_CODE* (retry only if the error code is one of *errorCodes*; otherwise do not retry).
* *errorCodes*: the error codes that trigger a retry under *RETRY_ON_SPECIFIC_ERROR_CODE*, as described above.
* *retryTimes*: how many times to retry when a retry is needed. A value of -1 means retry forever.
* *retryInterval*: how long to wait before relaunching the container, in seconds.

I store the container's remaining retry count in NMStateStore so that it survives an NM restart. I also modified distributed shell to support container retry, which helps a lot with testing.

> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
>                 Key: YARN-3998
>                 URL: https://issues.apache.org/jira/browse/YARN-3998
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3998.01.patch
>
>
> I'd like to add a field (retry-times) to ContainerLaunchContext. When the AM
> launches containers, it can specify the value. The NM will then re-launch the
> container up to 'retry-times' times when it fails to run (e.g. the exit code
> is not 0).
> This saves a lot of time: it avoids container localization, and the RM does
> not need to re-schedule the container. Local files in the container's working
> directory are also left in place for re-use. (If the container has downloaded
> some big files, it does not need to download them again when it runs again.)
> We find this useful in systems like Storm.

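
For readers who have not opened the patch, the retry decision described in the comment can be sketched roughly as below. This is only an illustration of the described semantics, not code from YARN-3998.01.patch: the class name ContainerRetrySketch, the methods shouldRetry and nextRemainingRetries, the RETRY_FOREVER constant, and the treatment of exit code 0 as success are all assumptions made for this example. Only the policy names and the meaning of -1 come from the comment.

```java
import java.util.Set;

// Hypothetical sketch; names other than the three policy values are illustrative.
final class ContainerRetrySketch {

    // The three policies named in the comment.
    enum ContainerRetryPolicy {
        NEVER_RETRY,
        ALWAYS_RETRY,
        RETRY_ON_SPECIFIC_ERROR_CODE
    }

    // Per the comment, retryTimes == -1 means "retry forever".
    static final int RETRY_FOREVER = -1;

    /**
     * Decides whether a container that exited with exitCode should be relaunched.
     * Assumption: exit code 0 is success and never triggers a retry.
     */
    static boolean shouldRetry(ContainerRetryPolicy policy,
                               Set<Integer> errorCodes,
                               int remainingRetries,
                               int exitCode) {
        if (exitCode == 0) {
            return false; // container did not fail
        }
        boolean retriesLeft =
            remainingRetries == RETRY_FOREVER || remainingRetries > 0;
        if (!retriesLeft) {
            return false; // retry budget exhausted
        }
        switch (policy) {
            case ALWAYS_RETRY:
                return true;
            case RETRY_ON_SPECIFIC_ERROR_CODE:
                return errorCodes.contains(exitCode);
            case NEVER_RETRY:
            default:
                return false;
        }
    }

    /**
     * Returns the remaining retry count after one relaunch. This is the kind of
     * value that would have to be persisted to survive an NM restart.
     */
    static int nextRemainingRetries(int remainingRetries) {
        if (remainingRetries == RETRY_FOREVER) {
            return RETRY_FOREVER; // an unlimited budget is never decremented
        }
        return Math.max(0, remainingRetries - 1);
    }
}
```

In this sketch the caller would wait retryInterval seconds before relaunching whenever shouldRetry returns true, and then store the result of nextRemainingRetries.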
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)