[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105828#comment-14105828 ]
Xuan Gong commented on YARN-611:
--------------------------------

this patch addresses other comments from Steve

> Add an AM retry count reset window to YARN RM
> ---------------------------------------------
>
>                 Key: YARN-611
>                 URL: https://issues.apache.org/jira/browse/YARN-611
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.3-alpha
>            Reporter: Chris Riccomini
>            Assignee: Xuan Gong
>         Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch,
> YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch
>
>
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" AM
> before failing the whole YARN job. YARN counts an AM as failed if the node
> that it was running on dies (the NM will timeout, which counts as a failure
> for the AM), or if the AM dies.
> This configuration is insufficient for long running (or infinitely running)
> YARN jobs, since the machine (or NM) that the AM is running on will
> eventually need to be restarted (or the machine/NM will fail). In such an
> event, the AM has not done anything wrong, but this is counted as a "failure"
> by the RM. Since the retry count for the AM is never reset, eventually, at
> some point, the number of machine/NM failures will result in the AM failure
> count going above the configured value for
> yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the
> job as failed, and shut it down. This behavior is not ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define a window of time that would define when an AM
> is "well behaved", and it's safe to reset its failure count back to zero.
> Every time an AM fails the RmAppImpl would check the last time that the AM
> failed. If the last failure was less than retry-count-window-ms ago, and the
> new failure count is > max-retries, then the job should fail. If the AM has
> never failed, the retry count is < max-retries, or if the last failure was
> OUTSIDE the retry-count-window-ms, then the job should be restarted.
> Additionally, if the last failure was outside the retry-count-window-ms, then
> the failure count should be set back to 0.
> This would give developers a way to have well-behaved AMs run forever, while
> still failing mis-behaving AMs after a short period of time.
> I think the work to be done here is to change the RmAppImpl to actually look
> at app.attempts, and see if there have been more than max-retries failures in
> the last retry-count-window-ms milliseconds. If there have, then the job
> should fail, if not, then the job should go forward. Additionally, we might
> also need to add an endTime in either RMAppAttemptImpl or
> RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the
> failure.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
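
For readers following the thread, the retry-window rule proposed in the issue description can be sketched roughly as below. This is only an illustration of the described behavior, not the code in any of the attached patches or in RMAppImpl. The class name AmRetryWindow, the method onFailure, and the choice to count the current failure as 1 after a reset are assumptions made for the example.

```java
/**
 * Hypothetical sketch of the retry-count reset window described in YARN-611.
 * Not the actual ResourceManager implementation; all names are illustrative.
 */
final class AmRetryWindow {
    private final int maxRetries;     // plays the role of yarn.resourcemanager.am.max-retries
    private final long windowMs;      // plays the role of the proposed retry-count-window-ms
    private int failureCount = 0;
    private long lastFailureMs = -1L; // -1 means the AM has never failed

    AmRetryWindow(int maxRetries, long windowMs) {
        this.maxRetries = maxRetries;
        this.windowMs = windowMs;
    }

    /**
     * Records an AM failure observed at nowMs.
     * Returns true if the job should be restarted, false if it should fail.
     */
    boolean onFailure(long nowMs) {
        // "Outside the window" means the previous failure is at least windowMs old.
        boolean outsideWindow = lastFailureMs >= 0 && nowMs - lastFailureMs >= windowMs;
        if (outsideWindow) {
            // The AM was well behaved for a full window, so forget old failures.
            failureCount = 0;
        }
        // Assumption: the current failure is counted after any reset.
        failureCount++;
        lastFailureMs = nowMs;
        // Fail only once the new failure count exceeds max-retries.
        return failureCount <= maxRetries;
    }

    int failureCount() {
        return failureCount;
    }

    public static void main(String[] args) {
        AmRetryWindow window = new AmRetryWindow(2, 60_000L);
        System.out.println("restart after 1st failure: " + window.onFailure(0L));
        System.out.println("restart after 2nd failure: " + window.onFailure(10_000L));
        System.out.println("restart after 3rd failure: " + window.onFailure(20_000L));
    }
}
```

The description also mentions an alternative of scanning app.attempts for failures inside the last retry-count-window-ms milliseconds. That sliding-window variant would need a per-attempt end time, which is why the description suggests adding endTime to RMAppAttemptImpl or RMAppFailedAttemptEvent. The sketch above only keeps the timestamp of the most recent failure.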