Chris Riccomini created YARN-611:
------------------------------------

             Summary: Add an AM retry count reset window to YARN RM
                 Key: YARN-611
                 URL: https://issues.apache.org/jira/browse/YARN-611
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.0.3-alpha
            Reporter: Chris Riccomini


YARN currently has the following config:

yarn.resourcemanager.am.max-retries

This config defaults to 2, and defines how many times to retry a "failed" AM 
before failing the whole YARN job. YARN counts an AM as failed if the node that 
it was running on dies (the NM will time out, which counts as a failure for the 
AM), or if the AM itself dies.

This configuration is insufficient for long-running (or infinitely running) 
YARN jobs, since the machine (or NM) that the AM is running on will eventually 
need to be restarted (or the machine/NM will fail). In such an event, the AM 
has not done anything wrong, but the RM still counts it as a "failure". 
Since the retry count for the AM is never reset, the number of machine/NM 
failures will eventually push the AM failure count above the configured value 
for yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark 
the job as failed, and shut it down. This is undesirable for a job that is 
otherwise healthy.
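
As a minimal sketch of the problem (a hypothetical model, not the actual RM 
code), assume the RM keeps a single per-application counter and fails the job 
once that counter goes above max-retries. The class and method names below are 
made up for illustration.

```java
// Hypothetical model of the behaviour described above; not the real RM code.
final class NeverResetRetryCounter {
    // Models yarn.resourcemanager.am.max-retries (default 2 per this issue).
    private final int maxRetries;
    // Total AM failures seen over the whole lifetime of the job.
    private int failures = 0;

    NeverResetRetryCounter(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /**
     * Records one AM failure, whatever its cause (AM crash or NM timeout).
     * Returns true if the job should now be marked as failed.
     */
    boolean recordFailure() {
        failures++;
        // Assumption: "going above the configured value" means strictly greater.
        return failures > maxRetries;
    }
}
```

Because the counter carries no notion of time, three NM restarts spread over 
several months have the same effect as three AM crashes within a minute.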

I propose that we add a second configuration:

yarn.resourcemanager.am.retry-count-window-ms

This configuration would define a window of time after which an AM is 
considered "well behaved", and it's safe to reset its failure count back to 
zero. Every time an AM fails, the RMAppImpl would check the last time that the 
AM failed. If the last failure was less than retry-count-window-ms ago, and the 
new failure count is > max-retries, then the job should fail. If the AM has 
never failed, the retry count is < max-retries, or the last failure was OUTSIDE 
the retry-count-window-ms, then the job should be restarted. Additionally, if 
the last failure was outside the retry-count-window-ms, then the failure count 
should be set back to 0.
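
The rule above could be sketched roughly as follows. This is only an 
illustration under stated assumptions, not a proposed patch: the class name 
WindowedRetryPolicy and its methods are hypothetical, the failure that triggers 
a reset is counted as the first failure of the new window, and a count exactly 
equal to max-retries is treated as "restart" (the text above leaves that case 
open).

```java
// Hypothetical sketch of the proposed reset-window rule; names are made up.
final class WindowedRetryPolicy {
    private final int maxRetries;   // models yarn.resourcemanager.am.max-retries
    private final long windowMs;    // models the proposed retry-count-window-ms
    private int failureCount = 0;
    private long lastFailureMs = -1; // -1 means the AM has never failed

    WindowedRetryPolicy(int maxRetries, long windowMs) {
        this.maxRetries = maxRetries;
        this.windowMs = windowMs;
    }

    /**
     * Called each time the AM fails, with the time of the failure.
     * Returns true if the AM should be restarted, false if the job should fail.
     */
    boolean onFailure(long nowMs) {
        boolean hasFailedBefore = lastFailureMs >= 0;
        // "Outside the window" = the previous failure is at least windowMs old.
        boolean outsideWindow = hasFailedBefore && (nowMs - lastFailureMs) >= windowMs;
        if (outsideWindow) {
            failureCount = 0; // the AM was well behaved long enough: start over
        }
        failureCount++;       // assumption: the current failure counts in the new window
        lastFailureMs = nowMs;
        // Assumption: the job fails only when the count goes above max-retries.
        return failureCount <= maxRetries;
    }
}
```

With max-retries = 2 and a 1000 ms window, failures at t = 0, 100 and 200 fail 
the job on the third call, while failures at t = 0, 100 and 5000 do not, because 
the gap before the third failure resets the count.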

This would give developers a way to have well-behaved AMs run forever, while 
still failing misbehaving AMs after a short period of time.

I think the work to be done here is to change the RMAppImpl to actually look at 
app.attempts, and see if there have been more than max-retries failures in the 
last retry-count-window-ms milliseconds. If there have, then the job should 
fail; if not, then the job should go forward. Additionally, we might also need 
to add an endTime to either RMAppAttemptImpl or RMAppFailedAttemptEvent, so 
that the RMAppImpl can check the time of the failure.
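
To make the attempts-based check concrete, here is a rough, self-contained 
sketch. It does not use the real RMAppImpl or RMAppAttemptImpl types: the class 
AttemptWindowCheck, the method shouldFailJob, and the idea of passing the end 
times of failed attempts as a plain list are all assumptions made for 
illustration (the endTime field itself is only proposed above).

```java
import java.util.List;

// Hypothetical helper; the real check would live in the RM's app state machine.
final class AttemptWindowCheck {
    private AttemptWindowCheck() {}

    /**
     * Counts the failed attempts that ended less than windowMs before nowMs and
     * returns true if there are more than maxRetries of them, i.e. the job
     * should fail rather than go forward.
     */
    static boolean shouldFailJob(List<Long> failedAttemptEndTimesMs,
                                 long nowMs, long windowMs, int maxRetries) {
        long cutoffMs = nowMs - windowMs;
        int recentFailures = 0;
        for (long endTimeMs : failedAttemptEndTimesMs) {
            // Only failures strictly inside the window are counted.
            if (endTimeMs > cutoffMs) {
                recentFailures++;
            }
        }
        return recentFailures > maxRetries;
    }
}
```

Unlike a single counter that is reset, this variant needs no extra state beyond 
the per-attempt end time, at the cost of scanning the attempt list on each 
failure.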

Thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
