[jira] [Commented] (MAPREDUCE-5502) History link in resource manager is broken for KILLED jobs

Jason Lowe (JIRA) Wed, 18 Sep 2013 13:46:29 -0700

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771216#comment-13771216 ]


Jason Lowe commented on MAPREDUCE-5502:
---------------------------------------

Hmm, something must be wrong with the mapred client then as it explicitly 
checks with the RM to see if the application is running and if so, tries to 
connect to the AM to kill it.

Looking deeper, it may be this code in YARNRunner.killJob:

{code}
    /* check if the status is not running, if not send kill to RM */
    JobStatus status = clientCache.getClient(arg0).getJobStatus(arg0);
    if (status.getState() != JobStatus.State.RUNNING) {
      try {
        resMgrDelegate.killApplication(TypeConverter.toYarn(arg0).getAppId());
      } catch (YarnException e) {
        throw new IOException(e);
      }
      return;
    }
{code}

So in this scenario the AM has finished the job but has not unregistered yet.  
The AM is telling clients that connect to it that the job status is 
SUCCEEDED/FAILED/KILLED (i.e., not RUNNING but in some terminal state), but 
because it has yet to unregister with the RM, the RM still directs clients to 
the AM when asked.  The client therefore sees a non-RUNNING status and sends 
the kill to the RM.  If the RM kills the app, I think there are not many 
options for getting history consistently, per the discussion above.

We could fix this particular scenario by having YARNRunner not try to kill the 
application if the reported status is already a terminal state.  There's the 
risk of an insane AM that thinks the job is completed and continues to report 
that but refuses to unregister from the RM.  mapred job -kill would then be 
ineffective at killing such an application.  That seems an unlikely scenario 
in practice, and there's always yarn application -kill as a workaround if it 
did happen.
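
A minimal sketch of that decision, not the actual YARNRunner change.  The 
class, the State and Action enums and both method names are hypothetical 
stand-ins for illustration; the only thing taken from the snippet above is 
that any non-RUNNING status currently results in a kill sent to the RM.  
Sending the kill to the AM for a RUNNING job is assumed from the description 
of the client behavior.

```java
import java.util.EnumSet;
import java.util.Set;

public class KillJobSketch {
    /** Hypothetical stand-in for the job states discussed in this comment. */
    enum State { PREP, RUNNING, SUCCEEDED, FAILED, KILLED }

    /** Hypothetical label for what the client does with a kill request. */
    enum Action { KILL_VIA_RM, KILL_VIA_AM, NOTHING }

    /** States in which the job itself is already done. */
    private static final Set<State> TERMINAL =
        EnumSet.of(State.SUCCEEDED, State.FAILED, State.KILLED);

    /** Behavior of the quoted snippet: any non-RUNNING status goes to the RM. */
    static Action currentBehavior(State reported) {
        return reported != State.RUNNING ? Action.KILL_VIA_RM : Action.KILL_VIA_AM;
    }

    /** Proposed behavior: leave the application alone if the job is already terminal. */
    static Action proposedBehavior(State reported) {
        if (TERMINAL.contains(reported)) {
            // The AM is finished and only has to unregister; killing the app
            // through the RM here is what leads to the broken history link.
            return Action.NOTHING;
        }
        return reported == State.RUNNING ? Action.KILL_VIA_AM : Action.KILL_VIA_RM;
    }

    public static void main(String[] args) {
        // Print both decisions side by side for every state.
        for (State s : State.values()) {
            System.out.println(s + ": current=" + currentBehavior(s)
                + ", proposed=" + proposedBehavior(s));
        }
    }
}
```

The two versions differ only for the terminal states, which is also where the 
insane-AM risk comes from: a job that keeps reporting a terminal state is 
never killed by this path.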

MAPREDUCE-5497 probably made the race window for this scenario very small in 
practice, as it no longer waits 5 seconds after the job completes before 
unregistering.
                
> History link in resource manager is broken for KILLED jobs
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-5502
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5502
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.5-alpha
>            Reporter: Vrushali C
>            Assignee: Vrushali C
>              Labels: ui
>
> History link in resource manager is broken for KILLED jobs.
> Seems to happen with jobs with State 'KILLED' and FinalStatus 'KILLED'. If 
> the State is 'FINISHED' and FinalStatus is 'KILLED', then the "History" link 
> is fine.
> It isn't easy to reproduce the problem since the time at which the app is 
> killed determines the state it ends up in, which is hard to guess. These 
> particular jobs seem to get a Diagnostics message of "Application killed by 
> user." whereas the other killed jobs get "Kill Job received from client 
> job_1378766187901_0002
> Job received Kill while in RUNNING state."

