[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788192#comment-13788192
 ] 

Jason Lowe commented on MAPREDUCE-5547:
---------------------------------------

I'm not sure we should try to enforce YARN failure == MR failure because I 
don't think it's completely enforceable.  The output committer is user code 
that can do arbitrary things, including custom job end notification, e.g. 
FileOutputCommitter and the _SUCCESS file.  As such, there will always be cases 
where downstream consumers of the job will think it succeeded and proceed as 
normal despite what the RM says.  In addition, this change creates a couple of 
new problems:

* The app can successfully unregister but fail to copy the history file, so now 
we have a case where the RM says the job succeeded but the history server will 
say ComeBackToMeLater until the client times out.  Would the history server no 
longer have a quick way to say "I definitely don't know about that job"?
* We're starting to pile quite a few things into the grace period, and I'm 
wondering if there will be enough time to get it all done if things aren't all 
working properly.  e.g.: slow network connection when trying to do job end 
notification, slow datanode(s) when copying history file, etc.  Deleting the 
staging directory must be in the grace period to allow reattempts if we crash 
before unregistering, but I'm not sure we need all this other stuff there as 
well.

I want to make sure we're not causing more problems than we're solving.  
Successfully performing job end notification and copying the history file but 
then failing to unregister should be very rare, and even if it occurs it's 
likely that a subsequent attempt will be launched, read the previous history 
file, realize the job succeeded, and unregister successfully.  It's only an 
issue if it also happens to be the last attempt, unless I'm missing something.  
Moving all of the MR-specific job end stuff to after we unregister would make 
faults visible to users more often on average.  Anything that goes wrong during 
the grace period (e.g.: AM failure/crash) will not be reattempted since the RM 
thinks the app is done, whereas in the current setup it would be reattempted if 
there were attempts remaining.  Given that anything in the grace period is very 
fragile, I think we want to put as few things there as possible.
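
To make the reattempt argument concrete, here is a deliberately simplified 
model of the decision described above.  It is not RM code; the class, method 
and parameter names are hypothetical and only encode the rule stated in this 
comment: a crash is retried only if the app has not yet unregistered and 
attempts remain.

```java
// Simplified, hypothetical model of the reattempt rule described above.
// This is NOT the ResourceManager's actual logic or API.
final class ReattemptModel {

    /**
     * @param unregistered true if the AM already unregistered with the RM
     *                     (i.e. the crash happened in the grace period)
     * @param attemptsUsed number of AM attempts launched so far
     * @param maxAttempts  configured maximum number of AM attempts
     * @return true if a crash at this point would lead to a new attempt
     */
    static boolean willReattempt(boolean unregistered, int attemptsUsed, int maxAttempts) {
        // Once the app has unregistered, the RM considers it done,
        // so nothing that fails afterwards is retried.
        if (unregistered) {
            return false;
        }
        // Before unregistering, a crash is retried while attempts remain.
        return attemptsUsed < maxAttempts;
    }
}
```

Under this model, every piece of work moved after the unregister call falls 
into the branch that is never retried, regardless of how many attempts are 
left.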

Since jobs can indicate success to downstream consumers in ways we can't always 
control, I think it would be better to embrace the fact that sometimes YARN 
state != MR state and act accordingly.  I think this only requires one change 
to ClientServiceDelegate, as currently it assumes that a YARN state of FAILED 
means the job failed.  The client should redirect to the history server if the 
app is in any terminal YARN state (i.e.: FINISHED/FAILED/KILLED) and only use 
the YARN state as the job state if the history server doesn't know about the 
job.
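
A minimal sketch of the lookup order proposed above, in case it helps the 
discussion.  This is not the real ClientServiceDelegate; the enums, the 
JobStateResolver class and the Optional-based "history server answer" are all 
hypothetical stand-ins, and the mapping from YARN state to job state is an 
assumption made only for illustration.

```java
import java.util.Optional;

// Hypothetical sketch of the proposed client-side rule.
// None of these types are Hadoop classes.
final class JobStateResolver {

    enum YarnState { RUNNING, FINISHED, FAILED, KILLED }

    enum JobState { RUNNING, SUCCEEDED, FAILED, KILLED }

    /** Terminal YARN states as listed in the comment: FINISHED/FAILED/KILLED. */
    static boolean isTerminal(YarnState state) {
        return state == YarnState.FINISHED
            || state == YarnState.FAILED
            || state == YarnState.KILLED;
    }

    /** Assumed fallback mapping, used only when the history server has no record. */
    static JobState fromYarn(YarnState state) {
        switch (state) {
            case FINISHED: return JobState.SUCCEEDED;
            case FAILED:   return JobState.FAILED;
            case KILLED:   return JobState.KILLED;
            default:       return JobState.RUNNING;
        }
    }

    /**
     * @param yarnState     application state reported by the RM
     * @param historyAnswer job state from the history server, or empty if the
     *                      history server does not know about the job
     */
    static JobState resolve(YarnState yarnState, Optional<JobState> historyAnswer) {
        if (!isTerminal(yarnState)) {
            // A real client would ask the running AM here; the sketch just
            // reports the app as running.
            return fromYarn(yarnState);
        }
        // For any terminal YARN state, prefer what the history server says and
        // fall back to the YARN state only if it has no record of the job.
        return historyAnswer.orElse(fromYarn(yarnState));
    }
}
```

For example, resolve(YarnState.FAILED, Optional.of(JobState.SUCCEEDED)) yields 
SUCCEEDED, which covers the case where the job committed and wrote its history 
file but the AM failed to unregister on its last attempt.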

> Job history should not be flushed to JHS until AM gets unregistered
> -------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5547
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5547
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>




--
This message was sent by Atlassian JIRA
(v6.1#6144)
