[
https://issues.apache.org/jira/browse/MAPREDUCE-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788192#comment-13788192
]
Jason Lowe commented on MAPREDUCE-5547:
---------------------------------------
I'm not sure we should try to enforce YARN failure == MR failure because I
don't think it's completely enforceable. The output committer is user code
that can do arbitrary things, including custom job end notification, e.g.
FileOutputCommitter and the _SUCCESS file. As such, there will always be cases
where downstream consumers of the job think it succeeded and proceed as
normal despite what the RM says. In addition, this change creates a couple of
new problems:
* The app can successfully unregister but fail to copy the history file, so now
we have a case where the RM says the job succeeded but the history server will
say ComeBackToMeLater until the client times out. Would the history server no
longer have a quick way to say "I definitely don't know about that job"?
* We're starting to pile quite a few things into the grace period, and I'm
wondering if there will be enough time to get it all done if things aren't all
working properly, e.g. a slow network connection when trying to do job end
notification, slow datanode(s) when copying the history file, etc. Deleting the
staging directory must be in the grace period to allow reattempts if we crash
before unregistering, but I'm not sure we need all this other stuff there as
well.
I want to make sure we're not causing more problems than we're solving.
Successfully performing job end notification and copying the history file but
then failing to unregister should be very rare, and even if it occurs it's
likely that a subsequent attempt will be launched, read the previous
history file, realize the job succeeded, and unregister successfully. It's
only an issue if it also happens to be the last attempt, unless I'm missing
something. Moving all of the MR-specific job end stuff to after we unregister
would set us up to increase the average fault visibility.
Anything that goes wrong during the grace period (e.g. AM failure/crash) will
not be reattempted since the RM thinks the app is done, whereas it would have
been in the current setup if there were attempts remaining. Given that anything in the
grace period is very fragile, I think we want to put as few things there as
possible.
Since jobs can indicate success to downstream consumers in ways we can't always
control, I think it would be better to embrace the fact that sometimes YARN
state != MR state and act accordingly. I think this only requires one change
to ClientServiceDelegate, as currently it assumes that a YARN state of FAILED
means the job failed. The client should redirect to the history server if the
app is in any terminal YARN state (i.e.: FINISHED/FAILED/KILLED) and only use
the YARN state as the job state if the history server doesn't know about the
job.
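To make the proposed lookup order concrete, here is a minimal sketch. It
assumes hypothetical stand-in types: YarnState, JobState, and the
historyLookup function are illustrative only and are not the real
ClientServiceDelegate or history server API. Mapping FINISHED straight to
SUCCEEDED is also a simplification for the example.

```java
import java.util.Optional;
import java.util.function.Function;

// Illustrative sketch only; none of these types are the real Hadoop classes.
final class JobStateResolver {
    // Hypothetical stand-in for the YARN application state.
    enum YarnState { RUNNING, FINISHED, FAILED, KILLED }

    // Hypothetical stand-in for the MR job state.
    enum JobState { RUNNING, SUCCEEDED, FAILED, KILLED }

    // Terminal YARN states as listed in the comment: FINISHED/FAILED/KILLED.
    static boolean isTerminal(YarnState s) {
        return s == YarnState.FINISHED
            || s == YarnState.FAILED
            || s == YarnState.KILLED;
    }

    // Fallback mapping, used only when the history server has no record.
    // Assumption: FINISHED is treated as SUCCEEDED to keep the sketch small.
    static JobState fromYarn(YarnState s) {
        switch (s) {
            case FINISHED: return JobState.SUCCEEDED;
            case FAILED:   return JobState.FAILED;
            case KILLED:   return JobState.KILLED;
            default:       return JobState.RUNNING;
        }
    }

    // historyLookup is a hypothetical function that returns an empty Optional
    // when the history server does not know about the job.
    static JobState resolve(YarnState yarnState,
                            String jobId,
                            Function<String, Optional<JobState>> historyLookup) {
        if (!isTerminal(yarnState)) {
            // A real client would talk to the AM here; the sketch just maps.
            return fromYarn(yarnState);
        }
        // Any terminal YARN state: prefer what the history server says and
        // fall back to the YARN state only if it has no record of the job.
        return historyLookup.apply(jobId).orElse(fromYarn(yarnState));
    }

    public static void main(String[] args) {
        // YARN says FAILED, but the history file records success.
        System.out.println(resolve(YarnState.FAILED, "job_1",
            id -> Optional.of(JobState.SUCCEEDED)));
        // YARN says FAILED and the history server has no record.
        System.out.println(resolve(YarnState.FAILED, "job_2",
            id -> Optional.empty()));
    }
}
```

The point of the ordering is that a YARN state of FAILED no longer implies
the job failed; it is only the answer of last resort.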
> Job history should not be flushed to JHS until AM gets unregistered
> -------------------------------------------------------------------
>
> Key: MAPREDUCE-5547
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5547
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Reporter: Zhijie Shen
> Assignee: Zhijie Shen
>