[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506328#comment-13506328
 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-4813:
----------------------------------------------------

Some comments on the patch:
 - Similar to JobCommitFailedEvent, add an event class for JOB_COMMIT_COMPLETED.
 - JobImpl.checkJobCompleteSuccess() and the corresponding return variables 
should be renamed to mean checkIfJobReadyForCommit(). Similarly for 
checkJobForCompletion(job).
 - For now, we may just be addressing MAPREDUCE-4815, but the same argument 
of the committer being arbitrary user code is valid for other calls like 
abortJob and setupJob too. We will need states capturing those calls, and need 
to put them on separate threads so that the dispatcher isn't blocked. We can do 
that later, but to be future-proof, let's move the committer thread to a 
top-level service a la TaskCleaner. We may even re-purpose TaskCleanerImpl for 
this. Scope the effort and split it as you see fit.
 - Interrupting and joining the commit thread is only meaningful in the case of 
kill-during-commit, so let's move that code there. Also, earlier we never 
supported kill-during-commit, but now we do, and the patch puts a 60-second 
upper bound on commitJob() before abortJob(). For comparison, 1.* does allow 
kill-during-commit, as commit happens in a separate JVM. So interrupt and join 
seems fine; let's just add a config for the timeout so that we can tweak it if 
there is ever a need.
 - The test looks good. Can you extend it to include kill-during-commit too? 
That will also validate that the dispatcher is no longer blocked by a long 
commit.
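
To make the interrupt-and-join suggestion concrete, here is a rough sketch of 
the shape I have in mind. It is not the patch's code: CommitThreadSketch, the 
Committer interface and cancelTimeoutMs are hypothetical names used only for 
illustration, and a real AM would send events (e.g. JOB_COMMIT_COMPLETED) 
instead of setting a flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: run commitJob() off the dispatcher thread and bound
// the wait on kill-during-commit with a configurable timeout.
class CommitThreadSketch {

    /** Hypothetical stand-in for the user-supplied output committer. */
    interface Committer {
        void commitJob() throws InterruptedException;
        void abortJob();
    }

    private final Committer committer;
    private final long cancelTimeoutMs; // assumed to come from a config value
    private final AtomicBoolean committed = new AtomicBoolean(false);
    private Thread commitThread;

    CommitThreadSketch(Committer committer, long cancelTimeoutMs) {
        this.committer = committer;
        this.cancelTimeoutMs = cancelTimeoutMs;
    }

    /** Starts the commit on its own thread and returns immediately. */
    synchronized void startCommit() {
        commitThread = new Thread(() -> {
            try {
                committer.commitJob();
                committed.set(true); // a real AM would raise an event here
            } catch (InterruptedException e) {
                // Interrupted by killDuringCommit(); nothing more to do.
            }
        }, "commit-thread");
        commitThread.setDaemon(true);
        commitThread.start();
    }

    /**
     * Kill-during-commit: interrupt the commit thread, wait up to
     * cancelTimeoutMs for it to exit, then abort unless the commit finished.
     * Returns true if the commit thread exited within the timeout.
     */
    boolean killDuringCommit() {
        Thread t;
        synchronized (this) {
            t = commitThread;
        }
        if (t != null) {
            t.interrupt();
            try {
                t.join(cancelTimeoutMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve caller's status
            }
        }
        boolean stopped = (t == null) || !t.isAlive();
        if (!committed.get()) {
            committer.abortJob();
        }
        return stopped;
    }

    boolean isCommitted() {
        return committed.get();
    }
}
```

Only the kill path touches interrupt() and join(); the normal completion path 
never waits on the commit thread, so the caller of startCommit() is not held 
up by a slow commit.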
                
> AM timing out during job commit
> -------------------------------
>
>                 Key: MAPREDUCE-4813
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4813.patch, MAPREDUCE-4813.patch
>
>
> The AM calls the output committer's {{commitJob}} method synchronously during 
> JobImpl state transitions, which means the JobImpl write lock is held the 
> entire time the job is being committed.  Holding the write lock prevents the 
> RM allocator thread from heartbeating to the RM.  Therefore if committing the 
> job takes too long (e.g.: the job has tons of files to commit and/or the 
> namenode is bogged down) then the AM appears to be unresponsive to the RM and 
> the RM kills the AM attempt.
