[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489534#comment-13489534
 ] 

Robert Joseph Evans commented on MAPREDUCE-4749:
------------------------------------------------

It has been a while since I worked in this code, so feel free to correct me if 
I am wrong about something.  The changes look good to me, but the TaskTracker 
(TT) is huge and I have not looked at it in that much depth.  Can there be 
multiple kill events for the same task or job?  If so, allCleanupActions could 
be empty while there are still pending events.  I don't think this can happen, 
but I want to be sure about it.

I don't think isJobLocalising throws an InterruptedException, and the javadocs 
for that method are wrong.

My other comment would be about the wait and notify. In this patch you have 
changed the wait to be on the taskCleanupThread itself instead of rjob.  It 
appears that no one will ever notify the taskCleanupThread.  So please either 
change the wait to a sleep, or add a call to taskCleanupThread.notifyAll() at 
about the same place that rjob.notifyAll() is happening.  As part of that, you 
will need to synchronize on the taskCleanupThread before calling notifyAll.  
You will probably also want to synchronize around the wait, but be careful to 
keep the locking order consistent between rjob and taskCleanupThread, or leave 
the two notify/lock pairs separate, which might be simpler.
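
For illustration, a minimal sketch of the wait/notifyAll option described 
above. It is not taken from the patch or from the TaskTracker: the class 
CleanupSignal, its methods, and the "localizing" flag are hypothetical names, 
and a plain lock object stands in for taskCleanupThread. It only shows that 
both wait() and notifyAll() must run while holding the same monitor, and that 
the wait sits in a loop on a condition with a timeout.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the cleanup thread's monitor; not TaskTracker code.
final class CleanupSignal {
    private final Object lock = new Object(); // plays the role of taskCleanupThread
    private boolean localizing = true;        // condition the cleanup thread waits on

    // Called by the thread that finishes localization, i.e. roughly where
    // rjob.notifyAll() would be called in the real code (an assumption).
    void localizationDone() {
        synchronized (lock) {      // notifyAll() requires holding the monitor
            localizing = false;
            lock.notifyAll();
        }
    }

    // Called by the cleanup thread. Returns true if localization finished,
    // false if the timeout elapsed first.
    boolean awaitLocalization(long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (lock) {      // wait() also requires holding the monitor
            while (localizing) {   // loop guards against spurious wakeups
                long remaining = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remaining <= 0) {
                    return false;  // gave up waiting; caller decides what to do
                }
                lock.wait(remaining);
            }
            return true;
        }
    }
}
```

Because this sketch uses a monitor separate from rjob and never holds both at 
once, no lock ordering between the two has to be maintained, which matches the 
"keep the two notify/lock pairs separate" suggestion.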
                
> Killing multiple attempts of a task takes longer as more attempts are killed
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4749
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4749
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Arpit Gupta
>            Assignee: Arpit Gupta
>         Attachments: MAPREDUCE-4749.branch-1.patch, 
> MAPREDUCE-4749.branch-1.patch
>
>
> The following was noticed on a mr job running on hadoop 1.1.0
> 1. Start an mr job with 1 mapper
> 2. Wait for a min
> 3. Kill the first attempt of the mapper and then subsequently kill the other 
> 3 attempts in order to fail the job
> The time taken to kill the task grew exponentially.
> 1st attempt was killed immediately.
> 2nd attempt took a little over a min
> 3rd attempt took approx. 20 mins
> 4th attempt took around 3 hrs.
> The command used to kill the attempt was "hadoop job -fail-task"
> Note that the command returned immediately as soon as the fail attempt was 
> accepted but the time the attempt was actually killed was as stated above.

