[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Gupta updated MAPREDUCE-4749:
-----------------------------------

    Attachment: MAPREDUCE-4749.branch-1.patch

bq. Can there be multiple kill events for the same task or job? If so 
allCleanupActions could be empty when there are still pending events. I don't 
think this can happen, but I want to be sure about it.

allCleanUpActions is populated whenever the user adds an item to the queue, and 
items are only removed when a task/job is killed. So even when you are left 
with just tainted tasks it won't be empty.

And yes, before this patch we could have multiple kill events for the same 
task/job; now we make sure that is not the case.
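
To illustrate the idea of allowing at most one pending kill event per task/job, 
here is a minimal sketch. It is not the code from the patch: the class 
DedupCleanupQueue and its methods addKillEvent and pollKillEvent are 
hypothetical names, and plain strings stand in for task/job IDs.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch only; names and structure are assumptions, not the patch. */
class DedupCleanupQueue {
    // IDs of tasks/jobs that already have a kill event waiting in the queue.
    private final Set<String> pendingIds = new HashSet<>();
    private final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();

    /** Enqueues a kill event unless one is already pending for this ID. */
    synchronized boolean addKillEvent(String id) {
        if (!pendingIds.add(id)) {
            return false; // duplicate: an event for this ID is already queued
        }
        queue.add(id);
        return true;
    }

    /** Returns the next queued ID (or null if none) and clears its pending mark. */
    synchronized String pollKillEvent() {
        String id = queue.poll();
        if (id != null) {
            pendingIds.remove(id);
        }
        return id;
    }

    /** Number of kill events currently queued. */
    synchronized int size() {
        return queue.size();
    }
}
```

With this shape, a second kill request for the same attempt is dropped while 
the first is still queued, and is accepted again once the first has been 
taken off the queue.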

bq. I don't think isJobLocalising throws an InterruptedException. and the 
javadocs for that method are wrong.

Updated the javadoc and removed the throws clause.

bq. My other comment would be about the wait and notify. In this patch you have 
changed the wait to be on the taskCleanupThread itself instead of rjob. It 
appears the no one will ever notify the taskCleanupThread. So please either 
change the wait to a sleep, or add in a call to taskCleanupThread.notifyAll() 
at about the same place that rjob.notifyAll() is happening. As part of that too 
you will need to synchronize with the taskCleanupThread before calling 
notifyAll. You will probably also want to synchronize around the wait, but be 
careful so you get the locking order consistent between rjob and 
taskCleanupThread, or leave the two notify/lock pairs separate which might be 
simpler.

I decided to change the wait to a Thread.sleep in the task clean up and removed 
the rjob.notifyAll.
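
For readers following the wait/notify discussion, the two options can be 
sketched roughly as below. This is not the code from the patch: 
SleepPollingWaiter, MonitorWaiter, and the localizing flag are hypothetical 
stand-ins. The first class polls with Thread.sleep and needs no notify; the 
second shows what wait would require, namely that the waiter and the notifier 
synchronize on the same object.

```java
/** Hypothetical sketch: polling with Thread.sleep, so no monitor or notify is needed. */
class SleepPollingWaiter {
    // Assumed stand-in for "the job is still localizing".
    private volatile boolean localizing = true;

    void finishLocalizing() {
        localizing = false;
    }

    boolean isLocalizing() {
        return localizing;
    }

    /** Sleeps in a loop until the flag clears; returns the number of sleeps. */
    int awaitLocalized(long pollMillis) {
        int sleeps = 0;
        while (localizing) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                break;
            }
            sleeps++;
        }
        return sleeps;
    }
}

/** Hypothetical sketch: wait/notifyAll, both done while holding the same lock. */
class MonitorWaiter {
    private final Object lock = new Object();
    private boolean localizing = true; // guarded by lock

    void finishLocalizing() {
        synchronized (lock) {
            localizing = false;
            lock.notifyAll(); // wakes threads blocked in lock.wait()
        }
    }

    void awaitLocalized() {
        synchronized (lock) {
            while (localizing) { // loop guards against spurious wakeups
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }
}
```

The sleep version trades some wake-up latency (up to one poll interval) for 
not having to reason about lock ordering between two monitors.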


Also, I had to restructure the code a bit so that I could write unit tests to 
cover various scenarios.
                
> Killing multiple attempts of a task takes longer as more attempts are killed
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4749
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4749
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Arpit Gupta
>            Assignee: Arpit Gupta
>         Attachments: MAPREDUCE-4749.branch-1.patch, 
> MAPREDUCE-4749.branch-1.patch, MAPREDUCE-4749.branch-1.patch
>
>
> The following was noticed on an MR job running on Hadoop 1.1.0:
> 1. Start an MR job with 1 mapper
> 2. Wait for a minute
> 3. Kill the first attempt of the mapper and then subsequently kill the other 
> 3 attempts in order to fail the job
> The time taken to kill the task grew exponentially.
> 1st attempt was killed immediately.
> 2nd attempt took a little over a min
> 3rd attempt took approx. 20 mins
> 4th attempt took around 3 hrs.
> The command used to kill the attempt was "hadoop job -fail-task"
> Note that the command returned immediately as soon as the fail attempt was 
> accepted but the time the attempt was actually killed was as stated above.

