[ https://issues.apache.org/jira/browse/HADOOP-2639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12562818#action_12562818 ]

Devaraj Das commented on HADOOP-2639:
-------------------------------------

I think I am missing something - aren't runningMapTasks/runningReduceTasks 
supposed to include speculative tasks as well? This patch increments those values 
for speculative tasks too. If the values are not supposed to be incremented for 
new speculative tasks, then we should also fix the places where they are 
decremented, because the decrement happens even for killed tasks (and that fix is 
more involved). Consider this scenario, in time order, with just one map task in 
the job (in the current codebase, without the patch); a sketch of the accounting 
follows the list:
1) obtainNewMapTask returns a task - runningMapTasks becomes 1
2) obtainNewMapTask gets called again - this time a speculative task is 
returned, and runningMapTasks is not incremented since _wasRunning_ is true
3) The first task completes - runningMapTasks becomes 0 (in _completedTask_)
4) The speculative task is killed - runningMapTasks becomes -1 (in 
_failedTask_) --- this is *bad!*
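
A minimal, hypothetical sketch of this accounting hazard - the class and method 
names below are simplified stand-ins mirroring the discussion, not the actual 
JobInProgress code:

{code}
// Hypothetical, simplified accounting to illustrate the four steps above;
// stand-in names, not the actual JobInProgress code.
class TaskAccounting {
    int runningMapTasks = 0;

    // Steps 1 and 2: a new attempt is handed out; speculative attempts
    // (wasRunning == true) skip the increment.
    void obtainNewMapTask(boolean wasRunning) {
        if (!wasRunning) {
            runningMapTasks++;
        }
    }

    // Step 3: the first attempt succeeds (completedTask path).
    void completedTask() {
        runningMapTasks--;
    }

    // Step 4: the speculative attempt is killed (failedTask path).
    void failedTask() {
        runningMapTasks--;
    }

    public static void main(String[] args) {
        TaskAccounting a = new TaskAccounting();
        a.obtainNewMapTask(false);  // step 1: runningMapTasks == 1
        a.obtainNewMapTask(true);   // step 2: speculative, no increment
        a.completedTask();          // step 3: runningMapTasks == 0
        a.failedTask();             // step 4: runningMapTasks == -1
        System.out.println("runningMapTasks = " + a.runningMapTasks);
    }
}
{code}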

I don't see a good reason why runningMapTasks/runningReduceTasks should not count 
speculative tasks. I did a code analysis and couldn't find any place where 
counting them would affect the existing semantics.

[A comment on the patch - it should check whether the _task_ returned by 
_getTaskToRun_ is non-null before incrementing 
runningMapTasks/runningReduceTasks.]
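
A hedged sketch of that check - the names below are illustrative stand-ins, not 
the patch's actual methods or types:

{code}
// Illustrative only: increment the running-task counter only when a task
// was actually handed out by getTaskToRun().
class MapScheduler {
    int runningMapTasks = 0;

    // Stand-in for getTaskToRun(); may legitimately return null
    // when there is nothing to schedule.
    String getTaskToRun(boolean hasWork) {
        return hasWork ? "attempt_200801_0001_m_000000_0" : null;
    }

    String obtainNewMapTask(boolean hasWork) {
        String task = getTaskToRun(hasWork);
        if (task != null) {
            runningMapTasks++;   // count only real assignments
        }
        return task;             // caller sees null when nothing was scheduled
    }
}
{code}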

> Reducers stuck in shuffle
> -------------------------
>
>                 Key: HADOOP-2639
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2639
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amareshwari Sri Ramadasu
>            Assignee: Amar Kamat
>            Priority: Blocker
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-2639.patch
>
>
> I started the sort benchmark on 500 nodes. It has 40000 maps and 900 reducers.
> There are 11 reducers stuck in shuffle at 33% progress. I could see that one 
> node, which had run 80 maps, was down, and all these reducers are trying to 
> fetch map output from that node. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
