[ 
https://issues.apache.org/jira/browse/HADOOP-2639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amar Kamat updated HADOOP-2639:
-------------------------------

    Attachment: HADOOP-2639.patch

Finally found the bug that causes this behaviour. HADOOP-2247 introduced a new 
strategy for killing maps, i.e. kill the map if 
{{(fetch-failure-notification/num-running-reduce-tasks) > 0.5}}. It seems that 
{{num-running-reduce-tasks}} can become negative, which breaks the strategy 
and stalls the whole job because the maps are never killed. This happens 
because the reduce count is incremented only if the TIP is not already running, 
but decremented for every task in the TIP. The attached patch addresses this 
by incrementing the counter for every task that gets scheduled. 
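
To illustrate the counter drift described above, here is a minimal standalone 
sketch. It is not the code from the patch or from JobInProgress. The class, 
method and parameter names are all hypothetical, and it only assumes the rule 
quoted above (kill the map if the ratio exceeds 0.5).

```java
// Hypothetical sketch of the accounting problem; not actual Hadoop code.
class ReduceCounterSketch {
    // Threshold quoted in the comment above (from HADOOP-2247).
    static final double MAX_FAILURE_FRACTION = 0.5;

    // Kill rule: (fetch-failure-notification / num-running-reduce-tasks) > 0.5
    static boolean shouldKillMap(int fetchFailureNotifications, int runningReduces) {
        return ((double) fetchFailureNotifications / runningReduces) > MAX_FAILURE_FRACTION;
    }

    // Simulates one reduce TIP with several task attempts.
    // incrementPerAttempt == false: increment only when the TIP is not yet
    //   running (the behaviour described as buggy).
    // incrementPerAttempt == true: increment for every scheduled attempt
    //   (the behaviour the patch is described as introducing).
    // The counter is always decremented once per finished attempt.
    static int runningReduces(int attemptsScheduled, int attemptsFinished,
                              boolean incrementPerAttempt) {
        int running = 0;
        boolean tipRunning = false;
        for (int i = 0; i < attemptsScheduled; i++) {
            if (incrementPerAttempt || !tipRunning) {
                running++;
            }
            tipRunning = true;
        }
        running -= attemptsFinished;
        return running;
    }

    public static void main(String[] args) {
        // One TIP, three attempts scheduled, two finished, one still running.
        int buggy = runningReduces(3, 2, false);
        int fixed = runningReduces(3, 2, true);
        System.out.println("buggy count = " + buggy);  // -1
        System.out.println("fixed count = " + fixed);  // 1
        // With a negative denominator the ratio is negative, so the map
        // is never killed no matter how many failures are reported.
        System.out.println("kill (buggy) = " + shouldKillMap(1, buggy));  // false
        System.out.println("kill (fixed) = " + shouldKillMap(1, fixed));  // true
    }
}
```

With the per-TIP increment the counter ends at -1 even though one reduce 
attempt is still running, so the ratio can never exceed 0.5. Incrementing 
once per scheduled attempt keeps the counter equal to the number of attempts 
actually running.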

> Reducers stuck in shuffle
> -------------------------
>
>                 Key: HADOOP-2639
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2639
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amareshwari Sri Ramadasu
>            Assignee: Amar Kamat
>            Priority: Blocker
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-2639.patch
>
>
> I started sort benchmark on 500 nodes. It has 40000 maps and 900 reducers.
> There are 11 reducers stuck in shuffle with 33% progress. I could see a node 
> down which ran 80 maps on it. And all these reducers are trying to fetch map 
> output from that node. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
