[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13512121#comment-13512121
 ] 

Jason Lowe commented on MAPREDUCE-4842:
---------------------------------------

Unfortunately no, I don't have an easy repro case.  This is something I noticed 
happen to a job someone was running on one of our clusters.  It's a race 
condition between fetchers and merging, and I'm not sure it would reproduce 
easily even with the same cluster config and job.

We ran this patch through gridmix, and there are some indications that it may 
negatively affect shuffle/merge performance for reducers.  I'm not sure why 
yet, as I haven't had time to investigate.  Perhaps because this patch checks 
whether to start a merge more often, we start merges too early and create more 
work than if we had waited for a fetcher to commit first.  One idea I wanted 
to try is to change the patch so that a merge is triggered after a merge 
completes only if we're convinced there are no outstanding fetchers that would 
trigger it later (e.g., only trigger if the merge conditions are met and 
commitMemory == usedMemory, IIRC).
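
The proposed check can be illustrated with a toy model in Java.  This is only a 
sketch under stated assumptions, not the actual MergeManager code or the 
attached patch: the class, its methods, and the mergeThreshold parameter are 
hypothetical.  Only the names commitMemory and usedMemory come from the comment 
above, and their bookkeeping here is a guess at the intent (usedMemory counts 
bytes reserved by fetchers, commitMemory counts bytes fully fetched).

```java
// Toy model only. Not the real Hadoop MergeManager; all names except
// commitMemory and usedMemory are hypothetical.
final class MergeTriggerSketch {
    private final long mergeThreshold; // assumed: committed bytes that justify a merge
    private long usedMemory;           // bytes reserved by fetchers, committed or in flight
    private long commitMemory;         // bytes of map output fully fetched and committed

    MergeTriggerSketch(long mergeThreshold) {
        this.mergeThreshold = mergeThreshold;
    }

    /** A fetcher reserves space before it starts copying a map output. */
    synchronized void reserve(long bytes) {
        usedMemory += bytes;
    }

    /** A fetcher finished copying; its reservation is now committed. */
    synchronized void commit(long bytes) {
        commitMemory += bytes;
    }

    /** A merge finished and released the bytes it merged. */
    synchronized void mergeFinished(long mergedBytes) {
        usedMemory -= mergedBytes;
        commitMemory -= mergedBytes;
    }

    /** Assumed merge condition: enough committed data has accumulated. */
    synchronized boolean mergeConditionsMet() {
        return commitMemory >= mergeThreshold;
    }

    /** If some reserved bytes are not yet committed, a fetcher is still copying. */
    synchronized boolean hasOutstandingFetchers() {
        return commitMemory != usedMemory;
    }

    /**
     * Decision taken right after a merge completes: start another merge only
     * if the conditions hold and no in-flight fetcher will trigger it later.
     */
    synchronized boolean shouldTriggerAfterMerge() {
        return mergeConditionsMet() && !hasOutstandingFetchers();
    }
}
```

In this model, an in-flight fetcher (usedMemory > commitMemory) suppresses the 
post-merge trigger, on the assumption that the fetcher's own commit will start 
the merge.  When nothing is in flight, the post-merge check is the only thing 
left that can start the merge, so it fires.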

                
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, 
> MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  
> It looked similar to the problem described in MAPREDUCE-3721, where the 
> fetchers were all being told to WAIT by the MergeManager but no merge was 
> taking place.
