[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13512121#comment-13512121 ]
Jason Lowe commented on MAPREDUCE-4842:
---------------------------------------

Unfortunately no, I don't have an easy repro case. This is something I noticed happened to a job someone was running on one of our clusters. It's a race condition between fetchers and merging, and I'm not sure even with the same cluster config and job it will easily reproduce.

We ran this patch through gridmix, and there are some indications it may negatively affect the performance of shuffle/merge for reducers. Not quite sure why, yet, as I haven't had time to investigate. Maybe since this patch checks for starting merges more often we end up starting merges too early and end up creating more work than if we wait for a fetcher to commit first?

One idea I wanted to try is to change the patch to only trigger a merge after a merge completes if we're convinced there are no outstanding fetchers that would trigger it later (e.g.: only trigger if merge conditions are met and commitMemory == usedMemory, IIRC).

> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842.patch, MAPREDUCE-4842.patch,
>                      MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.
> It looked similar to the problem described in MAPREDUCE-3721, where the
> fetchers were all being told to WAIT by the MergeManager but no merge was
> taking place.
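
For readers following the thread, the idea floated in the comment (re-check for a merge when a merge completes, but only if merge conditions are met and commitMemory == usedMemory) might look roughly like the toy model below. This is a sketch under stated assumptions, not the actual MergeManager code or the attached patch: the class name, `mergeThreshold`, `bytesBeingMerged`, and all method names are hypothetical, and only `commitMemory` and `usedMemory` come from the comment.

```java
// Toy model of the accounting discussed in the comment. NOT Hadoop code:
// everything except the names commitMemory and usedMemory is hypothetical.
class MergeTriggerSketch {
    private final long mergeThreshold; // assumed: committed bytes needed to start a merge
    private long usedMemory;           // bytes reserved by fetchers, committed or not
    private long commitMemory;         // bytes committed by fetchers, not yet handed to a merge
    private long bytesBeingMerged;     // bytes owned by the merge currently running
    private boolean mergeInProgress;

    MergeTriggerSketch(long mergeThreshold) {
        this.mergeThreshold = mergeThreshold;
    }

    // A fetcher reserves memory before copying a map output.
    synchronized void reserve(long bytes) {
        usedMemory += bytes;
    }

    // A fetcher finishes copying and commits; returns true if this started a merge.
    synchronized boolean commit(long bytes) {
        commitMemory += bytes;
        return maybeStartMerge();
    }

    // Starts a merge only if none is running and enough bytes are committed.
    private boolean maybeStartMerge() {
        if (mergeInProgress || commitMemory < mergeThreshold) {
            return false;
        }
        mergeInProgress = true;
        bytesBeingMerged = commitMemory; // the merge takes everything committed so far
        commitMemory = 0;
        return true;
    }

    // Called when the running merge finishes; returns true if a follow-up merge started.
    synchronized boolean onMergeComplete() {
        mergeInProgress = false;
        usedMemory -= bytesBeingMerged; // merged outputs release their memory
        bytesBeingMerged = 0;
        // If every reserved byte is also committed, no fetcher is left to
        // trigger the next merge from commit(), so check for one here.
        // Otherwise, leave it to the outstanding fetcher's commit().
        if (commitMemory == usedMemory) {
            return maybeStartMerge();
        }
        return false;
    }
}
```

In this model, if all fetchers commit while a merge is running, `onMergeComplete()` starts the follow-up merge itself. If one fetcher still holds uncommitted memory, `onMergeComplete()` does nothing and that fetcher's later `commit()` starts the merge, which is the "wait for a fetcher to commit first" behavior the comment speculates may be cheaper.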