[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mariappan Asokan updated MAPREDUCE-4842:
----------------------------------------

    Attachment: mapreduce-4842.patch

Hi Jason,

I have uploaded the patch, with the caveat that it has not been stress tested :) You stated the following:

{quote}
We ran this patch through gridmix, and there are some indications it may negatively affect the performance of shuffle/merge for reducers. Not quite sure why, yet, as I haven't had time to investigate. Maybe since this patch checks for starting merges more often we end up starting merges too early and end up creating more work than if we wait for a fetcher to commit first?
{quote}

# Did you look at the log files to see the messages logged from the {{startMerge()}} method in {{MergeThread}}? It tries to merge at most {{mergeFactor}} map outputs at a time. Since you are guessing that "we end up starting merges too early," do you see any differences in the messages with and without your patch?
# This is tangential to point 1. The {{mergeFactor}} is set to the configured value for {{IntermediateMemoryToMemoryMerger}} but to {{Integer.MAX_VALUE}} for {{InMemoryMerger}} and {{OnDiskMerger}}. We have to find out the rationale behind these choices.
# You are right that my patch does not change the logic for when to start a merge. Let us compare the logs (with and without the patches) and draw conclusions from there.

Thanks for sharing the information.

-- Asokan

> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: mapreduce-4842.patch, MAPREDUCE-4842.patch,
> MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.
> It looked similar to the problem described in MAPREDUCE-3721, where the
> fetchers were all being told to WAIT by the MergeManager but no merge was
> taking place.
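
For reference, the batching behavior described in point 1 of the comment (at most {{mergeFactor}} map outputs per merge) can be sketched roughly as follows. This is a simplified, standalone illustration and not the actual {{MergeThread}} code: the class {{MergeBatchSketch}} and the method {{takeBatch}} are hypothetical names, and the printed messages are made up for the example rather than real Hadoop log output.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the batching step; not the actual Hadoop class.
final class MergeBatchSketch {

    private MergeBatchSketch() {}

    /**
     * Removes up to mergeFactor elements from inputs and returns them as one batch.
     * Anything beyond mergeFactor stays in inputs for a later merge.
     */
    static <T> List<T> takeBatch(Set<T> inputs, int mergeFactor) {
        List<T> batch = new ArrayList<>();
        Iterator<T> it = inputs.iterator();
        for (int taken = 0; it.hasNext() && taken < mergeFactor; taken++) {
            batch.add(it.next());
            it.remove();
        }
        return batch;
    }

    public static void main(String[] args) {
        Set<String> pending = new LinkedHashSet<>();
        for (int i = 0; i < 5; i++) {
            pending.add("map_" + i);
        }

        // With a small, configured merge factor only part of the pending set is taken.
        List<String> limited = takeBatch(pending, 3);
        System.out.println("merging " + limited.size() + ", ignoring " + pending.size());

        // With Integer.MAX_VALUE as the merge factor everything pending goes into one merge.
        List<String> all = takeBatch(pending, Integer.MAX_VALUE);
        System.out.println("merging " + all.size() + ", ignoring " + pending.size());
    }
}
```

Under these assumptions, a merger whose factor is {{Integer.MAX_VALUE}} never leaves inputs behind, so the "merging" and "ignoring" counts in messages of this kind would only differ for a merger that uses the configured value. That is the sort of difference worth looking for when comparing the logs with and without the patch.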