[ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4842:
----------------------------------

    Attachment: MAPREDUCE-4842-2.patch

In the interest of trying to push this forward faster, here's another version of Asokan's patch with the unit test from the original patch added. I also implemented the removeFirst() instead of getFirst() change, and I fixed one more issue.

The last patch had a race regarding inProgress where startMerge() could set it to true, but a merge could be completing simultaneously and smash it back to false. Then we'd run a merge without having inProgress as true during the merge, which is Not Good when it comes to getting the fetchers to try to wait when they should.

This patch does not implement the pipelining idea yet since the performance tests indicate that it might not be necessary to achieve equivalent performance. Implementing it should be fairly straightforward. For example, we could add a volatile mergeCount field that is incremented when merges complete. waitForMerge() would cache the value in a local on entry and return when either inProgress is false or mergeCount has changed (i.e., we are waiting for any active merge to complete, not all active merges).

> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842-2.patch, mapreduce-4842.patch,
> mapreduce-4842.patch, MAPREDUCE-4842.patch, MAPREDUCE-4842.patch,
> MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.
> It looked similar to the problem described in MAPREDUCE-3721, where the
> fetchers were all being told to WAIT by the MergeManager but no merge was
> taking place.
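
For reference, the mergeCount idea described in the comment above could look roughly like the following sketch. It is not taken from any of the attached patches. The class name MergeWaitSketch, the activeMerges counter (which stands in for the inProgress flag so that one merge finishing cannot clear the state of another), and the isInProgress() accessor are all assumptions made for illustration. The sketch also uses synchronized methods with wait()/notifyAll() rather than a volatile field, which is a simplification of what the comment proposes.

```java
// Hypothetical sketch, not the MergeManager code from the patch.
final class MergeWaitSketch {
    // Assumption: a counter replaces the boolean inProgress flag.
    private int activeMerges = 0;
    // Incremented each time a merge completes.
    private long mergeCount = 0;

    // Called when a merge begins.
    synchronized void startMerge() {
        activeMerges++;
    }

    // Called when a merge finishes; wakes every waiting fetcher.
    synchronized void mergeComplete() {
        activeMerges--;
        mergeCount++;
        notifyAll();
    }

    synchronized boolean isInProgress() {
        return activeMerges > 0;
    }

    // Returns once no merge is active or any one merge has completed
    // since this call began (any active merge, not all of them).
    synchronized void waitForMerge() throws InterruptedException {
        final long seenOnEntry = mergeCount;
        while (activeMerges > 0 && mergeCount == seenOnEntry) {
            wait();
        }
    }
}
```

With two merges started, a fetcher blocked in waitForMerge() is released as soon as the first one calls mergeComplete(), even though isInProgress() still reports true for the second.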