[ https://issues.apache.org/jira/browse/MAPREDUCE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siddharth Seth updated MAPREDUCE-3721:
--------------------------------------

    Attachment: MR-3721.txt

The patch adds another variable, commitMemory (the size of fetches that have completed). A merge is triggered only if this size exceeds mergeThreshold. It also adds a check to ensure mergeThreshold is greater than maxSingleShuffleLimit.

Earlier, usedMemory (reserved memory) was used for this computation, which meant a single segment well below mergeThreshold could lead to a merge to disk.

I have run several gridmix runs with the patch applied, without a hang. I am not including a unit test: writing one would likely require changing much more of the shuffle code to recreate the scenario.

> Race in shuffle can cause it to hang
> ------------------------------------
>
>                 Key: MAPREDUCE-3721
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3721
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR-3721.txt
>
>
> If all current {{Fetcher}}s complete while an in-memory merge is in progress
> - shuffle could hang.
> Specifically - if the memory freed by an in-memory merge does not bring
> {{MergeManager.usedMemory}} below {{MergeManager.memoryLimit}} and all
> current Fetchers complete before the in-memory merge completes, another
> in-memory merge will not be triggered - and shuffle will hang. (All new
> fetchers are asked to WAIT).
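
For readers following the thread, the trigger change described in the patch comment can be sketched roughly as below. This is a simplified illustration and not the code in MR-3721.txt: the class MergeTriggerSketch and all of its methods are hypothetical, and the accounting is reduced to plain counters. Only the names usedMemory, commitMemory, memoryLimit, mergeThreshold and maxSingleShuffleLimit come from the comment and the issue description.

```java
// Hypothetical sketch of the accounting described in the patch comment.
// Not the actual MergeManager code; sizes are arbitrary units.
class MergeTriggerSketch {
    private final long memoryLimit;
    private final long mergeThreshold;
    private long usedMemory;   // reserved by fetchers, fetch may still be in flight
    private long commitMemory; // fetches that have completed

    MergeTriggerSketch(long memoryLimit, long mergeThreshold, long maxSingleShuffleLimit) {
        // The check mentioned in the comment: the threshold must exceed
        // the largest segment that may be held in memory.
        if (mergeThreshold <= maxSingleShuffleLimit) {
            throw new IllegalArgumentException(
                "mergeThreshold must be greater than maxSingleShuffleLimit");
        }
        this.memoryLimit = memoryLimit;
        this.mergeThreshold = mergeThreshold;
    }

    // Returns false when the fetcher should WAIT (assumed simplification).
    synchronized boolean reserve(long size) {
        if (usedMemory > memoryLimit) {
            return false;
        }
        usedMemory += size;
        return true;
    }

    // Called when a fetch completes; returns true if a merge should start.
    synchronized boolean commit(long size) {
        commitMemory += size;
        return commitMemory >= mergeThreshold;
    }

    // The earlier trigger, based on reserved memory, kept for comparison.
    synchronized boolean wouldTriggerOnReserved() {
        return usedMemory >= mergeThreshold;
    }

    // Hands the committed segments to a merge and returns their total size.
    synchronized long startMerge() {
        long merging = commitMemory;
        commitMemory = 0;
        return merging;
    }

    // Releases the memory held by the merged segments.
    synchronized void finishMerge(long mergedSize) {
        usedMemory -= mergedSize;
    }

    public static void main(String[] args) {
        MergeTriggerSketch m = new MergeTriggerSketch(100, 60, 25);
        m.reserve(20);
        m.reserve(20);
        m.reserve(20);
        // Three reservations, but only one fetch has completed so far.
        System.out.println("reserved-based trigger: " + m.wouldTriggerOnReserved());
        System.out.println("commit-based trigger: " + m.commit(20));
    }
}
```

In this sketch, three reservations of 20 units reach the threshold of 60 while only one segment has actually arrived. The reserved-based check would start a merge at that point with a single small segment, whereas the commit-based check waits until all 60 units have been fetched.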