[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537096#comment-13537096
 ] 

Jason Lowe commented on MAPREDUCE-4842:
---------------------------------------

Regarding the removal of the element in the finally block, I'm not sure why 
we're waiting until after merging to remove the element from the list.  The 
list is private, so nobody should be trying to examine or walk it mid-merge, 
and it seems much simpler to dequeue the element before processing it rather 
than waiting until the end.  Basically pendingToBeMerged.getFirst() becomes 
pendingToBeMerged.removeFirst(), and we don't need to remember to remove the 
element in the finally block.
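
To make the suggestion concrete, here is a minimal sketch of dequeuing before 
processing.  It is not the actual MergeManager/MergeThread code: the class 
name, the String element type, mergeNext(), merge() and the accessors are all 
hypothetical stand-ins; only pendingToBeMerged, getFirst() and removeFirst() 
come from the discussion above.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical stand-in for the merger thread; not the real Hadoop class.
public class DequeueBeforeMergeSketch {
    // Private queue of merge requests, guarded by its own monitor.
    private final LinkedList<List<String>> pendingToBeMerged = new LinkedList<>();
    private final List<String> mergedOutput = new ArrayList<>();

    public void startMerge(List<String> inputs) {
        synchronized (pendingToBeMerged) {
            pendingToBeMerged.addLast(inputs);
            pendingToBeMerged.notifyAll();
        }
    }

    // Processes one queued request; returns false if nothing was pending.
    public boolean mergeNext() {
        List<String> inputs;
        synchronized (pendingToBeMerged) {
            if (pendingToBeMerged.isEmpty()) {
                return false;
            }
            // removeFirst() instead of getFirst(): the entry leaves the
            // queue before the merge starts, so no cleanup is needed later.
            inputs = pendingToBeMerged.removeFirst();
        }
        merge(inputs); // even if this throws, the queue is already consistent
        return true;
    }

    // Assumed placeholder for the real merge work.
    private void merge(List<String> inputs) {
        if (inputs.isEmpty()) {
            throw new IllegalStateException("nothing to merge");
        }
        mergedOutput.addAll(inputs);
    }

    public int pendingCount() {
        synchronized (pendingToBeMerged) {
            return pendingToBeMerged.size();
        }
    }

    public List<String> mergedOutput() {
        return mergedOutput;
    }
}
```

In this sketch a merge that throws still leaves the queue without the failed 
entry, which is the behavior the finally-block removal was providing.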

Speaking of the finally block, I'm also curious whether we really want to 
notify others of the merge completing only if there are no further merges 
pending.  Arguably we should wake them up as soon as any merge completes, as 
the code did previously, because usedMemory should have been lowered during 
the merge, which would allow more shuffle data to be fetched into memory.  
Waiting until there are no more merges pending means we can't pipeline the 
shuffle data fetch with ongoing merges when all the fetchers are waiting for 
a merge to free memory.  Waking up waiters on any merge completion also means 
we don't need to lock pendingToBeMerged at all in the finally block (once we 
also make the change suggested above), and the finally block reverts to what 
it was originally.
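
As a rough illustration of waking waiters on every merge completion, the 
sketch below models the memory accounting with a single monitor.  Everything 
except the usedMemory name is an assumption made for illustration 
(memoryLimit, pendingMerges, tryReserve(), waitForMemory(), unreserve(), 
mergeQueued(), mergeFinished()); it is not the MergeManager API.

```java
// Hypothetical model of fetchers waiting on merge completion; not Hadoop code.
public class NotifyOnEveryMergeSketch {
    private final long memoryLimit;
    private long usedMemory = 0;
    private int pendingMerges = 0;

    public NotifyOnEveryMergeSketch(long memoryLimit) {
        this.memoryLimit = memoryLimit;
    }

    // Fetcher side: returns false when the caller would be told to WAIT.
    public synchronized boolean tryReserve(long bytes) {
        if (usedMemory + bytes > memoryLimit) {
            return false;
        }
        usedMemory += bytes;
        return true;
    }

    // Fetcher side: blocks until a merge completion frees enough memory.
    public synchronized void waitForMemory(long bytes) throws InterruptedException {
        while (usedMemory + bytes > memoryLimit) {
            wait();
        }
        usedMemory += bytes;
    }

    public synchronized void mergeQueued() {
        pendingMerges++;
    }

    // Called while a merge runs, as it releases in-memory map outputs.
    public synchronized void unreserve(long bytes) {
        usedMemory -= bytes;
    }

    // Called from the merger's finally block in this sketch.
    public synchronized void mergeFinished() {
        pendingMerges--;
        // Unconditional wake-up: not gated on pendingMerges == 0, so
        // fetchers can resume while later merges are still queued.
        notifyAll();
    }

    public synchronized long usedMemory() {
        return usedMemory;
    }

    public synchronized int pendingMerges() {
        return pendingMerges;
    }
}
```

With a limit of 100 and two merges queued, a fetcher that was refused 30 
bytes can reserve them as soon as the first merge frees 50, even though 
pendingMerges is still 1.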

                
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: mapreduce-4842.patch, mapreduce-4842.patch, 
> MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, 
> MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  
> It looked similar to the problem described in MAPREDUCE-3721, where the 
> fetchers were all being told to WAIT by the MergeManager but no merge was 
> taking place.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
