[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4842:
----------------------------------

    Attachment: MAPREDUCE-4842-2.patch

In the interest of pushing this forward faster, here's another version 
of Asokan's patch with the unit test from the original patch added.  I also 
implemented the change to use removeFirst() instead of getFirst(), and I fixed 
one more issue.  The last patch had a race on inProgress: startMerge() 
could set it to true while a merge that was completing at the same time 
smashed it back to false.  We'd then run a merge with inProgress false for 
its whole duration, which is Not Good because the fetchers would not wait 
when they should.
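
To make the race concrete, here is a minimal sketch of one way to keep a completing merge from clearing state that a concurrent startMerge() just set. It is an illustration only and not necessarily how the attached patch fixes it: the class name PendingMerges, the methods mergeFinished() and inProgress(), and the choice of a counter instead of the boolean inProgress field are all assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch: track merges with a counter rather than a boolean,
 * so one merge finishing cannot hide another that was just requested.
 * Names here are illustrative and are not taken from the patch.
 */
class PendingMerges {
    // Number of merges requested but not yet finished.
    private final AtomicInteger numPending = new AtomicInteger(0);

    /** Called when a merge is requested. */
    void startMerge() {
        numPending.incrementAndGet();
    }

    /** Called when a merge finishes. */
    void mergeFinished() {
        numPending.decrementAndGet();
    }

    /** True while at least one merge is requested or running. */
    boolean inProgress() {
        return numPending.get() > 0;
    }
}
```

With a plain boolean, the sequence "merge A running, startMerge() for B, A completes" can leave the flag false while B runs. With the counter, the same sequence leaves one merge pending, so fetchers would still see a merge in progress.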

This patch does not implement the pipelining idea yet, since the performance 
tests indicate that it might not be necessary to achieve equivalent 
performance.  Implementing it should be fairly straightforward.  For example, 
we could add a volatile mergeCount field that is incremented whenever a merge 
completes.  waitForMerge() would cache the value in a local on entry and return 
when either inProgress is false or mergeCount has changed (i.e., we are waiting 
for any one active merge to complete, not all active merges).
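
A rough sketch of that mergeCount idea follows. Only inProgress, mergeCount, startMerge() and waitForMerge() come from the description above; the class name MergeWaiter, the mergeCompleted() and mustWait() methods, and the othersStillActive parameter are assumptions made for illustration, and this is not code from the patch.

```java
/**
 * Hypothetical sketch of the mergeCount approach: a waiter returns as soon
 * as any one merge completes, instead of waiting for all active merges.
 */
class MergeWaiter {
    private boolean inProgress = false;   // guarded by "this"
    private volatile int mergeCount = 0;  // bumped each time a merge completes

    synchronized void startMerge() {
        inProgress = true;
    }

    /**
     * Called by a merge when it finishes. The caller says whether other
     * merges are still active (an assumption of this sketch).
     */
    synchronized void mergeCompleted(boolean othersStillActive) {
        mergeCount++;
        inProgress = othersStillActive;
        notifyAll(); // wake any fetchers blocked in waitForMerge()
    }

    int getMergeCount() {
        return mergeCount;
    }

    /** True while a merge is active and none has completed since countOnEntry. */
    synchronized boolean mustWait(int countOnEntry) {
        return inProgress && mergeCount == countOnEntry;
    }

    /** Blocks until no merge is in progress or at least one merge completes. */
    synchronized void waitForMerge() throws InterruptedException {
        final int countOnEntry = mergeCount; // cache the value on entry
        while (mustWait(countOnEntry)) {
            wait();
        }
    }
}
```

The loop condition is the whole design: a fetcher that entered while two merges were active is released when the first one finishes, even though inProgress stays true for the second.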
                
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842-2.patch, mapreduce-4842.patch, 
> mapreduce-4842.patch, MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, 
> MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  
> It looked similar to the problem described in MAPREDUCE-3721, where the 
> fetchers were all being told to WAIT by the MergeManager but no merge was 
> taking place.

