[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mariappan Asokan updated MAPREDUCE-4842:
----------------------------------------

    Attachment: mapreduce-4842.patch

Hi Jason,
 I have uploaded the patch, with the caveat that it has not been stress tested :)

You stated the following:
{quote}
We ran this patch through gridmix, and there are some indications it may 
negatively affect the performance of shuffle/merge for reducers. Not quite sure 
why, yet, as I haven't had time to investigate. Maybe since this patch checks 
for starting merges more often we end up starting merges too early and end up 
creating more work than if we wait for a fetcher to commit first?
{quote}

# Did you look at the log files for the messages logged by the 
{{startMerge()}} method in {{MergeThread}}? It merges at most 
{{mergeFactor}} map outputs at a time. Since you suspect that "we end up 
starting merges too early," do you see any differences in those messages 
with and without your patch?

# This is tangential to point 1. The {{mergeFactor}} is set to the configured 
value for {{IntermediateMemoryToMemoryMerger}} but to {{Integer.MAX_VALUE}} for 
{{InMemoryMerger}} and {{OnDiskMerger}}. We have to find out the rationale 
behind these choices.

# You are right that my patch did not change the logic that decides when to 
start a merge.
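
To make points 1 and 2 concrete, here is a minimal sketch of what "at most 
{{mergeFactor}} map outputs at a time" means in practice. It is not the actual 
Hadoop code: the class {{MergeBatchSketch}}, the helper {{takeBatch()}} and the 
"map_N" names are hypothetical and only illustrate the batching idea. With a 
small factor, part of the pending set is left for a later merge; with 
Integer.MAX_VALUE, everything pending goes into a single merge.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class MergeBatchSketch {

    // Hypothetical helper (not a Hadoop API): removes up to mergeFactor
    // entries from the pending set and returns them as one merge batch.
    static <T> List<T> takeBatch(Set<T> pending, int mergeFactor) {
        List<T> batch = new ArrayList<>();
        Iterator<T> it = pending.iterator();
        for (int i = 0; it.hasNext() && i < mergeFactor; i++) {
            batch.add(it.next());
            it.remove(); // whatever is not taken stays pending
        }
        return batch;
    }

    public static void main(String[] args) {
        Set<String> pending = new LinkedHashSet<>();
        for (int i = 0; i < 25; i++) {
            pending.add("map_" + i); // stand-ins for fetched map outputs
        }

        // A small factor caps the batch and leaves the rest pending.
        List<String> capped = takeBatch(pending, 10);
        System.out.println(capped.size() + " merged, " + pending.size() + " left");

        // Integer.MAX_VALUE effectively means "merge everything pending".
        List<String> rest = takeBatch(pending, Integer.MAX_VALUE);
        System.out.println(rest.size() + " merged, " + pending.size() + " left");
    }
}
```

Under these assumptions the first call merges 10 outputs and leaves 15, and the 
second merges the remaining 15. If merges start earlier with the patch, the 
batch sizes reported in the logs should be smaller and more numerous.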

Let us compare the logs (with and without the patches) and draw our 
conclusions from there.

Thanks for sharing the information.

-- Asokan
                
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: mapreduce-4842.patch, MAPREDUCE-4842.patch, 
> MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  
> It looked similar to the problem described in MAPREDUCE-3721, where the 
> fetchers were all being told to WAIT by the MergeManager but no merge was 
> taking place.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
