[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526636#comment-14526636
 ] 

Jason Lowe commented on MAPREDUCE-6351:
---------------------------------------

I suspect this is a duplicate of MAPREDUCE-6334.  I see a lot of these types of 
messages in the reducer log:
{noformat}
2015-05-01 19:59:37,632 WARN [fetcher#13] 
org.apache.hadoop.mapreduce.task.reduce.Fetcher: Shuffle output from 
glgs1190.grid.uh1.inmobi.com:13562 failed, retry it.
{noformat}

I think memory allocations are being leaked when these shuffle errors occur.  
The shuffle buffer runs out of available memory (hence the fetchers being told 
to WAIT), but there isn't enough completed data in the shuffle buffer to 
trigger a merge.  The leaked memory will never be filled with completed map 
output, so nothing kicks off the merge that would unblock the other threads.
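
To make the suspected sequence concrete, here is a rough single-threaded toy 
model of it. This is not the Hadoop code: the class, the method names and the 
numbers (a 100-unit budget, a merge threshold of 66, map outputs of 10 units) 
are all made up for illustration. It only shows how reservations that are not 
given back after a failed fetch can exhaust the budget while the committed 
data stays below the merge threshold.

```java
/** Toy model of a shuffle memory budget. Hypothetical, NOT the Hadoop implementation. */
class ShuffleBufferSketch {
    private final long memoryLimit;         // total in-memory shuffle budget (assumed)
    private final long mergeThreshold;      // committed bytes needed to start a merge (assumed)
    private final boolean releaseOnFailure; // false models the suspected leak
    private long usedMemory = 0;            // everything reserved, committed or not
    private long commitMemory = 0;          // fully fetched map outputs only

    ShuffleBufferSketch(long memoryLimit, long mergeThreshold, boolean releaseOnFailure) {
        this.memoryLimit = memoryLimit;
        this.mergeThreshold = mergeThreshold;
        this.releaseOnFailure = releaseOnFailure;
    }

    /** Returns false when the fetcher would be told to WAIT. */
    boolean reserve(long size) {
        if (usedMemory + size > memoryLimit) {
            return false;
        }
        usedMemory += size;
        return true;
    }

    /** A failed fetch commits nothing; the reservation is either returned or leaked. */
    void fetchFailed(long size) {
        if (releaseOnFailure) {
            usedMemory -= size;
        }
    }

    /** A completed fetch commits its bytes; enough committed bytes trigger a merge. */
    void fetchSucceeded(long size) {
        commitMemory += size;
        if (commitMemory >= mergeThreshold) {
            usedMemory -= commitMemory; // the merge frees what was committed
            commitMemory = 0;
        }
    }

    /** Five failed fetches followed by successful ones; true if a fetcher ends up waiting. */
    static boolean stalls(boolean releaseOnFailure) {
        ShuffleBufferSketch buf = new ShuffleBufferSketch(100, 66, releaseOnFailure);
        for (int i = 0; i < 20; i++) {
            if (!buf.reserve(10)) {
                return true; // told to WAIT, and no merge is pending to free memory
            }
            if (i < 5) {
                buf.fetchFailed(10);
            } else {
                buf.fetchSucceeded(10);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("leak on failure, stalls: " + stalls(false));
        System.out.println("release on failure, stalls: " + stalls(true));
    }
}
```

In this model the leaking variant stalls once 50 units are leaked and 50 are 
committed (below the threshold of 66), while the variant that returns the 
reservation on failure keeps merging and never blocks.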

> Reducer hung in copy phase.
> ---------------------------
>
>                 Key: MAPREDUCE-6351
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6351
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.6.0
>            Reporter: Laxman
>         Attachments: jstat-gc.log, reducer-container-partial.log.zip, 
> thread-dumps.out
>
>
> *Problem*
> Reducer gets stuck in the copy phase and doesn't make progress for a very 
> long time. After killing this task manually a couple of times, it 
> completes. 
> *Observations*
> - Verified gc logs. Found no memory-related issues. Attached the logs.
> - Verified thread dumps. Found no thread related problems. 
> - On verification of logs, fetcher threads are not copying the map outputs 
> and they are just waiting for merge to happen.
> - Merge thread is alive and in wait state.
> *Analysis* 
> On careful observation of logs, thread dumps and code, this looks to me like 
> a classic multi-threading issue: the thread goes into the wait state after 
> it has already been notified. 
> Here is the suspect code flow.
> *Thread #1*
> Fetcher thread - notification comes first
> org.apache.hadoop.mapreduce.task.reduce.MergeThread.startMerge(Set<T>)
> {code}
>       synchronized(pendingToBeMerged) {
>         pendingToBeMerged.addLast(toMergeInputs);
>         pendingToBeMerged.notifyAll();
>       }
> {code}
> *Thread #2*
> Merge Thread - goes to wait state (Notification goes unconsumed)
> org.apache.hadoop.mapreduce.task.reduce.MergeThread.run()
> {code}
>         synchronized (pendingToBeMerged) {
>           while(pendingToBeMerged.size() <= 0) {
>             pendingToBeMerged.wait();
>           }
>           // Pickup the inputs to merge.
>           inputs = pendingToBeMerged.removeFirst();
>         }
> {code}
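
As a side note on the quoted code flow: the wait in run() sits inside a 
while loop that re-checks the queue under the same monitor that startMerge() 
synchronizes on. A small stand-alone sketch of that guarded-wait pattern is 
below. The class and method names are invented for the example and only mirror 
the two snippets above, they are not the Hadoop classes. It has the 
notification arrive before the consumer ever calls wait(), which is the 
ordering described in the analysis.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Stand-alone guarded-wait sketch; hypothetical names, NOT the Hadoop MergeThread. */
class GuardedWaitSketch {
    private final Deque<String> pending = new ArrayDeque<>();

    /** Producer side: enqueue under the monitor, then notify. */
    void startMerge(String inputs) {
        synchronized (pending) {
            pending.addLast(inputs);
            pending.notifyAll();
        }
    }

    /** Consumer side: wait only while the queue is empty, checked under the monitor. */
    String takeNext() throws InterruptedException {
        synchronized (pending) {
            while (pending.isEmpty()) {
                pending.wait();
            }
            return pending.removeFirst();
        }
    }

    /** Notifies first, starts the consumer afterwards; returns null if the consumer hangs. */
    static String notifyBeforeWait() {
        GuardedWaitSketch queue = new GuardedWaitSketch();
        queue.startMerge("batch-1"); // nobody is waiting yet
        final String[] result = new String[1];
        Thread merger = new Thread(() -> {
            try {
                result[0] = queue.takeNext();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        merger.setDaemon(true); // do not keep the JVM alive if it did hang
        merger.start();
        try {
            merger.join(2000);  // bounded wait so the sketch cannot block forever
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println("consumer received: " + notifyBeforeWait());
    }
}
```

In this sketch the consumer finds the queue non-empty, skips wait() entirely 
and picks up the item, so an early notification by itself does not leave the 
consumer blocked.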


