[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Laxman updated MAPREDUCE-6351:
------------------------------
    Description: 
*Problem*
Reducer gets stuck in the copy phase and makes no progress for a very long time. 
After the task is killed manually a couple of times, it completes. 

*Observations*
- Verified GC logs. Found no memory-related issues. Attached the logs.
- Verified thread dumps. Found no thread-related problems. 
- The logs show that fetcher threads are not copying the map outputs; they 
are just waiting for a merge to happen.
- Merge thread is alive and in wait state.

*Analysis* 
On careful observation of logs, thread dumps and code, this looks to me like a 
classic multi-threading issue: the merge thread goes into the wait state after 
the notification has already been sent. 

Here is the suspect code flow.
*Thread #1*
Fetcher thread - notification comes first
org.apache.hadoop.mapreduce.task.reduce.MergeThread.startMerge(Set<T>)
{code}
      synchronized(pendingToBeMerged) {
        pendingToBeMerged.addLast(toMergeInputs);
        pendingToBeMerged.notifyAll();
      }
{code}

*Thread #2*
Merge Thread - goes to wait state (Notification goes unconsumed)
org.apache.hadoop.mapreduce.task.reduce.MergeThread.run()
{code}
        synchronized (pendingToBeMerged) {
          while(pendingToBeMerged.size() <= 0) {
            pendingToBeMerged.wait();
          }
          // Pickup the inputs to merge.
          inputs = pendingToBeMerged.removeFirst();
        }
{code}
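
To make the suspected ordering easier to reason about, here is a minimal standalone sketch of the same handoff outside Hadoop. It assumes a plain LinkedList used as both queue and monitor, mirroring the two snippets above. Only the names startMerge and pendingToBeMerged come from the code above; the class MergeHandoffSketch and the methods takeNext and notifyBeforeWait are hypothetical and are not part of MergeThread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

public class MergeHandoffSketch {
  // Stand-in for MergeThread.pendingToBeMerged; it is both the queue and the monitor.
  private final LinkedList<List<String>> pendingToBeMerged = new LinkedList<>();

  // Producer side, shaped like startMerge(): enqueue and notify while holding the lock.
  public void startMerge(List<String> toMergeInputs) {
    synchronized (pendingToBeMerged) {
      pendingToBeMerged.addLast(toMergeInputs);
      pendingToBeMerged.notifyAll();
    }
  }

  // Consumer side, shaped like the block in run(): wait only while the queue is empty.
  public List<String> takeNext() throws InterruptedException {
    synchronized (pendingToBeMerged) {
      while (pendingToBeMerged.size() <= 0) {
        pendingToBeMerged.wait();
      }
      return pendingToBeMerged.removeFirst();
    }
  }

  // Replays the suspected ordering: the notification is sent before the
  // consumer thread has started. Returns what the consumer picked up, or
  // null if it is still blocked after timeoutMillis.
  public static List<String> notifyBeforeWait(long timeoutMillis) {
    MergeHandoffSketch sketch = new MergeHandoffSketch();
    List<List<String>> picked = new ArrayList<>();
    sketch.startMerge(Arrays.asList("map_0", "map_1"));

    Thread merger = new Thread(() -> {
      try {
        List<String> inputs = sketch.takeNext();
        synchronized (picked) {
          picked.add(inputs);
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    merger.setDaemon(true); // do not keep the JVM alive if the consumer stays blocked
    merger.start();
    try {
      merger.join(timeoutMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    synchronized (picked) {
      return picked.isEmpty() ? null : picked.get(0);
    }
  }

  public static void main(String[] args) {
    List<String> result = notifyBeforeWait(2000);
    System.out.println(result == null ? "merger still waiting" : "merger picked up " + result);
  }
}
```

In this isolated sketch the consumer checks the queue size under the same monitor before it waits, so an entry queued earlier is picked up without needing a fresh notification. It does not model the rest of MergeThread or the fetchers, so it only narrows down where the lost hand-off might be.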


  was:
*Problem*
Reducer gets stuck in the copy phase and makes no progress for a very long time. 
After the task is killed manually a couple of times, it completes. 

*Analysis*
- Verified GC logs. Found no memory-related issues. Attache
- Verified thread dumps. Found no thread related problems. 
- On verification of logs, fetcher threads are not copying the map outputs and 
they are just waiting for merge to happen.
- Merge thread is alive and in wait state.

On careful observation of logs, thread dumps and code, this looks to me like a 
classic case of multi-threading issue. Thread goes to wait state after it has 
been notified. 

Here is the suspect code flow.

*Thread #1*
Fetcher thread - notification comes first
org.apache.hadoop.mapreduce.task.reduce.MergeThread.startMerge(Set<T>)
{code}
      synchronized(pendingToBeMerged) {
        pendingToBeMerged.addLast(toMergeInputs);
        pendingToBeMerged.notifyAll();
      }
{code}

*Thread #2*
Merge Thread - goes to wait state (Notification goes unconsumed)
org.apache.hadoop.mapreduce.task.reduce.MergeThread.run()
{code}
        synchronized (pendingToBeMerged) {
          while(pendingToBeMerged.size() <= 0) {
            pendingToBeMerged.wait();
          }
          // Pickup the inputs to merge.
          inputs = pendingToBeMerged.removeFirst();
        }
{code}



> Reducer hung in copy phase.
> ---------------------------
>
>                 Key: MAPREDUCE-6351
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6351
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.6.0
>            Reporter: Laxman
>         Attachments: jstat-gc.log, reducer-container-partial.log.zip, 
> thread-dumps.out
>
>
> *Problem*
> Reducer gets stuck in the copy phase and makes no progress for a very long 
> time. After the task is killed manually a couple of times, it completes. 
> *Observations*
> - Verified GC logs. Found no memory-related issues. Attached the logs.
> - Verified thread dumps. Found no thread-related problems. 
> - The logs show that fetcher threads are not copying the map outputs; they 
> are just waiting for a merge to happen.
> - Merge thread is alive and in wait state.
> *Analysis* 
> On careful observation of logs, thread dumps and code, this looks to me like 
> a classic multi-threading issue: the merge thread goes into the wait state 
> after the notification has already been sent. 
> Here is the suspect code flow.
> *Thread #1*
> Fetcher thread - notification comes first
> org.apache.hadoop.mapreduce.task.reduce.MergeThread.startMerge(Set<T>)
> {code}
>       synchronized(pendingToBeMerged) {
>         pendingToBeMerged.addLast(toMergeInputs);
>         pendingToBeMerged.notifyAll();
>       }
> {code}
> *Thread #2*
> Merge Thread - goes to wait state (Notification goes unconsumed)
> org.apache.hadoop.mapreduce.task.reduce.MergeThread.run()
> {code}
>         synchronized (pendingToBeMerged) {
>           while(pendingToBeMerged.size() <= 0) {
>             pendingToBeMerged.wait();
>           }
>           // Pickup the inputs to merge.
>           inputs = pendingToBeMerged.removeFirst();
>         }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
