[jira] [Commented] (MAPREDUCE-6351) Reducer hung in copy phase.

2015-05-05 Thread Laxman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528154#comment-14528154
 ] 

Laxman commented on MAPREDUCE-6351:
---

Thanks a lot, Jason, for the details. We are hitting exactly the same scenario 
(a bad disk) as explained in MAPREDUCE-6334.
We will try the patch and update this JIRA with the results.



> Reducer hung in copy phase.
> ---
>
> Key: MAPREDUCE-6351
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6351
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 2.6.0
> Reporter: Laxman
> Attachments: jstat-gc.log, reducer-container-partial.log.zip, thread-dumps.out
>
>
> *Problem*
> The reducer gets stuck in the copy phase and makes no progress for a very 
> long time. After the task is killed manually a couple of times, it 
> completes. 
> *Observations*
> - Verified GC logs. Found no memory-related issues. Attached the logs.
> - Verified thread dumps. Found no thread-related problems. 
> - According to the logs, the fetcher threads are not copying map outputs; 
> they are just waiting for a merge to happen.
> - The merge thread is alive and in a wait state.
> *Analysis* 
> After careful study of the logs, thread dumps, and code, this looks to me 
> like a classic multi-threading issue: a thread goes into the wait state 
> after it has already been notified. 
> Here is the suspect code flow.
> *Thread #1*
> Fetcher thread - notification comes first
> org.apache.hadoop.mapreduce.task.reduce.MergeThread.startMerge(Set)
> {code}
> synchronized (pendingToBeMerged) {
>   pendingToBeMerged.addLast(toMergeInputs);
>   pendingToBeMerged.notifyAll();
> }
> {code}
> *Thread #2*
> Merge Thread - goes to wait state (Notification goes unconsumed)
> org.apache.hadoop.mapreduce.task.reduce.MergeThread.run()
> {code}
> synchronized (pendingToBeMerged) {
>   while (pendingToBeMerged.size() <= 0) {
>     pendingToBeMerged.wait();
>   }
>   // Pick up the inputs to merge.
>   inputs = pendingToBeMerged.removeFirst();
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6351) Reducer hung in copy phase.

2015-05-04 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526636#comment-14526636
 ] 

Jason Lowe commented on MAPREDUCE-6351:
---

I suspect this is a duplicate of MAPREDUCE-6334.  I see a lot of these types of 
messages in the reducer log:
{noformat}
2015-05-01 19:59:37,632 WARN [fetcher#13] 
org.apache.hadoop.mapreduce.task.reduce.Fetcher: Shuffle output from 
glgs1190.grid.uh1.inmobi.com:13562 failed, retry it.
{noformat}

I think the shuffle errors are leaking memory allocations, so the shuffle 
buffer runs out of available memory (hence the fetchers are told to WAIT), but 
there isn't enough data in the shuffle buffer to trigger a merge.  The leaked 
memory will never complete, so nothing kicks off the merge and unblocks the 
other threads.
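
To make the suspected stall concrete, here is a minimal sketch of the accounting described above. It is a toy model, not Hadoop's merge manager: the class, its fields, and the numbers (a limit of 100, a merge threshold of 66, 20-byte map outputs) are all assumptions chosen for illustration.

```java
// Toy model of a shuffle buffer; all names and numbers are hypothetical.
class ShuffleBufferModel {
  private final long memoryLimit;     // total bytes fetchers may reserve
  private final long mergeThreshold;  // committed bytes needed to start a merge
  private long usedMemory;            // reserved bytes, finished or not
  private long commitMemory;          // bytes from copies that completed

  ShuffleBufferModel(long memoryLimit, long mergeThreshold) {
    this.memoryLimit = memoryLimit;
    this.mergeThreshold = mergeThreshold;
  }

  // Returns false when the fetcher would be told to WAIT.
  boolean reserve(long bytes) {
    if (usedMemory + bytes > memoryLimit) {
      return false;
    }
    usedMemory += bytes;
    return true;
  }

  // A copy finished; returns true if that triggers a merge, which frees memory.
  boolean commit(long bytes) {
    commitMemory += bytes;
    if (commitMemory >= mergeThreshold) {
      usedMemory -= commitMemory;
      commitMemory = 0;
      return true;
    }
    return false;
  }

  // Correct handling of a failed copy: give the reservation back.
  void unreserve(long bytes) {
    usedMemory -= bytes;
  }

  // One successful copy, then repeated failures. Returns true if a fetcher
  // ends up told to WAIT while too little data is committed to start a merge.
  static boolean stalls(boolean leakOnFailure) {
    ShuffleBufferModel buf = new ShuffleBufferModel(100, 66);
    buf.reserve(20);
    buf.commit(20);  // 20 committed bytes, below the threshold of 66
    for (int i = 0; i < 10; i++) {
      if (!buf.reserve(20)) {
        return true;  // WAIT, and no merge will ever free memory
      }
      if (!leakOnFailure) {
        buf.unreserve(20);  // the failed copy releases its reservation
      }
      // With leakOnFailure the reservation is never released.
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println("stalls when failures leak: " + stalls(true));
    System.out.println("stalls when failures unreserve: " + stalls(false));
  }
}
```

In this model the leaked reservations fill the buffer while the committed total stays below the threshold, which matches the symptom of fetchers waiting on a merge that never starts.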






[jira] [Commented] (MAPREDUCE-6351) Reducer hung in copy phase.

2015-05-04 Thread Laxman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526438#comment-14526438
 ] 

Laxman commented on MAPREDUCE-6351:
---

The "Threads analysis" in the description above turned out to be incorrect when 
I retraced the code flow. An early notification is not a problem, because the 
merger's wait is guarded by a size check.

However, the problem still exists: the fetchers are not proceeding because they 
are waiting for the merger to free some memory, and the merger is doing nothing.
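
A small standalone sketch of why the size check makes an early notification harmless. It mirrors the shape of the two snippets from the description, but it is not Hadoop's MergeThread: the class name, the `List<String>` element type, and the `earlyNotifyIsConsumed` helper are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Toy reproduction of the two quoted snippets; not Hadoop's MergeThread.
class GuardedWaitDemo {
  private final LinkedList<List<String>> pendingToBeMerged = new LinkedList<>();

  // Producer side: enqueue the inputs and notify, as in startMerge.
  void startMerge(List<String> toMergeInputs) {
    synchronized (pendingToBeMerged) {
      pendingToBeMerged.addLast(toMergeInputs);
      pendingToBeMerged.notifyAll();
    }
  }

  // Consumer side: wait only while the queue is empty, as in run().
  List<String> takeInputs() throws InterruptedException {
    synchronized (pendingToBeMerged) {
      while (pendingToBeMerged.size() <= 0) {
        pendingToBeMerged.wait();
      }
      return pendingToBeMerged.removeFirst();
    }
  }

  // Notifies before any consumer exists, then starts the consumer.
  // Returns true if the consumer still picks up the inputs.
  static boolean earlyNotifyIsConsumed() {
    GuardedWaitDemo demo = new GuardedWaitDemo();
    List<String> inputs = new ArrayList<>();
    inputs.add("map-output-0");
    demo.startMerge(inputs);  // notifyAll() runs with nobody waiting

    List<List<String>> taken = new ArrayList<>();
    Thread merger = new Thread(() -> {
      try {
        taken.add(demo.takeInputs());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    merger.setDaemon(true);  // do not keep the JVM alive if it ever hangs
    merger.start();
    try {
      merger.join(5000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !merger.isAlive() && taken.size() == 1 && taken.get(0).equals(inputs);
  }

  public static void main(String[] args) {
    System.out.println("early notification consumed: " + earlyNotifyIsConsumed());
  }
}
```

The consumer checks the queue before it ever calls `wait()`, so it finds the queued inputs and skips the wait; the state of the queue carries the information, not the notification itself.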



