[ https://issues.apache.org/jira/browse/MAPREDUCE-6351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Laxman updated MAPREDUCE-6351:
------------------------------
    Description:
*Problem*
Reducer gets stuck in the copy phase and makes no progress for a very long time. After the task is killed manually a couple of times, it completes.

*Observations*
- Verified gc logs. Found no memory-related issues. Attached the logs.
- Verified thread dumps. Found no thread-related problems.
- The logs show that the fetcher threads are not copying the map outputs; they are just waiting for a merge to happen.
- The merge thread is alive and in the wait state.

{deleted}
*Analysis*
On careful observation of logs, thread dumps and code, this looks to me like a classic multi-threading issue. The thread goes into the wait state after it has already been notified.

Here is the suspect code flow.

*Thread #1*
Fetcher thread - notification comes first
org.apache.hadoop.mapreduce.task.reduce.MergeThread.startMerge(Set<T>)
{code}
      synchronized(pendingToBeMerged) {
        pendingToBeMerged.addLast(toMergeInputs);
        pendingToBeMerged.notifyAll();
      }
{code}

*Thread #2*
Merge Thread - goes to wait state (Notification goes unconsumed)
org.apache.hadoop.mapreduce.task.reduce.MergeThread.run()
{code}
        synchronized (pendingToBeMerged) {
          while(pendingToBeMerged.size() <= 0) {
            pendingToBeMerged.wait();
          }
          // Pickup the inputs to merge.
          inputs = pendingToBeMerged.removeFirst();
        }
{code}
{deleted}

  was:
*Problem*
Reducer gets stuck in the copy phase and makes no progress for a very long time. After the task is killed manually a couple of times, it completes.

*Observations*
- Verified gc logs. Found no memory-related issues. Attached the logs.
- Verified thread dumps. Found no thread-related problems.
- The logs show that the fetcher threads are not copying the map outputs; they are just waiting for a merge to happen.
- The merge thread is alive and in the wait state.

*Analysis*
On careful observation of logs, thread dumps and code, this looks to me like a classic multi-threading issue.
The thread goes into the wait state after it has already been notified.

Here is the suspect code flow.

*Thread #1*
Fetcher thread - notification comes first
org.apache.hadoop.mapreduce.task.reduce.MergeThread.startMerge(Set<T>)
{code}
      synchronized(pendingToBeMerged) {
        pendingToBeMerged.addLast(toMergeInputs);
        pendingToBeMerged.notifyAll();
      }
{code}

*Thread #2*
Merge Thread - goes to wait state (Notification goes unconsumed)
org.apache.hadoop.mapreduce.task.reduce.MergeThread.run()
{code}
        synchronized (pendingToBeMerged) {
          while(pendingToBeMerged.size() <= 0) {
            pendingToBeMerged.wait();
          }
          // Pickup the inputs to merge.
          inputs = pendingToBeMerged.removeFirst();
        }
{code}


> Reducer hung in copy phase.
> ---------------------------
>
>                 Key: MAPREDUCE-6351
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6351
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.6.0
>            Reporter: Laxman
>         Attachments: jstat-gc.log, reducer-container-partial.log.zip, thread-dumps.out
>
>
> *Problem*
> Reducer gets stuck in the copy phase and makes no progress for a very long time. After the task is killed manually a couple of times, it completes.
> *Observations*
> - Verified gc logs. Found no memory-related issues. Attached the logs.
> - Verified thread dumps. Found no thread-related problems.
> - The logs show that the fetcher threads are not copying the map outputs; they are just waiting for a merge to happen.
> - The merge thread is alive and in the wait state.
> {deleted}
> *Analysis*
> On careful observation of logs, thread dumps and code, this looks to me like a classic multi-threading issue. The thread goes into the wait state after it has already been notified.
> Here is the suspect code flow.
> *Thread #1*
> Fetcher thread - notification comes first
> org.apache.hadoop.mapreduce.task.reduce.MergeThread.startMerge(Set<T>)
> {code}
>       synchronized(pendingToBeMerged) {
>         pendingToBeMerged.addLast(toMergeInputs);
>         pendingToBeMerged.notifyAll();
>       }
> {code}
> *Thread #2*
> Merge Thread - goes to wait state (Notification goes unconsumed)
> org.apache.hadoop.mapreduce.task.reduce.MergeThread.run()
> {code}
>         synchronized (pendingToBeMerged) {
>           while(pendingToBeMerged.size() <= 0) {
>             pendingToBeMerged.wait();
>           }
>           // Pickup the inputs to merge.
>           inputs = pendingToBeMerged.removeFirst();
>         }
> {code}
> {deleted}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
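
One way to exercise the suspected ordering (notification first, wait second) outside Hadoop is a small standalone sketch that copies only the two synchronized blocks quoted above. Everything else here is an assumption made for illustration: the class name PendingQueueSketch, the helper notifyBeforeWait, the String element type and the timeout are hypothetical and are not part of MergeThread.

```java
import java.util.LinkedList;

/** Hypothetical stand-in for the quoted queue handoff; this is not Hadoop code. */
public class PendingQueueSketch {
    private final LinkedList<String> pendingToBeMerged = new LinkedList<>();

    /** Producer side, mirroring the quoted startMerge block. */
    public void startMerge(String toMergeInputs) {
        synchronized (pendingToBeMerged) {
            pendingToBeMerged.addLast(toMergeInputs);
            pendingToBeMerged.notifyAll();
        }
    }

    /** Consumer side, mirroring the quoted run() block. */
    public String takeNext() throws InterruptedException {
        synchronized (pendingToBeMerged) {
            // The size is re-checked under the same lock before every wait.
            while (pendingToBeMerged.size() <= 0) {
                pendingToBeMerged.wait();
            }
            return pendingToBeMerged.removeFirst();
        }
    }

    /**
     * Enqueues and notifies first, then starts the consumer thread.
     * Returns what the consumer picked up, or null if it did not finish in time.
     */
    public static String notifyBeforeWait(long timeoutMillis) {
        PendingQueueSketch queue = new PendingQueueSketch();
        queue.startMerge("segment-1"); // notification happens before any wait
        String[] picked = new String[1];
        Thread merger = new Thread(() -> {
            try {
                picked[0] = queue.takeNext();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        merger.setDaemon(true); // do not keep the JVM alive if the consumer hangs
        merger.start();
        try {
            merger.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return picked[0];
    }

    public static void main(String[] args) {
        System.out.println(notifyBeforeWait(2000));
    }
}
```

In this isolated form the item is already queued when the consumer checks the size, so the loop body is skipped and the consumer returns the queued item. The sketch does not model anything outside these two blocks, so it can only help narrow down where a hang like the one reported might originate.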