[ https://issues.apache.org/jira/browse/MAPREDUCE-935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748983#action_12748983 ]
Steve Loughran commented on MAPREDUCE-935:
------------------------------------------

Thread dump of the killed process
{code}
2009-08-28 21:40:39,572 INFO mapred.Task (ReflectionUtils.java:logThreadInfo(204)) - Process Thread Dump: Communication exception
14 active threads
Thread 26 (Thread for polling Map Completion Events):
  State: TIMED_WAITING
  Blocked count: 25
  Waited count: 117
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:392)
    org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:334)
    org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:197)
    org.apache.hadoop.ipc.Client.getConnection(Client.java:886)
    org.apache.hadoop.ipc.Client.call(Client.java:746)
    org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:223)
    org.apache.hadoop.mapred.$Proxy0.getMapCompletionEvents(Unknown Source)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2725)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2692)
Thread 25 (Thread for merging in memory files):
  State: WAITING
  Blocked count: 12
  Waited count: 13
  Waiting on java.lang.obj...@500ab58d
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$ShuffleRamManager.waitForDataToMerge(ReduceTask.java:1092)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2590)
Thread 24 (Thread for merging on-disk files):
  State: WAITING
  Blocked count: 0
  Waited count: 1
  Waiting on java.util.tree...@659c2931
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2493)
Thread 23 (MapOutputCopier attempt_20090828213826033_0001_r_000000_0.4):
  State: WAITING
  Blocked count: 267
  Waited count: 266
  Waiting on java.util.arrayl...@15ab7626
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1219)
Thread 22 (MapOutputCopier attempt_20090828213826033_0001_r_000000_0.3):
  State: WAITING
  Blocked count: 273
  Waited count: 270
  Waiting on java.util.arrayl...@15ab7626
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1219)
Thread 21 (MapOutputCopier attempt_20090828213826033_0001_r_000000_0.2):
  State: WAITING
  Blocked count: 274
  Waited count: 275
  Waiting on java.util.arrayl...@15ab7626
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1219)
Thread 20 (MapOutputCopier attempt_20090828213826033_0001_r_000000_0.1):
  State: WAITING
  Blocked count: 257
  Waited count: 257
  Waiting on java.util.arrayl...@15ab7626
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1219)
Thread 19 (MapOutputCopier attempt_20090828213826033_0001_r_000000_0.0):
  State: WAITING
  Blocked count: 77
  Waited count: 72
  Waiting on java.util.arrayl...@15ab7626
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1219)
Thread 18 (communication thread):
  State: RUNNABLE
  Blocked count: 16
  Waited count: 34
  Stack:
    sun.management.ThreadImpl.getThreadInfo0(Native Method)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
    org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)
    org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)
    org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:532)
    java.lang.Thread.run(Thread.java:637)
Thread 15 (Thread for syncLogs):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 24
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.mapred.Child$2.run(Child.java:88)
Thread 5 (Signal Dispatcher):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
Thread 3 (Finalizer):
  State: WAITING
  Blocked count: 5
  Waited count: 6
  Waiting on java.lang.ref.referencequeue$l...@21e30857
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
    java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
Thread 2 (Reference Handler):
  State: WAITING
  Blocked count: 7
  Waited count: 6
  Waiting on java.lang.ref.reference$l...@43cda81e
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
Thread 1 (main):
  State: TIMED_WAITING
  Blocked count: 16
  Waited count: 26
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier.fetchOutputs(ReduceTask.java:2027)
    org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:374)
    org.apache.hadoop.mapred.Child.main(Child.java:162)
{code}

> There's little to be gained by putting a host into the penaltybox at reduce
> time, if its the only host you have
> ---------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-935
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-935
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: tasktracker
> Affects Versions: 0.21.0
> Reporter: Steve Loughran
>
> Exponential backoff may be good for dealing with troublesome hosts, but not
> if you only have one host in the entire system.
> From the log of {{TestNodeRefresh}}, which for some reason is blocking in
> the reduce phase, I can see it doesn't take much for the backoff to kick in
> so rapidly that the reducer is waiting for longer than the test
> {code}
> 2009-08-28 21:39:16,788 WARN mapred.ReduceTask (ReduceTask.java:fetchOutputs(2192)) - attempt_20090828213826033_0001_r_000000_0 adding host localhost to penalty box, next contact in 150 seconds
> {code}
> The result of this backoff process is that the reduce process ends up
> appearing to hang, getting killed from above.
> Note that this isn't the root cause of the problem, but it certainly
> amplifies things.
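
To make the backoff concern concrete, here is a minimal sketch of a penalty box whose wait doubles on each failure but is capped when only one host is known. It is not the code in ReduceTask: the class name, the constants, and the doubling rule are all assumptions chosen for illustration, and the real growth rate may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not Hadoop's ReduceTask implementation.
class PenaltyBoxSketch {
    // Assumed values, picked for the example rather than taken from Hadoop.
    static final long INITIAL_PENALTY_MS = 4_000L;
    static final long SINGLE_HOST_CAP_MS = 10_000L;

    private final Map<String, Integer> failures = new HashMap<>();
    private final Map<String, Long> retryAt = new HashMap<>();

    // Penalty doubles with each prior failure; with a single known host
    // there is nobody else to fetch from, so the wait is capped.
    static long penaltyMillis(int priorFailures, int knownHosts) {
        int shift = Math.min(priorFailures, 20); // keep the shift bounded
        long penalty = INITIAL_PENALTY_MS << shift;
        return knownHosts <= 1 ? Math.min(penalty, SINGLE_HOST_CAP_MS) : penalty;
    }

    // Records a failed fetch and returns the penalty applied to the host.
    long recordFailure(String host, long nowMs, int knownHosts) {
        int prior = failures.merge(host, 1, Integer::sum) - 1;
        long penalty = penaltyMillis(prior, knownHosts);
        retryAt.put(host, nowMs + penalty);
        return penalty;
    }

    // True while the host should not be contacted yet.
    boolean isPenalized(String host, long nowMs) {
        return retryAt.getOrDefault(host, 0L) > nowMs;
    }

    public static void main(String[] args) {
        for (int f = 0; f <= 5; f++) {
            System.out.println(f + " prior failures: many hosts="
                + penaltyMillis(f, 3) + "ms, single host="
                + penaltyMillis(f, 1) + "ms");
        }
    }
}
```

Under these assumed numbers, an uncapped penalty passes two minutes after five failures, which is the kind of wait that makes a single-host test appear to hang. Whether a cap, or skipping the penalty box entirely, is the right fix is a design question for the issue.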