[ https://issues.apache.org/jira/browse/MAPREDUCE-935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748983#action_12748983 ]

Steve Loughran commented on MAPREDUCE-935:
------------------------------------------

Thread dump of the killed process:
{code}
2009-08-28 21:40:39,572 INFO  mapred.Task (ReflectionUtils.java:logThreadInfo(204)) - Process Thread Dump: Communication exception
14 active threads
Thread 26 (Thread for polling Map Completion Events):
  State: TIMED_WAITING
  Blocked count: 25
  Waited count: 117
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:392)
    org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:334)
    org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:197)
    org.apache.hadoop.ipc.Client.getConnection(Client.java:886)
    org.apache.hadoop.ipc.Client.call(Client.java:746)
    org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:223)
    org.apache.hadoop.mapred.$Proxy0.getMapCompletionEvents(Unknown Source)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2725)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2692)
Thread 25 (Thread for merging in memory files):
  State: WAITING
  Blocked count: 12
  Waited count: 13
  Waiting on java.lang.obj...@500ab58d
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$ShuffleRamManager.waitForDataToMerge(ReduceTask.java:1092)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2590)
Thread 24 (Thread for merging on-disk files):
  State: WAITING
  Blocked count: 0
  Waited count: 1
  Waiting on java.util.tree...@659c2931
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2493)
Thread 23 (MapOutputCopier attempt_20090828213826033_0001_r_000000_0.4):
  State: WAITING
  Blocked count: 267
  Waited count: 266
  Waiting on java.util.arrayl...@15ab7626
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1219)
Thread 22 (MapOutputCopier attempt_20090828213826033_0001_r_000000_0.3):
  State: WAITING
  Blocked count: 273
  Waited count: 270
  Waiting on java.util.arrayl...@15ab7626
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1219)
Thread 21 (MapOutputCopier attempt_20090828213826033_0001_r_000000_0.2):
  State: WAITING
  Blocked count: 274
  Waited count: 275
  Waiting on java.util.arrayl...@15ab7626
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1219)
Thread 20 (MapOutputCopier attempt_20090828213826033_0001_r_000000_0.1):
  State: WAITING
  Blocked count: 257
  Waited count: 257
  Waiting on java.util.arrayl...@15ab7626
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1219)
Thread 19 (MapOutputCopier attempt_20090828213826033_0001_r_000000_0.0):
  State: WAITING
  Blocked count: 77
  Waited count: 72
  Waiting on java.util.arrayl...@15ab7626
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1219)
Thread 18 (communication thread):
  State: RUNNABLE
  Blocked count: 16
  Waited count: 34
  Stack:
    sun.management.ThreadImpl.getThreadInfo0(Native Method)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
    org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)
    org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)
    org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:532)
    java.lang.Thread.run(Thread.java:637)
Thread 15 (Thread for syncLogs):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 24
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.mapred.Child$2.run(Child.java:88)
Thread 5 (Signal Dispatcher):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
Thread 3 (Finalizer):
  State: WAITING
  Blocked count: 5
  Waited count: 6
  Waiting on java.lang.ref.referencequeue$l...@21e30857
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
    java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
Thread 2 (Reference Handler):
  State: WAITING
  Blocked count: 7
  Waited count: 6
  Waiting on java.lang.ref.reference$l...@43cda81e
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
Thread 1 (main):
  State: TIMED_WAITING
  Blocked count: 16
  Waited count: 26
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier.fetchOutputs(ReduceTask.java:2027)
    org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:374)
    org.apache.hadoop.mapred.Child.main(Child.java:162)
{code}

> There's little to be gained by putting a host into the penaltybox at reduce 
> time, if it's the only host you have
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-935
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-935
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.21.0
>            Reporter: Steve Loughran
>
> Exponential backoff may be good for dealing with troublesome hosts, but not 
> if you only have one host in the entire system. From the log of 
> {{TestNodeRefresh}}, which for some reason is blocking in the reduce phase, I 
> can see that it doesn't take much for the backoff to grow so rapidly that the 
> reducer ends up waiting for longer than the test runs:
> {code}
> 2009-08-28 21:39:16,788 WARN  mapred.ReduceTask (ReduceTask.java:fetchOutputs(2192)) - attempt_20090828213826033_0001_r_000000_0 adding host localhost to penalty box, next contact in 150 seconds
> {code}
> The result of this backoff is that the reduce process appears to hang and 
> gets killed from above. 
> Note that this isn't the root cause of the problem, but it certainly 
> amplifies it. 
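
To illustrate the point in the description above, here is a minimal sketch of a penalty calculation that keeps the backoff short when only one host is known. Everything in it is an assumption made for illustration: the class name, the constants, and the doubling schedule are hypothetical and are not taken from ReduceTask, whose actual backoff code may compute the delay differently.

```java
/**
 * Hypothetical sketch of a penalty-box delay that is capped when the
 * reducer only knows about a single host. Not the ReduceTask code.
 */
final class PenaltyBoxSketch {
    // All three values are assumptions chosen for the example.
    static final long BASE_PENALTY_SECONDS = 4;
    static final long MAX_PENALTY_SECONDS = 300;
    static final long SINGLE_HOST_CAP_SECONDS = 10;

    /**
     * Returns how long to avoid a host after the given number of
     * consecutive failed fetches. The delay doubles per failure up to
     * MAX_PENALTY_SECONDS; with one known host it is capped much lower,
     * since there is no other host to try in the meantime.
     */
    static long penaltySeconds(int failedFetches, int knownHosts) {
        if (failedFetches <= 0) {
            return 0;
        }
        // Bound the shift so the doubling cannot overflow a long.
        int shift = Math.min(failedFetches - 1, 20);
        long backoff = Math.min(BASE_PENALTY_SECONDS << shift, MAX_PENALTY_SECONDS);
        if (knownHosts <= 1) {
            backoff = Math.min(backoff, SINGLE_HOST_CAP_SECONDS);
        }
        return backoff;
    }

    public static void main(String[] args) {
        for (int failures = 1; failures <= 8; failures++) {
            System.out.printf("failures=%d manyHosts=%ds singleHost=%ds%n",
                    failures,
                    penaltySeconds(failures, 5),
                    penaltySeconds(failures, 1));
        }
    }
}
```

With these assumed numbers the single-host delay never exceeds 10 seconds, while the multi-host delay still grows to the 300 second ceiling. Whether a cap, or skipping the penalty box entirely, is the better fix is a separate question.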

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
