[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy reassigned MAPREDUCE-3058:
----------------------------------------

    Assignee: Hitesh Shah  (was: Vinod Kumar Vavilapalli)

Hitesh - can you please take a look since you are already working on related 
areas?

Currently the TT sends SIGTERM followed by SIGKILL...
                
> Sometimes task keeps on running while its Syslog says that it is shutdown
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3058
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3058
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/gridmix, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Karam Singh
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3058-20110923.txt
>
>
> While running GridMixV3, one of the jobs got stuck for 15 hrs. After clicking 
> on the Job-page, found one of its reduces to be stuck. Looking at syslog of 
> the stuck reducer, found this:
> Task-logs' head:
> {code}
> 2011-09-19 17:57:22,002 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period 
> at 10 second(s).
> 2011-09-19 17:57:22,002 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system 
> started
> {code}
> Task-logs' tail:
> {code}
> 2011-09-19 18:06:49,818 INFO org.apache.hadoop.hdfs.DFSClient: Exception in 
> createBlockOutputStream java.io.IOException: Bad connect ack with 
> firstBadLink as <DATANODE1>
> 2011-09-19 18:06:49,818 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery 
> for block 
> BP-1405370709-<NAMENODE>-1316452621953:blk_-7004355226367468317_79871 in 
> pipeline  <DATANODE2>,  <DATANODE1>: bad datanode  <DATANODE1>
> 2011-09-19 18:06:49,818 DEBUG 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtocol: 
> lastAckedSeqno = 26870
> 2011-09-19 18:06:49,820 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <NAMENODE> from gridperf sending #454
> 2011-09-19 18:06:49,826 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <<NAMENODE> from gridperf got value #454
> 2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.ipc.RPC: Call: 
> getAdditionalDatanode 8
> 2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Connecting to 
> datanode <DATANODE2>
> 2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Send buf size 
> 131071
> 2011-09-19 18:06:49,833 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer 
> Exception
> java.io.EOFException: Premature EOF: no length prefix available
>         at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415)
> 2011-09-19 18:06:49,837 WARN org.apache.hadoop.mapred.YarnChild: Exception 
> running child : java.io.EOFException: Premature EOF: no length prefix 
> available
>         at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415)
> 2011-09-19 18:06:49,837 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #455
> 2011-09-19 18:06:49,839 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value 
> #455
> 2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.RPC: Call: statusUpdate 3
> 2011-09-19 18:06:49,840 INFO org.apache.hadoop.mapred.Task: Runnning cleanup 
> for the task
> 2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <NAMENODE> from gridperf sending #456
> 2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <NAMENODE> from gridperf got value #456
> 2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.RPC: Call: delete 18
> 2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #457
> 2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.Client: IPC Client 
> (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value 
> #457
> 2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.RPC: Call: 
> reportDiagnosticInfo 1
> 2011-09-19 18:06:49,859 DEBUG 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: refCount=1
> 2011-09-19 18:06:49,859 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask 
> metrics system...
> 2011-09-19 18:06:49,859 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source 
> UgiMetrics
> 2011-09-19 18:06:49,859 DEBUG 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class 
> org.apache.hadoop.metrics2.lib.MetricsSourceBuilder$1
> 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: 
> Unregistering Hadoop:service=ReduceTask,name=UgiMetrics
> 2011-09-19 18:06:49,860 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source 
> JvmMetrics
> 2011-09-19 18:06:49,860 DEBUG 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class 
> org.apache.hadoop.metrics2.source.JvmMetrics
> 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: 
> Unregistering Hadoop:service=ReduceTask,name=JvmMetrics
> 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: 
> Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Stats
> 2011-09-19 18:06:49,860 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system 
> stopped.
> 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: 
> Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Control
> 2011-09-19 18:06:49,860 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system 
> shutdown complete.
> {code}
> Which means that tasks is supposed to have stopped within 20 secs, whereas 
> the process itself is stuck for more than 15 hours. From AM log, also found 
> that this task was sending its update regularly. ps -ef | grep java was also 
> showing that process is still alive.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to