[ https://issues.apache.org/jira/browse/MAPREDUCE-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy reassigned MAPREDUCE-3058: ---------------------------------------- Assignee: Hitesh Shah (was: Vinod Kumar Vavilapalli) Hitesh - can you please take a look since you are already working on related areas? Currently the TT sends SIGTERM followed by SIGKILL... > Sometimes task keeps on running while its Syslog says that it is shutdown > ------------------------------------------------------------------------- > > Key: MAPREDUCE-3058 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-3058 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: contrib/gridmix, mrv2 > Affects Versions: 0.23.0 > Reporter: Karam Singh > Assignee: Hitesh Shah > Priority: Critical > Fix For: 0.23.0 > > Attachments: MAPREDUCE-3058-20110923.txt > > > While running GridMixV3, one of the jobs got stuck for 15 hrs. After clicking > on the Job-page, found one of its reduces to be stuck. Looking at syslog of > the stuck reducer, found this: > Task-logs' head: > {code} > 2011-09-19 17:57:22,002 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period > at 10 second(s). > 2011-09-19 17:57:22,002 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system > started > {code} > Task-logs' tail: > {code} > 2011-09-19 18:06:49,818 INFO org.apache.hadoop.hdfs.DFSClient: Exception in > createBlockOutputStream java.io.IOException: Bad connect ack with > firstBadLink as <DATANODE1> > 2011-09-19 18:06:49,818 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery > for block > BP-1405370709-<NAMENODE>-1316452621953:blk_-7004355226367468317_79871 in > pipeline <DATANODE2>, <DATANODE1>: bad datanode <DATANODE1> > 2011-09-19 18:06:49,818 DEBUG > org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtocol: > lastAckedSeqno = 26870 > 2011-09-19 18:06:49,820 DEBUG org.apache.hadoop.ipc.Client: IPC Client > (26613121) connection to <NAMENODE> from gridperf sending #454 > 2011-09-19 18:06:49,826 DEBUG org.apache.hadoop.ipc.Client: IPC Client > (26613121) connection to <<NAMENODE> from gridperf got value #454 > 2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.ipc.RPC: Call: > getAdditionalDatanode 8 > 2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Connecting to > datanode <DATANODE2> > 2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Send buf size > 131071 > 2011-09-19 18:06:49,833 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer > Exception > java.io.EOFException: Premature EOF: no length prefix available > at > org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415) > 2011-09-19 18:06:49,837 WARN org.apache.hadoop.mapred.YarnChild: Exception > running child : java.io.EOFException: Premature EOF: no length prefix > available > at > org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415) > 2011-09-19 18:06:49,837 DEBUG org.apache.hadoop.ipc.Client: IPC Client > (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #455 > 2011-09-19 18:06:49,839 DEBUG org.apache.hadoop.ipc.Client: IPC Client > (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value > #455 > 2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.RPC: Call: statusUpdate 3 > 2011-09-19 18:06:49,840 INFO org.apache.hadoop.mapred.Task: Runnning cleanup > for the task > 2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.Client: IPC Client > (26613121) connection to <NAMENODE> from gridperf sending #456 > 2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client > (26613121) connection to <NAMENODE> from gridperf got value #456 > 2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.RPC: Call: delete 18 > 2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client > (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #457 > 2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.Client: IPC Client > (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value > #457 > 2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.RPC: Call: > reportDiagnosticInfo 1 > 2011-09-19 18:06:49,859 DEBUG > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: refCount=1 > 2011-09-19 18:06:49,859 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask > metrics system... > 2011-09-19 18:06:49,859 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source > UgiMetrics > 2011-09-19 18:06:49,859 DEBUG > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class > org.apache.hadoop.metrics2.lib.MetricsSourceBuilder$1 > 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: > Unregistering Hadoop:service=ReduceTask,name=UgiMetrics > 2011-09-19 18:06:49,860 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source > JvmMetrics > 2011-09-19 18:06:49,860 DEBUG > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class > org.apache.hadoop.metrics2.source.JvmMetrics > 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: > Unregistering Hadoop:service=ReduceTask,name=JvmMetrics > 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: > Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Stats > 2011-09-19 18:06:49,860 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system > stopped. > 2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: > Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Control > 2011-09-19 18:06:49,860 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system > shutdown complete. > {code} > Which means that tasks is supposed to have stopped within 20 secs, whereas > the process itself is stuck for more than 15 hours. From AM log, also found > that this task was sending its update regularly. ps -ef | grep java was also > showing that process is still alive. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira