[ https://issues.apache.org/jira/browse/TEZ-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389882#comment-14389882 ]
Jeff Zhang commented on TEZ-2260:
---------------------------------

Created TEZ-2261 for adding diagnostics in DAGAppMaster when a recovery error happens.

> AM been shutdown due to NoSuchMethodError in DAGProtos
> ------------------------------------------------------
>
>                 Key: TEZ-2260
>                 URL: https://issues.apache.org/jira/browse/TEZ-2260
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>         Attachments: applog.tar
>
> Not sure why this happens; it may be due to an environment issue.
> {code}
> 2015-04-01 09:08:49,757 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1427850436467_0007_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=datagen, taskAttemptId=attempt_1427850436467_0007_1_00_000000_0, startTime=1427850527981, finishTime=1427850529750, timeTaken=1769, status=SUCCEEDED, errorEnum=, diagnostics=, counters=Counters: 8, File System Counters, HDFS_BYTES_READ=0, HDFS_BYTES_WRITTEN=953030, HDFS_READ_OPS=9, HDFS_LARGE_READ_OPS=0, HDFS_WRITE_OPS=6, org.apache.tez.common.counters.TaskCounter, GC_TIME_MILLIS=46, COMMITTED_HEAP_BYTES=257425408, OUTPUT_RECORDS=44195
> 2015-04-01 09:08:49,757 FATAL [RecoveryEventHandlingThread] yarn.YarnUncaughtExceptionHandler: Thread Thread[RecoveryEventHandlingThread,5,main] threw an Error. Shutting down now...
> java.lang.NoSuchMethodError: org.apache.tez.dag.api.records.DAGProtos$TezCountersProto$Builder.access$26000()Lorg/apache/tez/dag/api/records/DAGProtos$TezCountersProto$Builder;
> 	at org.apache.tez.dag.api.records.DAGProtos$TezCountersProto.newBuilder(DAGProtos.java:24581)
> 	at org.apache.tez.dag.api.DagTypeConverters.convertTezCountersToProto(DagTypeConverters.java:544)
> 	at org.apache.tez.dag.history.events.TaskAttemptFinishedEvent.toProto(TaskAttemptFinishedEvent.java:97)
> 	at org.apache.tez.dag.history.events.TaskAttemptFinishedEvent.toProtoStream(TaskAttemptFinishedEvent.java:120)
> 	at org.apache.tez.dag.history.recovery.RecoveryService.handleRecoveryEvent(RecoveryService.java:403)
> 	at org.apache.tez.dag.history.recovery.RecoveryService.access$700(RecoveryService.java:50)
> 	at org.apache.tez.dag.history.recovery.RecoveryService$1.run(RecoveryService.java:158)
> 	at java.lang.Thread.run(Thread.java:745)
> 2015-04-01 09:08:49,757 INFO [Dispatcher thread: Central] impl.TaskAttemptImpl: attempt_1427850436467_0007_1_00_000000_0 TaskAttempt Transitioned from RUNNING to SUCCEEDED due to event TA_DONE
> {code}
> This issue results in several consequent problems: the error forces the AM to run recovery in the next attempt, but the next attempt then hits the following issue, which looks like a datanode crash.
> {code}
> 2015-04-01 09:09:00,093 WARN [Thread-82] hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
> 2015-04-01 09:09:00,093 WARN [Dispatcher thread: Central] hdfs.DFSClient: Error while syncing
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
> 2015-04-01 09:09:00,094 ERROR [Dispatcher thread: Central] recovery.RecoveryService: Error handling summary event, eventType=VERTEX_FINISHED
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
> {code}
> Because of the above issue (an error while writing the summary recovery log), the AM shuts down, and on the client side a SessionNotRunning exception is thrown without any diagnostic info.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
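A NoSuchMethodError on a generated class such as DAGProtos (here a synthetic `access$26000` accessor) typically means the class was loaded from a stale or duplicate jar rather than the one the caller was compiled against, which fits the "environment issue" guess above. A minimal probe for checking where a class is actually loaded from (the class name `ClasspathProbe` is my own, not part of Tez):

```java
// Hedged sketch: report the jar/location a class was loaded from, to
// spot stale or duplicate jars on the classpath -- a common cause of
// NoSuchMethodError like the one in the log above.
public class ClasspathProbe {

    /** Returns the code-source location of a class, or "bootstrap"
     *  when the class came from the bootstrap classloader. */
    static String locationOf(String className) throws ClassNotFoundException {
        Class<?> c = Class.forName(className);
        java.security.CodeSource cs = c.getProtectionDomain().getCodeSource();
        return cs == null ? "bootstrap" : cs.getLocation().toString();
    }

    public static void main(String[] args) throws Exception {
        // In the AM's environment one would probe the failing class, e.g.
        // locationOf("org.apache.tez.dag.api.records.DAGProtos$TezCountersProto")
        System.out.println(locationOf("java.lang.String"));
    }
}
```

Run with the same classpath the AM uses; if the reported jar differs from the Tez version actually deployed, that mismatch would explain the error.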
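The DFSClient failure in the second attempt is the standard datanode-replacement error on very small clusters: with only two datanodes in the pipeline there is no spare node to swap in. A possible workaround, using the property the log message itself points at, is to relax the policy in the client's hdfs-site.xml. This is a sketch for small/test clusters only; it trades write-pipeline durability for availability and does not fix the crashed datanode:

```xml
<!-- hdfs-site.xml (client side). NEVER disables datanode replacement
     on pipeline failure; reasonable only on 2-3 node test clusters. -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>
```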