[ https://issues.apache.org/jira/browse/TEZ-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389882#comment-14389882 ]

Jeff Zhang commented on TEZ-2260:
---------------------------------

Created TEZ-2261 for adding diagnostics in DAGAppMaster when a recovery 
error happens.

> AM shut down due to NoSuchMethodError in DAGProtos
> --------------------------------------------------
>
>                 Key: TEZ-2260
>                 URL: https://issues.apache.org/jira/browse/TEZ-2260
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>         Attachments: applog.tar
>
>
> Not sure why this happens; it may be due to an environment issue.
> {code}
> 2015-04-01 09:08:49,757 INFO [Dispatcher thread: Central] 
> history.HistoryEventHandler: 
> [HISTORY][DAG:dag_1427850436467_0007_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=datagen, taskAttemptId=attempt_1427850436467_0007_1_00_000000_0, 
> startTime=1427850527981, finishTime=1427850529750, timeTaken=1769, 
> status=SUCCEEDED, errorEnum=, diagnostics=, counters=Counters: 8, File System 
> Counters, HDFS_BYTES_READ=0, HDFS_BYTES_WRITTEN=953030, HDFS_READ_OPS=9, 
> HDFS_LARGE_READ_OPS=0, HDFS_WRITE_OPS=6, 
> org.apache.tez.common.counters.TaskCounter, GC_TIME_MILLIS=46, 
> COMMITTED_HEAP_BYTES=257425408, OUTPUT_RECORDS=44195
> 2015-04-01 09:08:49,757 FATAL [RecoveryEventHandlingThread] 
> yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[RecoveryEventHandlingThread,5,main] threw an Error.  Shutting down 
> now...
> java.lang.NoSuchMethodError: 
> org.apache.tez.dag.api.records.DAGProtos$TezCountersProto$Builder.access$26000()Lorg/apache/tez/dag/api/records/DAGProtos$TezCountersProto$Builder;
>       at 
> org.apache.tez.dag.api.records.DAGProtos$TezCountersProto.newBuilder(DAGProtos.java:24581)
>       at 
> org.apache.tez.dag.api.DagTypeConverters.convertTezCountersToProto(DagTypeConverters.java:544)
>       at 
> org.apache.tez.dag.history.events.TaskAttemptFinishedEvent.toProto(TaskAttemptFinishedEvent.java:97)
>       at 
> org.apache.tez.dag.history.events.TaskAttemptFinishedEvent.toProtoStream(TaskAttemptFinishedEvent.java:120)
>       at 
> org.apache.tez.dag.history.recovery.RecoveryService.handleRecoveryEvent(RecoveryService.java:403)
>       at 
> org.apache.tez.dag.history.recovery.RecoveryService.access$700(RecoveryService.java:50)
>       at 
> org.apache.tez.dag.history.recovery.RecoveryService$1.run(RecoveryService.java:158)
>       at java.lang.Thread.run(Thread.java:745)
> 2015-04-01 09:08:49,757 INFO [Dispatcher thread: Central] 
> impl.TaskAttemptImpl: attempt_1427850436467_0007_1_00_000000_0 TaskAttempt 
> Transitioned from RUNNING to SUCCEEDED due to event TA_DONE
> {code}
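>
> A NoSuchMethodError inside generated protobuf code like DAGProtos usually 
> means the caller and the callee were compiled from different versions of 
> the class, i.e. the AM classpath mixes mismatched tez or protobuf-java 
> jars. A minimal diagnostic sketch (the version-mismatch cause is my 
> assumption, not confirmed by the logs) that prints which jars the relevant 
> classes are loaded from:
> {code}
> // Hypothetical diagnostic, not part of Tez: print the jar locations of
> // the generated proto class and the protobuf runtime to spot a mismatch.
> import org.apache.tez.dag.api.records.DAGProtos;
>
> public class ClasspathCheck {
>   public static void main(String[] args) {
>     print(DAGProtos.class);
>     print(com.google.protobuf.AbstractMessage.class);
>   }
>
>   static void print(Class<?> c) {
>     // getCodeSource() can be null for bootstrap classes, but both of
>     // these classes come from jars on the application classpath.
>     System.out.println(c.getName() + " <- "
>         + c.getProtectionDomain().getCodeSource().getLocation());
>   }
> }
> {code}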
> This issue results in several follow-on issues, because this error causes 
> the AM to run recovery in the next attempt. But the next attempt hits the 
> following issue; it looks like the datanode crashed.
> {code}
> 2015-04-01 09:09:00,093 WARN [Thread-82] hdfs.DFSClient: DataStreamer 
> Exception
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 
> 127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, 
> and a client may configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
> 2015-04-01 09:09:00,093 WARN [Dispatcher thread: Central] hdfs.DFSClient: 
> Error while syncing
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 
> 127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, 
> and a client may configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
> 2015-04-01 09:09:00,094 ERROR [Dispatcher thread: Central] 
> recovery.RecoveryService: Error handling summary event, 
> eventType=VERTEX_FINISHED
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 
> 127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, 
> and a client may configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
> {code}
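>
> The exception itself names the client-side knob: 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' controls 
> whether the DFS client tries to replace a failed datanode in the write 
> pipeline. A sketch of relaxing it (the property name is a standard HDFS 
> client setting quoted from the exception; whether NEVER is acceptable 
> depends on the cluster, and on a two-datanode pipeline like the one above 
> there is no replacement to find anyway):
> {code}
> import org.apache.hadoop.conf.Configuration;
>
> public class RelaxPipelinePolicy {
>   public static Configuration make() {
>     Configuration conf = new Configuration();
>     // Consulted when dfs.client.block.write.replace-datanode-on-failure
>     // .enable is true (its default). NEVER keeps writing on the remaining
>     // datanodes instead of failing the pipeline when no replacement is
>     // available, e.g. on very small clusters.
>     conf.set(
>         "dfs.client.block.write.replace-datanode-on-failure.policy",
>         "NEVER");
>     return conf;
>   }
> }
> {code}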
> Because of the above issue (the summary recovery log error), the AM shuts 
> down, and on the client side a SessionNotRunning exception is thrown 
> without any diagnostic info.
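>
> On the client side the failure surfaces roughly as below (submitDAG and 
> org.apache.tez.dag.api.SessionNotRunning are the real Tez client API; the 
> tezClient/dag setup and the handling are illustrative only):
> {code}
> // Illustrative fragment: SessionNotRunning carries no AM diagnostics in
> // this scenario (hence TEZ-2261), so point the user at the YARN app logs.
> // Assumes tezClient (org.apache.tez.client.TezClient) and dag
> // (org.apache.tez.dag.api.DAG) are already built and the session started.
> try {
>   DAGClient dagClient = tezClient.submitDAG(dag);
> } catch (SessionNotRunning e) {
>   System.err.println("AM session is gone: " + e.getMessage()
>       + "; check 'yarn logs -applicationId <appId>' for the recovery error");
> }
> {code}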


