Jeff Zhang created TEZ-2260:
-------------------------------

             Summary: AM shut down due to NoSuchMethodError in DAGProtos
                 Key: TEZ-2260
                 URL: https://issues.apache.org/jira/browse/TEZ-2260
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Jeff Zhang


Not sure why this happens; it may be an environment issue.

{code}
2015-04-01 09:08:49,757 INFO [Dispatcher thread: Central] 
history.HistoryEventHandler: 
[HISTORY][DAG:dag_1427850436467_0007_1][Event:TASK_ATTEMPT_FINISHED]: 
vertexName=datagen, taskAttemptId=attempt_1427850436467_0007_1_00_000000_0, 
startTime=1427850527981, finishTime=1427850529750, timeTaken=1769, 
status=SUCCEEDED, errorEnum=, diagnostics=, counters=Counters: 8, File System 
Counters, HDFS_BYTES_READ=0, HDFS_BYTES_WRITTEN=953030, HDFS_READ_OPS=9, 
HDFS_LARGE_READ_OPS=0, HDFS_WRITE_OPS=6, 
org.apache.tez.common.counters.TaskCounter, GC_TIME_MILLIS=46, 
COMMITTED_HEAP_BYTES=257425408, OUTPUT_RECORDS=44195
2015-04-01 09:08:49,757 FATAL [RecoveryEventHandlingThread] 
yarn.YarnUncaughtExceptionHandler: Thread 
Thread[RecoveryEventHandlingThread,5,main] threw an Error.  Shutting down now...
java.lang.NoSuchMethodError: 
org.apache.tez.dag.api.records.DAGProtos$TezCountersProto$Builder.access$26000()Lorg/apache/tez/dag/api/records/DAGProtos$TezCountersProto$Builder;
        at 
org.apache.tez.dag.api.records.DAGProtos$TezCountersProto.newBuilder(DAGProtos.java:24581)
        at 
org.apache.tez.dag.api.DagTypeConverters.convertTezCountersToProto(DagTypeConverters.java:544)
        at 
org.apache.tez.dag.history.events.TaskAttemptFinishedEvent.toProto(TaskAttemptFinishedEvent.java:97)
        at 
org.apache.tez.dag.history.events.TaskAttemptFinishedEvent.toProtoStream(TaskAttemptFinishedEvent.java:120)
        at 
org.apache.tez.dag.history.recovery.RecoveryService.handleRecoveryEvent(RecoveryService.java:403)
        at 
org.apache.tez.dag.history.recovery.RecoveryService.access$700(RecoveryService.java:50)
        at 
org.apache.tez.dag.history.recovery.RecoveryService$1.run(RecoveryService.java:158)
        at java.lang.Thread.run(Thread.java:745)
2015-04-01 09:08:49,757 INFO [Dispatcher thread: Central] impl.TaskAttemptImpl: 
attempt_1427850436467_0007_1_00_000000_0 TaskAttempt Transitioned from RUNNING 
to SUCCEEDED due to event TA_DONE
{code}
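
A NoSuchMethodError inside protobuf-generated code such as DAGProtos typically 
points to mixed library versions on the classpath (e.g. DAGProtos generated 
against one protobuf-java version but a different version loaded at runtime). 
A minimal diagnostic sketch, assuming nothing beyond the class names in the 
trace above, that prints where each suspect class is actually loaded from:

{code}
// Hedged diagnostic: print the jar each suspect class is loaded from.
// If the protobuf runtime and the generated DAGProtos come from
// inconsistent locations/versions, that would explain the
// NoSuchMethodError in the log above.
public class ClasspathCheck {
  public static void main(String[] args) throws Exception {
    String[] suspects = {
        "com.google.protobuf.GeneratedMessage",
        "org.apache.tez.dag.api.records.DAGProtos"
    };
    for (String name : suspects) {
      Class<?> clazz = Class.forName(name);
      java.security.CodeSource src =
          clazz.getProtectionDomain().getCodeSource();
      System.out.println(name + " -> "
          + (src == null ? "<bootstrap/unknown>" : src.getLocation()));
    }
  }
}
{code}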

This issue results in several downstream failures. Because of this error the 
AM goes through recovery in the next attempt, but the next attempt hits the 
following problem; it looks like a datanode crashed.
{code}
2015-04-01 09:09:00,093 WARN [Thread-82] hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 
127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, 
and a client may configure this via 
'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
configuration.
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-04-01 09:09:00,093 WARN [Dispatcher thread: Central] hdfs.DFSClient: Error 
while syncing
java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 
127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, 
and a client may configure this via 
'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
configuration.
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-04-01 09:09:00,094 ERROR [Dispatcher thread: Central] 
recovery.RecoveryService: Error handling summary event, 
eventType=VERTEX_FINISHED
java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[127.0.0.1:56238, 127.0.0.1:56234], original=[127.0.0.1:56238, 
127.0.0.1:56234]). The current failed datanode replacement policy is DEFAULT, 
and a client may configure this via 
'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
configuration.
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
{code}
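
The message itself names the relevant client-side knob. On a mini cluster with 
only two or three datanodes there is no spare node to swap into the write 
pipeline, so recovery always fails this way. A sketch of the usual workaround, 
assuming a test environment where relaxed durability is acceptable (the 
property name is taken verbatim from the log above):

{code}
import org.apache.hadoop.conf.Configuration;

public class HdfsClientConf {
  public static Configuration relaxedPipelineRecovery() {
    Configuration conf = new Configuration();
    // NEVER tells the client to keep writing on the surviving
    // datanodes instead of demanding a replacement node; suitable
    // for small test clusters, risky for production durability.
    conf.set(
        "dfs.client.block.write.replace-datanode-on-failure.policy",
        "NEVER");
    return conf;
  }
}
{code}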

Because of the above issue (the error while writing the summary recovery log), 
the AM shuts down, and on the client side a SessionNotRunning exception is 
thrown without any diagnostic info.
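
Until the AM-side error reporting is improved, a client can at least surface 
whatever diagnostics YARN recorded for the dead AM when SessionNotRunning is 
caught. A sketch using the standard YarnClient API; the helper name and wiring 
are assumptions:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class AmDiagnostics {
  // Hypothetical helper: SessionNotRunning carries no detail here, so
  // fall back to the diagnostics YARN kept for the application.
  public static String fetch(Configuration conf, ApplicationId appId)
      throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(conf);
    yarn.start();
    try {
      ApplicationReport report = yarn.getApplicationReport(appId);
      return report.getDiagnostics();
    } finally {
      yarn.stop();
    }
  }
}
{code}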


