[ 
https://issues.apache.org/jira/browse/TEZ-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983568#comment-13983568
 ] 

Tassapol Athiapinya commented on TEZ-1088:
------------------------------------------

[~hitesh] I will investigate whether the node blacklisting config set in 
TestFaultTolerance.java is actually taking effect.

To explain further: depending on timing, the AM can blacklist a node after 
failures. Once blacklisting happens, the attempts on that node are 
discarded and relaunched on other nodes. In that case the attempt number 
goes up, which makes the output differ from what the test expects:
{code}
Expected output: 4 got: 5
{code}
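
As a rough illustration of the mechanism described above, here is a toy model 
in plain Java. It is only a sketch and does not use any Tez classes: the class 
name BlacklistAttemptSketch, the method attemptsLaunched, and both parameters 
are hypothetical, and the numbers are picked for illustration, not derived 
from TestProcessor. It only shows that every attempt discarded because of 
blacklisting pushes the succeeding attempt number up by one.

```java
public class BlacklistAttemptSketch {

    /**
     * Hypothetical model, not Tez code: counts how many attempts are launched
     * for one task before an attempt finally succeeds.
     *
     * @param plannedFailures      attempts the test deliberately fails
     * @param discardedByBlacklist extra attempts thrown away because the AM
     *                             blacklisted the node they were running on
     */
    static int attemptsLaunched(int plannedFailures, int discardedByBlacklist) {
        int attempt = 0;
        int remainingPlannedFailures = plannedFailures;
        int remainingDiscards = discardedByBlacklist;
        while (true) {
            attempt++;
            if (remainingDiscards > 0) {
                // Attempt was on a blacklisted node: discarded and relaunched elsewhere.
                remainingDiscards--;
                continue;
            }
            if (remainingPlannedFailures > 0) {
                // Attempt fails on purpose, as configured by the test.
                remainingPlannedFailures--;
                continue;
            }
            // First attempt that is neither discarded nor a planned failure succeeds.
            return attempt;
        }
    }

    public static void main(String[] args) {
        int withoutBlacklisting = attemptsLaunched(3, 0);
        int withBlacklisting = attemptsLaunched(3, 1);
        System.out.println("without blacklisting: " + withoutBlacklisting
                + ", with blacklisting: " + withBlacklisting);
    }
}
```

If the test's expected value is tied to a fixed attempt count, any 
timing-dependent discard shifts the observed value, which would explain why 
the failure is intermittent.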


> Flaky Test: 
> TestFaultTolerance.testInputFailureCausesRerunAttemptWithinMaxAttemptSuccess
> ----------------------------------------------------------------------------------------
>
>                 Key: TEZ-1088
>                 URL: https://issues.apache.org/jira/browse/TEZ-1088
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>            Assignee: Tassapol Athiapinya
>
> 2014-04-28 20:14:19,100 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.history.recovery.RecoveryService: DAG completed, 
> dagId=dag_1398715972246_0001_9, queueSize=0
> 2014-04-28 20:14:19,147 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.history.HistoryEventHandler: 
> [HISTORY][DAG:dag_1398715972246_0001_9][Event:DAG_FINISHED]: 
> dagId=dag_1398715972246_0001_9, startTime=1398716046982, 
> finishTime=1398716059090, timeTaken=12108, status=FAILED, diagnostics=Vertex 
> re-running, vertexName=v1, vertexId=vertex_1398715972246_0001_9_00
> Vertex failed, vertexName=v2, vertexId=vertex_1398715972246_0001_9_01, 
> diagnostics=[Task failed, taskId=task_1398715972246_0001_9_01_000001, 
> diagnostics=[AttemptID:attempt_1398715972246_0001_9_01_000001_0 Info:Error: 
> exceptionThrown=java.lang.RuntimeException: Expected output mismatch of 
> current FailingProcessor: attempt_1398715972246_0001_9_01_000001_0_10001 dag: 
> testInputFailureCausesRerunAttemptWithinMaxAttemptSuccess vertex: v2 
> taskIndex: 1 taskAttempt: 0
> Expected output: 4 got: 5
>       at 
> org.apache.tez.test.TestProcessor.throwException(TestProcessor.java:98)
>       at org.apache.tez.test.TestProcessor.run(TestProcessor.java:250)
>       at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:307)
>       at 
> org.apache.hadoop.mapred.YarnTezDagChild$5.run(YarnTezDagChild.java:581)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>       at 
> org.apache.hadoop.mapred.YarnTezDagChild.main(YarnTezDagChild.java:570)
> , errorMessage=Expected output mismatch of current FailingProcessor: 
> attempt_1398715972246_0001_9_01_000001_0_10001 dag: 
> testInputFailureCausesRerunAttemptWithinMaxAttemptSuccess vertex: v2 
> taskIndex: 1 taskAttempt: 0
> Expected output: 4 got: 5
> Container killed by the ApplicationMaster.
> Container killed on request. Exit code is 143
> , AttemptID:attempt_1398715972246_0001_9_01_000001_1 Info:Error: 
> exceptionThrown=java.lang.RuntimeException: Expected output mismatch of 
> current FailingProcessor: attempt_1398715972246_0001_9_01_000001_1_10011 dag: 
> testInputFailureCausesRerunAttemptWithinMaxAttemptSuccess vertex: v2 
> taskIndex: 1 taskAttempt: 1
> Expected output: 4 got: 5
>       at 
> org.apache.tez.test.TestProcessor.throwException(TestProcessor.java:98)
>       at org.apache.tez.test.TestProcessor.run(TestProcessor.java:250)
>       at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:307)
>       at 
> org.apache.hadoop.mapred.YarnTezDagChild$5.run(YarnTezDagChild.java:581)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>       at 
> org.apache.hadoop.mapred.YarnTezDagChild.main(YarnTezDagChild.java:570)
> , errorMessage=Expected output mismatch of current FailingProcessor: 
> attempt_1398715972246_0001_9_01_000001_1_10011 dag: 
> testInputFailureCausesRerunAttemptWithinMaxAttemptSuccess vertex: v2 
> taskIndex: 1 taskAttempt: 1
> Expected output: 4 got: 5
> Container released by application, 
> AttemptID:attempt_1398715972246_0001_9_01_000001_2 Info:Error: 
> exceptionThrown=java.lang.RuntimeException: Expected output mismatch of 
> current FailingProcessor: attempt_1398715972246_0001_9_01_000001_2_10003 dag: 
> testInputFailureCausesRerunAttemptWithinMaxAttemptSuccess vertex: v2 
> taskIndex: 1 taskAttempt: 2
> Expected output: 4 got: 6
>       at 
> org.apache.tez.test.TestProcessor.throwException(TestProcessor.java:98)
>       at org.apache.tez.test.TestProcessor.run(TestProcessor.java:250)
>       at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:307)
>       at 
> org.apache.hadoop.mapred.YarnTezDagChild$5.run(YarnTezDagChild.java:581)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>       at 
> org.apache.hadoop.mapred.YarnTezDagChild.main(YarnTezDagChild.java:570)
> , errorMessage=Expected output mismatch of current FailingProcessor: 
> attempt_1398715972246_0001_9_01_000001_2_10003 dag: 
> testInputFailureCausesRerunAttemptWithinMaxAttemptSuccess vertex: v2 
> taskIndex: 1 taskAttempt: 2
> Expected output: 4 got: 6
> Container released by application, 
> AttemptID:attempt_1398715972246_0001_9_01_000001_3 Info:Error: 
> exceptionThrown=java.lang.RuntimeException: Expected output mismatch of 
> current FailingProcessor: attempt_1398715972246_0001_9_01_000001_3_10001 dag: 
> testInputFailureCausesRerunAttemptWithinMaxAttemptSuccess vertex: v2 
> taskIndex: 1 taskAttempt: 3
> Expected output: 4 got: 8
>       at 
> org.apache.tez.test.TestProcessor.throwException(TestProcessor.java:98)
>       at org.apache.tez.test.TestProcessor.run(TestProcessor.java:250)
>       at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:307)
>       at 
> org.apache.hadoop.mapred.YarnTezDagChild$5.run(YarnTezDagChild.java:581)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>       at 
> org.apache.hadoop.mapred.YarnTezDagChild.main(YarnTezDagChild.java:570)
> , errorMessage=Expected output mismatch of current FailingProcessor: 
> attempt_1398715972246_0001_9_01_000001_3_10001 dag: 
> testInputFailureCausesRerunAttemptWithinMaxAttemptSuccess vertex: v2 
> taskIndex: 1 taskAttempt: 3
> Expected output: 4 got: 8], Vertex failed as one or more tasks failed. 
> failedTasks:1]
> DAG failed due to vertex failure. failedVertices:1 killedVertices:0, 
> counters=Counters: 12, org.apache.tez.common.counters.DAGCounter, 
> NUM_FAILED_TASKS=6, TOTAL_LAUNCHED_TASKS=9, File System Counters, HDFS: 
> BYTES_READ=0, HDFS: BYTES_WRITTEN=0, HDFS: READ_OPS=0, HDFS: 
> LARGE_READ_OPS=0, HDFS: WRITE_OPS=0, 
> org.apache.tez.common.counters.TaskCounter, GC_TIME_MILLIS=21, 
> CPU_MILLISECONDS=-12330, PHYSICAL_MEMORY_BYTES=548294656, 
> VIRTUAL_MEMORY_BYTES=4034506752, COMMITTED_HEAP_BYTES=310968320
> 2014-04-28 20:14:19,147 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.app.dag.impl.DAGImpl: DAG: dag_1398715972246_0001_9 
> finished with state: FAILED
> 2014-04-28 20:14:19,148 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.app.dag.impl.DAGImpl: dag_1398715972246_0001_9 
> transitioned from RUNNING to FAILED
> 2014-04-28 20:14:19,148 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.app.DAGAppMaster: DAG completed, 
> dagId=dag_1398715972246_0001_9, dagState=FAILED
> 2014-04-28 20:14:19,148 INFO [AsyncDispatcher event handler] 
> org.apache.tez.common.TezUtils: Redirecting log files based on addend: 
> dag_1398715972246_0001_9_post
> 2014-04-28 20:14:19,148 INFO [ContainerLauncher #4] 
> org.apache.tez.dag.app.launcher.ContainerLauncherImpl: Processing the event 
> EventType: CONTAINER_STOP_REQUEST



