[ https://issues.apache.org/jira/browse/TEZ-4364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465470#comment-17465470 ]
László Bodor commented on TEZ-4364:
-----------------------------------

Looks like this is caused by TEZ-4338, as I found in a task attempt log: [^syslog_attempt_1640554229092_0001_1_01_000002_0]
{code}
2021-12-26 22:30:39,354 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|: TaskReporter reporter error which will cause the task to fail
java.lang.NullPointerException
	at org.apache.tez.runtime.api.events.EventProtos$InputReadErrorEventProto$Builder.setDestinationLocalhostName(EventProtos.java:2508)
	at org.apache.tez.runtime.api.impl.TezEvent.serializeEvent(TezEvent.java:196)
	at org.apache.tez.runtime.api.impl.TezEvent.write(TezEvent.java:349)
	at org.apache.tez.runtime.api.impl.TezHeartbeatRequest.write(TezHeartbeatRequest.java:98)
	at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:202)
	at org.apache.hadoop.ipc.WritableRpcEngine$Invocation.write(WritableRpcEngine.java:176)
	at org.apache.hadoop.ipc.RpcWritable$WritableWrapper.writeTo(RpcWritable.java:75)
	at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1133)
	at org.apache.hadoop.ipc.Client.call(Client.java:1458)
	at org.apache.hadoop.ipc.Client.call(Client.java:1405)
	at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:251)
	at com.sun.proxy.$Proxy8.heartbeat(Unknown Source)
	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.heartbeat(TaskReporter.java:278)
	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:202)
	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:136)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}
After fixing TestInput to properly fill in the hostname in InputReadErrorEvent, the issue can no longer be reproduced.

> TestFaultTolerance timeout on master
> ------------------------------------
>
>                 Key: TEZ-4364
>                 URL: https://issues.apache.org/jira/browse/TEZ-4364
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: surefire_jstack.log, syslog_attempt_1640554229092_0001_1_01_000002_0
>
>
> The TestFaultTolerance test has become flakier recently. It's important to
> investigate this, because a unit test failure could also imply a product bug
> in the handling of failure scenarios.
> According to the surefire process's jstack, it can be reproduced only by
> TestFaultTolerance.testBasicInputFailureWithoutExitDeadline
> [^surefire_jstack.log]
> {code}
> "Thread-1355" #1569 prio=5 os_prio=31 tid=0x00007fe76660c800 nid=0x43d07 waiting on condition [0x000070002ab38000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:155)
>         at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:142)
>         at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:138)
>         at org.apache.tez.test.TestFaultTolerance.testBasicInputFailureWithoutExitDeadline(TestFaultTolerance.java:351)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>         at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>         at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>         at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>         at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {code}
> This is the point where the test waits for the DAG to finish.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)