[ https://issues.apache.org/jira/browse/TEZ-4364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor updated TEZ-4364:
------------------------------
    Description:
TLDR: after TEZ-4388,

The TestFaultTolerance test has become flakier recently. It is important to investigate this, because a unit test failure could also indicate a product bug in the handling of failure scenarios.

According to the jstack of the surefire process, the problem can only be reproduced with TestFaultTolerance.testBasicInputFailureWithoutExitDeadline
[^surefire_jstack.log]
{code}
"Thread-1355" #1569 prio=5 os_prio=31 tid=0x00007fe76660c800 nid=0x43d07 waiting on condition [0x000070002ab38000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:155)
	at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:142)
	at org.apache.tez.test.TestFaultTolerance.runDAGAndVerify(TestFaultTolerance.java:138)
	at org.apache.tez.test.TestFaultTolerance.testBasicInputFailureWithoutExitDeadline(TestFaultTolerance.java:351)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{code}
This is the point at which the test waits for the DAG to finish.

> TestFaultTolerance timeout on master - TestInput fix after TEZ-4338
> -------------------------------------------------------------------
>
>                 Key: TEZ-4364
>                 URL: https://issues.apache.org/jira/browse/TEZ-4364
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: surefire_jstack.log, syslog_attempt_1640554229092_0001_1_01_000002_0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h

--
This message was sent by Atlassian Jira
(v8.20.1#820001)