[ https://issues.apache.org/jira/browse/FLINK-35438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849850#comment-17849850 ]
Rob Young commented on FLINK-35438:
-----------------------------------

I agree there's a race, and it's hard to reproduce: I can only provoke it by adding a Thread.sleep at the spot where it can occur in `MockOperatorCoordinatorContext`:
{code:java}
@Override
public void failJob(Throwable cause) {
    jobFailed = true;
    // Artificial delay, added only to widen the race window for reproduction.
    try {
        Thread.sleep(50);
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }
    // The reason is assigned only after the flag is already visible as true.
    jobFailureReason = cause;
    jobFailedFuture.complete(null);
}

public boolean isJobFailed() {
    return jobFailed;
}

public Throwable getJobFailureReason() {
    return jobFailureReason;
}{code}
If the test thread calls getJobFailureReason() after jobFailed has been set but before jobFailureReason has been assigned, it can unexpectedly observe a null jobFailureReason while isJobFailed() already returns true.

The same race is present in master.

Happy to contribute a fix if someone could please assign me to the ticket.

> SourceCoordinatorTest.testErrorThrownFromSplitEnumerator fails on wrong error
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-35438
>                 URL: https://issues.apache.org/jira/browse/FLINK-35438
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.18.2
>            Reporter: Ryan Skraba
>            Priority: Critical
>              Labels: test-stability
>
> * 1.18 Java 11 / Test (module: core) https://github.com/apache/flink/actions/runs/9201159842/job/25309197630#step:10:7375
> We expect to see an artificial {{Error("Test Error")}} being reported in the test as the cause of a job failure, but the reported job failure is null:
> {code}
> Error: 02:32:31 02:32:31.950 [ERROR] Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.187 s <<< FAILURE! - in org.apache.flink.runtime.source.coordinator.SourceCoordinatorTest
> Error: 02:32:31 02:32:31.950 [ERROR] org.apache.flink.runtime.source.coordinator.SourceCoordinatorTest.testErrorThrownFromSplitEnumerator Time elapsed: 0.01 s <<< FAILURE!
> May 23 02:32:31 org.opentest4j.AssertionFailedError:
> May 23 02:32:31
> May 23 02:32:31 expected:
> May 23 02:32:31 java.lang.Error: Test Error
> May 23 02:32:31 at org.apache.flink.runtime.source.coordinator.SourceCoordinatorTest.testErrorThrownFromSplitEnumerator(SourceCoordinatorTest.java:296)
> May 23 02:32:31 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> May 23 02:32:31 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 23 02:32:31 ...(57 remaining lines not displayed - this can be changed with Assertions.setMaxStackTraceElementsDisplayed)
> May 23 02:32:31 but was:
> May 23 02:32:31 null
> May 23 02:32:31 at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> May 23 02:32:31 at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> May 23 02:32:31 at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> May 23 02:32:31 at org.apache.flink.runtime.source.coordinator.SourceCoordinatorTest.testErrorThrownFromSplitEnumerator(SourceCoordinatorTest.java:322)
> May 23 02:32:31 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> May 23 02:32:31 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 23 02:32:31 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 23 02:32:31 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> May 23 02:32:31 at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:727)
> May 23 02:32:31 at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
> May 23 02:32:31 at org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
> May 23 02:32:31 at org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:156)
> May 23 02:32:31 at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:147)
> May 23 02:32:31 at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:86)
> May 23 02:32:31 at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(InterceptingExecutableInvoker.java:103)
> May 23 02:32:31 at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.lambda$invoke$0(InterceptingExecutableInvoker.java:93)
> May 23 02:32:31 at org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
> May 23 02:32:31 at org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
> May 23 02:32:31 at org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
> May 23 02:32:31 at org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
> May 23 02:32:31 at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:92)
> May 23 02:32:31 at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:86)
> {code}
> This looks like it's a multithreading error with the test {{MockOperatorCoordinatorContext}}, perhaps where {{isJobFailure}} can return true before the reason has been populated. I couldn't reproduce it after running it 1M times.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
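
The race described in the comment comes from publishing the failure in two separate writes (the flag first, the reason later). As a minimal sketch of one way to close that window, and not the fix actually applied to Flink, both values could be derived from a single future that is completed once with the cause. The class name `RaceFreeFailureContext` is hypothetical, and typing the future as `CompletableFuture<Throwable>` is an assumption; the snippet in the comment completes `jobFailedFuture` with null instead.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for the mock context; not the actual Flink class. */
public class RaceFreeFailureContext {

    // Single source of truth (assumption: the future carries the cause itself).
    // complete() publishes the cause atomically, so there is no moment at which
    // the job is "failed" but the reason is still unset.
    private final CompletableFuture<Throwable> jobFailedFuture = new CompletableFuture<>();

    public void failJob(Throwable cause) {
        // Only the first call takes effect; later calls leave the cause unchanged.
        jobFailedFuture.complete(cause);
    }

    public boolean isJobFailed() {
        return jobFailedFuture.isDone();
    }

    public Throwable getJobFailureReason() {
        // Returns null until failJob has been called.
        return jobFailedFuture.getNow(null);
    }
}
```

A smaller change to the existing fields would be to assign `jobFailureReason` before `jobFailed` and declare `jobFailed` as `volatile`, so that a thread that sees the flag as true is guaranteed to also see the reason. Either variant assumes the test checks `isJobFailed()` before reading the reason.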