[ 
https://issues.apache.org/jira/browse/FLINK-35438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849850#comment-17849850
 ] 

Rob Young commented on FLINK-35438:
-----------------------------------

I agree there's a race that's hard to reproduce, I can only provoke it by 
adding in a thread sleep in the spot where it can occur in 
`MockOperatorCoordinatorContext`


{code:java}
@Override
public void failJob(Throwable cause) {
    jobFailed = true;
    try {
        Thread.sleep(50);
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }
    jobFailureReason = cause;
    jobFailedFuture.complete(null);
} 

public boolean isJobFailed() {
    return jobFailed;
}

public Throwable getJobFailureReason() {
    return jobFailureReason;
}{code}
If getJobFailureReason() is called between jobFailed and jobFailureReason being 
assigned then the test thread can unexpectedly observe a null jobFailureReason 
while jobFailed is true.

The same race is present in master

Happy to contribute a fix if someone could please assign me to the ticket

> SourceCoordinatorTest.testErrorThrownFromSplitEnumerator fails on wrong error
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-35438
>                 URL: https://issues.apache.org/jira/browse/FLINK-35438
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.18.2
>            Reporter: Ryan Skraba
>            Priority: Critical
>              Labels: test-stability
>
> * 1.18 Java 11 / Test (module: core) 
> https://github.com/apache/flink/actions/runs/9201159842/job/25309197630#step:10:7375
> We expect to see an artificial {{Error("Test Error")}} being reported in the 
> test as the cause of a job failure, but the reported job failure is null:
> {code}
> Error: 02:32:31 02:32:31.950 [ERROR] Tests run: 18, Failures: 1, Errors: 0, 
> Skipped: 0, Time elapsed: 0.187 s <<< FAILURE! - in 
> org.apache.flink.runtime.source.coordinator.SourceCoordinatorTest
> Error: 02:32:31 02:32:31.950 [ERROR] 
> org.apache.flink.runtime.source.coordinator.SourceCoordinatorTest.testErrorThrownFromSplitEnumerator
>   Time elapsed: 0.01 s  <<< FAILURE!
> May 23 02:32:31 org.opentest4j.AssertionFailedError: 
> May 23 02:32:31 
> May 23 02:32:31 expected: 
> May 23 02:32:31   java.lang.Error: Test Error
> May 23 02:32:31       at 
> org.apache.flink.runtime.source.coordinator.SourceCoordinatorTest.testErrorThrownFromSplitEnumerator(SourceCoordinatorTest.java:296)
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 23 02:32:31       ...(57 remaining lines not displayed - this can be 
> changed with Assertions.setMaxStackTraceElementsDisplayed)
> May 23 02:32:31  but was: 
> May 23 02:32:31   null
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>  Method)
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> May 23 02:32:31       at 
> org.apache.flink.runtime.source.coordinator.SourceCoordinatorTest.testErrorThrownFromSplitEnumerator(SourceCoordinatorTest.java:322)
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 23 02:32:31       at 
> java.base/java.lang.reflect.Method.invoke(Method.java:566)
> May 23 02:32:31       at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:727)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:156)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:147)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:86)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(InterceptingExecutableInvoker.java:103)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.lambda$invoke$0(InterceptingExecutableInvoker.java:93)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:92)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:86)
> {code}
> This looks like it's a multithreading error with the test 
> {{MockOperatorCoordinatorContext}}, perhaps where {{isJobFailure}} can return 
> true before the reason has been populated. I couldn't reproduce it after 
> running it 1M times.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to