[ 
https://issues.apache.org/jira/browse/FLINK-35438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849850#comment-17849850
 ] 

Rob Young edited comment on FLINK-35438 at 5/28/24 3:45 AM:
------------------------------------------------------------

I agree there's a race that's hard to reproduce, I can only provoke it by 
adding in a thread sleep in the spot where it can occur in 
`MockOperatorCoordinatorContext`
{code:java}
@Override
public void failJob(Throwable cause) {
    jobFailed = true;
    try {
        Thread.sleep(50);
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }
    jobFailureReason = cause;
    jobFailedFuture.complete(null);
} 

public boolean isJobFailed() {
    return jobFailed;
}

public Throwable getJobFailureReason() {
    return jobFailureReason;
}{code}
If getJobFailureReason() is called between jobFailed and jobFailureReason being 
assigned then the test thread can unexpectedly observe a null jobFailureReason 
while jobFailed is true.

The same race is present in master.

Happy to contribute a fix if someone could please assign me to the ticket. 
Synchronizing access to the two fields would be one way. Perhaps you could 
replace the two fields with one that contains the failed bool and reason so 
they are assigned together.


was (Author: JIRAUSER298079):
I agree there's a race that's hard to reproduce, I can only provoke it by 
adding in a thread sleep in the spot where it can occur in 
`MockOperatorCoordinatorContext`


{code:java}
@Override
public void failJob(Throwable cause) {
    jobFailed = true;
    try {
        Thread.sleep(50);
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }
    jobFailureReason = cause;
    jobFailedFuture.complete(null);
} 

public boolean isJobFailed() {
    return jobFailed;
}

public Throwable getJobFailureReason() {
    return jobFailureReason;
}{code}
If getJobFailureReason() is called between jobFailed and jobFailureReason being 
assigned then the test thread can unexpectedly observe a null jobFailureReason 
while jobFailed is true.

The same race is present in master

Happy to contribute a fix if someone could please assign me to the ticket

> SourceCoordinatorTest.testErrorThrownFromSplitEnumerator fails on wrong error
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-35438
>                 URL: https://issues.apache.org/jira/browse/FLINK-35438
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.18.2
>            Reporter: Ryan Skraba
>            Priority: Critical
>              Labels: test-stability
>
> * 1.18 Java 11 / Test (module: core) 
> https://github.com/apache/flink/actions/runs/9201159842/job/25309197630#step:10:7375
> We expect to see an artificial {{Error("Test Error")}} being reported in the 
> test as the cause of a job failure, but the reported job failure is null:
> {code}
> Error: 02:32:31 02:32:31.950 [ERROR] Tests run: 18, Failures: 1, Errors: 0, 
> Skipped: 0, Time elapsed: 0.187 s <<< FAILURE! - in 
> org.apache.flink.runtime.source.coordinator.SourceCoordinatorTest
> Error: 02:32:31 02:32:31.950 [ERROR] 
> org.apache.flink.runtime.source.coordinator.SourceCoordinatorTest.testErrorThrownFromSplitEnumerator
>   Time elapsed: 0.01 s  <<< FAILURE!
> May 23 02:32:31 org.opentest4j.AssertionFailedError: 
> May 23 02:32:31 
> May 23 02:32:31 expected: 
> May 23 02:32:31   java.lang.Error: Test Error
> May 23 02:32:31       at 
> org.apache.flink.runtime.source.coordinator.SourceCoordinatorTest.testErrorThrownFromSplitEnumerator(SourceCoordinatorTest.java:296)
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 23 02:32:31       ...(57 remaining lines not displayed - this can be 
> changed with Assertions.setMaxStackTraceElementsDisplayed)
> May 23 02:32:31  but was: 
> May 23 02:32:31   null
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>  Method)
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> May 23 02:32:31       at 
> org.apache.flink.runtime.source.coordinator.SourceCoordinatorTest.testErrorThrownFromSplitEnumerator(SourceCoordinatorTest.java:322)
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 23 02:32:31       at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 23 02:32:31       at 
> java.base/java.lang.reflect.Method.invoke(Method.java:566)
> May 23 02:32:31       at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:727)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:156)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:147)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:86)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(InterceptingExecutableInvoker.java:103)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.lambda$invoke$0(InterceptingExecutableInvoker.java:93)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:92)
> May 23 02:32:31       at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:86)
> {code}
> This looks like it's a multithreading error with the test 
> {{MockOperatorCoordinatorContext}}, perhaps where {{isJobFailure}} can return 
> true before the reason has been populated. I couldn't reproduce it after 
> running it 1M times.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to