[ 
https://issues.apache.org/jira/browse/FLINK-38536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086727#comment-18086727
 ] 

Martijn Visser commented on FLINK-38536:
----------------------------------------

>From my AI analyzer:

The failures are a test-harness data race, not a production bug. 
FinalizeOnMasterTest (and its sibling ExecutionGraphFinishTest) wired the 
scheduler with ComponentMainThreadExecutorServiceAdapter.forMainThread() as the 
JobManager main-thread executor while passing a separate single-threaded I/O 
executor. forMainThread() is backed by DirectScheduledExecutorService, whose 
execute() runs the command inline on the calling thread rather than on one 
dedicated main thread.

Execution#deploy() builds the TaskDeploymentDescriptor on the I/O executor via 
CompletableFuture.supplyAsync(..., ioExecutor), then composes the continuation 
back to the main thread with thenComposeAsync(..., mainThreadExecutor). Because 
forMainThread() executes inline, those continuations — TDD creation, task 
submission, and the deployment-completion handling that can call markFailed — 
ran on the I/O thread, concurrently with the test thread still inside 
startScheduling() (and, in ExecutionGraphFinishTest, later mutating state via 
markFinished()). The unsynchronized concurrent mutation of the execution graph 
produced both observed signatures:

- IllegalStateException: BUG: trying to schedule a region which is not in 
CREATED state (region vertices mutated mid-scheduling), and
- expected: RUNNING but was: FAILING (a background deployment callback failed 
an execution and triggered failover).

The async I/O-executor deploy path was introduced by FLINK-38114 (asynchronous 
offloading of TaskRestore), which is why these tests started flaking now. 
Production scheduling code is unchanged.

Fix

Run all execution-graph interactions on a dedicated single-threaded JobManager 
main-thread executor (forSingleThreadExecutor over 
TestingUtils.jmMainThreadExecutorExtension()), keeping the separate I/O 
executor, so the deployment callbacks are serialized with the test logic 
instead of racing on the I/O thread. No assertion or functional test logic 
changed. The fix mirrors the pattern already used by ExecutionGraphSuspendTest 
and SchedulerTestingUtils. The same remedy is applied preemptively to 
ExecutionGraphFinishTest, which shares the wiring and code path.

> FinalizeOnMasterTest failed in test_cron_hadoop313 core
> -------------------------------------------------------
>
>                 Key: FLINK-38536
>                 URL: https://issues.apache.org/jira/browse/FLINK-38536
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.2.0
>            Reporter: Ruan Hang
>            Priority: Major
>              Labels: pull-request-available
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=70334&view=logs&j=d89de3df-4600-5585-dadc-9bbc9a5e661c&t=d0a8f428-a6bd-5382-3317-923776675ef0
> {noformat}
> Oct 19 01:45:54 01:45:54.121 [ERROR] Tests run: 2, Failures: 1, Errors: 0, 
> Skipped: 0, Time elapsed: 0.440 s <<< FAILURE! -- in 
> org.apache.flink.runtime.executiongraph.FinalizeOnMasterTest
> Oct 19 01:45:54 01:45:54.121 [ERROR] 
> org.apache.flink.runtime.executiongraph.FinalizeOnMasterTest.testFinalizeIsCalledUponSuccess
>  -- Time elapsed: 0.038 s <<< FAILURE!
> Oct 19 01:45:54 org.opentest4j.AssertionFailedError: 
> Oct 19 01:45:54 
> Oct 19 01:45:54 expected: RUNNING
> Oct 19 01:45:54  but was: FAILING
> Oct 19 01:45:54       at 
> org.apache.flink.runtime.executiongraph.FinalizeOnMasterTest.testFinalizeIsCalledUponSuccess(FinalizeOnMasterTest.java:72)
> Oct 19 01:45:54       at 
> java.base/java.lang.reflect.Method.invoke(Method.java:568)
> Oct 19 01:45:54       at 
> java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
> Oct 19 01:45:54       at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
> Oct 19 01:45:54       at 
> java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
> Oct 19 01:45:54       at 
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
> Oct 19 01:45:54       at 
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to