MartijnVisser opened a new pull request, #28642:
URL: https://github.com/apache/flink/pull/28642

   ## What is the purpose of the change
   
   Backport of the FLINK-38441 and FLINK-39182 test-stability fixes from master 
to release-2.2 (both are already on release-2.3). The two JIRAs are bundled 
because they fix the same root cause in two sibling test classes; each is a 
separate clean `-x` cherry-pick, so individual reverts remain possible. This PR 
is independent of the parallel FLINK-39921/FLINK-39929 backport PR.
   
   Observed on release-2.2 nightlies:
   
     - `ExecutionGraphRestartTest.testCancelWhileFailing` (expected RUNNING but 
was FAILING) and `testFailingExecutionAfterRestart` (expected FINISHED but was 
FAILED) in [build 
76712](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76712&view=results),
 `test_ci core` leg.
     - `ExecutionGraphCoLocationRestartTest.testConstraintsAfterRestart` timing 
out in `waitForAllExecutionsPredicate` in [build 
76611](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76611&view=results)
 and [build 
76677](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76677&view=results),
 adaptive-scheduler core legs.
   
   Root cause: since FLINK-38114, TaskDeploymentDescriptor creation is 
offloaded to the I/O executor and deploy continuations complete on background 
threads. This trips the thread-identity assertion of the 
`ComponentMainThreadExecutorServiceAdapter.forMainThread()` test executor. The 
assertion failure surfaces either as a deployment failure (spurious 
FAILING/FAILED job state) or as executions never reaching DEPLOYING (predicate 
timeout). Both master fixes instead run the tests on a real single-threaded 
main-thread executor. The change is test-only (including the `SlotPoolUtils` 
test helper, whose public API is unchanged); the production scheduler is not 
affected.
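
   For reviewers who want to see the failure mode in isolation, here is a 
minimal plain-JDK sketch. It is not Flink code: the class and method names are 
hypothetical, and it assumes the `forMainThread()` test executor behaves like a 
direct executor pinned to the thread that created it. A latch forces the 
"descriptor" future to complete only after the continuation is registered, so 
the continuation deterministically runs on the I/O thread in the direct case.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names exist in Flink.
public class MainThreadExecutorSketch {

    /** Simulates one deployment and returns the resulting state. */
    static String deploy(boolean useRealMainThreadExecutor) throws Exception {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        ExecutorService singleThread = Executors.newSingleThreadExecutor();
        try {
            final Executor mainThreadExecutor;
            final Thread mainThread;
            if (useRealMainThreadExecutor) {
                // Real executor: the "main thread" is its single worker thread.
                mainThreadExecutor = singleThread;
                mainThread =
                        CompletableFuture.supplyAsync(Thread::currentThread, singleThread).get();
            } else {
                // Assumed model of a direct test executor: tasks run inline on
                // whichever thread submits them, but the caller is "the main thread".
                mainThreadExecutor = Runnable::run;
                mainThread = Thread.currentThread();
            }

            // Descriptor creation is offloaded to the I/O executor and held back
            // until the continuation below has been registered.
            CountDownLatch continuationRegistered = new CountDownLatch(1);
            CompletableFuture<String> descriptor = CompletableFuture.supplyAsync(() -> {
                try {
                    continuationRegistered.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                }
                return "descriptor";
            }, ioExecutor);

            CompletableFuture<String> state = descriptor
                    .thenApplyAsync(d -> {
                        // Thread-identity assertion, as a main-thread executor would make.
                        if (Thread.currentThread() != mainThread) {
                            throw new IllegalStateException("not running in main thread");
                        }
                        return "DEPLOYING";
                    }, mainThreadExecutor)
                    // An assertion failure surfaces as a failed deployment.
                    .handle((s, failure) -> failure == null ? s : "FAILED");

            continuationRegistered.countDown();
            return state.get(5, TimeUnit.SECONDS);
        } finally {
            ioExecutor.shutdownNow();
            singleThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("direct executor:        " + deploy(false));
        System.out.println("single-thread executor: " + deploy(true));
    }
}
```

   In this sketch the direct executor yields `FAILED`, because the continuation 
runs inline on the I/O thread, while the single-thread executor yields 
`DEPLOYING`, because the continuation is handed over to the dedicated thread 
before the assertion runs.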
   
   ## Brief change log
   
   Clean `git cherry-pick -x` of the merged master fixes (byte-identical, 
original authorship preserved):
   
     - FLINK-38441 (master 94c65494b20, #27060): run 
`ExecutionGraphCoLocationRestartTest` via `TestingComponentMainThreadExecutor`.
     - FLINK-39182 (master 3fb0d04d225): run `ExecutionGraphRestartTest` via 
`ComponentMainThreadExecutorServiceAdapter.forSingleThreadExecutor`, adjusting 
the `SlotPoolUtils` test helper (public method signatures preserved; only a 
private overload removed).
   
   ## Verifying this change
   
   This change is already covered by existing tests. On this branch, 
`ExecutionGraphRestartTest` (7 tests) and `ExecutionGraphCoLocationRestartTest` 
(1 test) pass 8/8; the same classes also passed 5 consecutive runs while the 
picks were being prepared.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (test-only)
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes (Claude Code)
   
   Generated-by: Claude Code (Claude Fable 5)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
