MartijnVisser opened a new pull request, #28642:
URL: https://github.com/apache/flink/pull/28642
## What is the purpose of the change
Backport of the FLINK-38441 and FLINK-39182 test-stability fixes from master
to release-2.2 (both are already on release-2.3). The two JIRAs are bundled
because they fix the same root cause in two sibling test classes; each is a
separate clean `git cherry-pick -x` commit, so either can be reverted on its
own. This PR is independent of the parallel FLINK-39921/FLINK-39929 backport PR.
Failures observed on release-2.2 nightlies:
- `ExecutionGraphRestartTest.testCancelWhileFailing` (expected RUNNING but
was FAILING) and `testFailingExecutionAfterRestart` (expected FINISHED but was
FAILED) in [build
76712](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76712&view=results),
`test_ci core` leg.
- `ExecutionGraphCoLocationRestartTest.testConstraintsAfterRestart` timing
out in `waitForAllExecutionsPredicate` in [build
76611](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76611&view=results)
and [build
76677](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76677&view=results),
adaptive-scheduler core legs.
Root cause: since FLINK-38114, TaskDeploymentDescriptor creation is
offloaded to the I/O executor and deploy continuations complete on background
threads, tripping the thread-identity assertion of the
`ComponentMainThreadExecutorServiceAdapter.forMainThread()` test executor. The
assertion failure surfaces as a deployment failure (spurious FAILING/FAILED job
state) or as executions never reaching DEPLOYING (predicate timeout). Both
master fixes move the tests onto a real single-threaded main-thread executor
instead. Test-only change (including the `SlotPoolUtils` test helper, whose
public API is unchanged); the production scheduler is not affected.
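
The failure mode described above can be reproduced with plain JDK classes. The sketch below is illustrative only and uses no Flink code: `Component`, `markDeploying`, and both `deploy...` methods are hypothetical names, and the assumption is that the test executor's assertion amounts to a check against the thread it was bound to. The first method registers a continuation that runs on the I/O thread and trips the check; the second hands the continuation to a real single-threaded executor, which is the approach the master fixes are described as taking.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Plain-JDK illustration; none of these names are Flink APIs. */
final class MainThreadAssertionSketch {

    /** Hypothetical component that may only be mutated from its owner thread. */
    static final class Component {
        private final Thread owner;
        private volatile String state = "CREATED";

        Component(Thread owner) {
            this.owner = owner;
        }

        void markDeploying() {
            // Stand-in for the thread-identity assertion of the test executor.
            if (Thread.currentThread() != owner) {
                throw new IllegalStateException(
                        "called from " + Thread.currentThread().getName());
            }
            state = "DEPLOYING";
        }

        String state() {
            return state;
        }
    }

    /** Blocks until the continuation is registered, then "builds" a descriptor. */
    private static String awaitThenBuildDescriptor(CountDownLatch registered) {
        try {
            registered.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "descriptor";
    }

    /** Continuation runs on the I/O thread that completes the future, so the check fails. */
    public static String deployOnCompletingThread() throws Exception {
        ExecutorService io = Executors.newSingleThreadExecutor();
        try {
            // Bound to the calling (test) thread.
            Component component = new Component(Thread.currentThread());
            CountDownLatch registered = new CountDownLatch(1);
            CompletableFuture<String> result =
                    CompletableFuture.supplyAsync(() -> awaitThenBuildDescriptor(registered), io)
                            .thenAccept(descriptor -> component.markDeploying())
                            .handle((ignored, failure) ->
                                    failure == null ? component.state() : "FAILED");
            registered.countDown(); // let the I/O thread complete the future
            return result.get(5, TimeUnit.SECONDS);
        } finally {
            io.shutdownNow();
        }
    }

    /** Continuation is handed back to a dedicated single-threaded "main" executor. */
    public static String deployOnMainThreadExecutor() throws Exception {
        ExecutorService io = Executors.newSingleThreadExecutor();
        ExecutorService main = Executors.newSingleThreadExecutor();
        try {
            // Capture the executor's only worker thread and bind the component to it.
            Thread mainThread =
                    CompletableFuture.supplyAsync(Thread::currentThread, main)
                            .get(5, TimeUnit.SECONDS);
            Component component = new Component(mainThread);
            CountDownLatch registered = new CountDownLatch(1);
            CompletableFuture<String> result =
                    CompletableFuture.supplyAsync(() -> awaitThenBuildDescriptor(registered), io)
                            .thenAcceptAsync(descriptor -> component.markDeploying(), main)
                            .handle((ignored, failure) ->
                                    failure == null ? component.state() : "FAILED");
            registered.countDown();
            return result.get(5, TimeUnit.SECONDS);
        } finally {
            io.shutdownNow();
            main.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("completing thread: " + deployOnCompletingThread());
        System.out.println("main-thread executor: " + deployOnMainThreadExecutor());
    }
}
```

Run as written, the first call should report `FAILED` and the second `DEPLOYING`. The latch only makes the sketch deterministic by guaranteeing the continuation is registered before the future completes; it has no counterpart in the tests themselves.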
## Brief change log
Clean `git cherry-pick -x` of the merged master fixes (byte-identical,
original authorship preserved):
- FLINK-38441 (master 94c65494b20, #27060): run
`ExecutionGraphCoLocationRestartTest` via `TestingComponentMainThreadExecutor`.
- FLINK-39182 (master 3fb0d04d225): run `ExecutionGraphRestartTest` via
`ComponentMainThreadExecutorServiceAdapter.forSingleThreadExecutor`, adjusting
the `SlotPoolUtils` test helper (public method signatures preserved; only a
private overload removed).
## Verifying this change
This change is already covered by existing tests. On this branch,
`ExecutionGraphRestartTest` (7 tests) and `ExecutionGraphCoLocationRestartTest`
(1 test) pass 8/8; the same classes also passed 5 consecutive runs while the
picks were being prepared.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (test-only)
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable
---
##### Was generative AI tooling used to co-author this PR?
- [X] Yes (Claude Code)
Generated-by: Claude Code (Claude Fable 5)