MartijnVisser opened a new pull request, #28634:
URL: https://github.com/apache/flink/pull/28634
## What is the purpose of the change
Fixes the recurring `KeyedComplexChainTest.testMigrationAndRestore` failures
(`NoResourceAvailableException`, Azure builds 76400, 76448, 76627). This is
the follow-up investigation deferred in FLINK-39918.
`KeyedJob.StatefulWindowFunction.close()` asserted `applyCalled` on every
close, but `close()` also runs on the cancellation path. The
GENERATE/MIGRATE jobs are always stopped via non-draining
cancel-with-savepoint, so under CI load a window subtask can be closed before
its element is processed. The `AssertionFailedError` thrown from `close()`
fails the only TaskManager of the shared static MiniCluster (`NUM_TMS=1`, no
restart). As a result, the subsequent restore step and every later savepoint
parameterization starve with `NoResourceAvailableException` after the
slot-request timeout.
There is no slot leak and no data loss: the runtime released all slots
correctly, and the window's keyed state is restored from the savepoint
independently of `apply()` having run.
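The failure sequence described above can be modeled without Flink. The sketch below is a simplified stand-in, not the actual `KeyedJob` code: the class `UnconditionalCloseAssertionSketch`, the nested `WindowFunctionModel`, and the helper `closeFails` are hypothetical names introduced only for illustration. It shows that an unconditional assertion in `close()` passes when an element was processed first and throws when the function is closed on a cancellation path before any element arrived.

```java
/** Hypothetical, Flink-free model of the pre-fix behaviour; all names are illustrative. */
final class UnconditionalCloseAssertionSketch {

    /** Stand-in for a window function whose close() asserts on every path. */
    static final class WindowFunctionModel {
        private boolean applyCalled;

        void apply() {
            applyCalled = true;
        }

        void close() {
            // Models the pre-fix check: it runs on normal completion AND on cancellation.
            if (!applyCalled) {
                throw new AssertionError("apply() was not called before close()");
            }
        }
    }

    /** Returns true if close() threw, i.e. the task would fail instead of shutting down cleanly. */
    static boolean closeFails(boolean elementProcessedBeforeClose) {
        WindowFunctionModel fn = new WindowFunctionModel();
        if (elementProcessedBeforeClose) {
            fn.apply();
        }
        try {
            fn.close();
            return false;
        } catch (AssertionError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("close after apply fails: " + closeFails(true));
        System.out.println("close on cancel before apply fails: " + closeFails(false));
    }
}
```

In this model the second call is the problematic case: the job was stopped before the element reached the function, so the assertion reports a failure even though nothing is wrong with the state.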
## Brief change log
- Restrict `StatefulWindowFunction.close()`'s `applyCalled` assertion to
`ExecutionMode.RESTORE`, the only mode whose job runs to completion.
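The change above can be sketched in the same Flink-free style. This is a minimal illustration under stated assumptions, not the patch itself: `ModeGatedCloseAssertionSketch`, `WindowFunctionModel`, and `closeFails` are hypothetical names, and the local `ExecutionMode` enum only mirrors the three modes mentioned in this description.

```java
/** Hypothetical, Flink-free model of the mode-gated assertion; all names are illustrative. */
final class ModeGatedCloseAssertionSketch {

    /** Local stand-in for the test's execution modes (assumed to be exactly these three). */
    enum ExecutionMode { GENERATE, MIGRATE, RESTORE }

    /** Stand-in for a window function that only asserts in close() for RESTORE. */
    static final class WindowFunctionModel {
        private final ExecutionMode mode;
        private boolean applyCalled;

        WindowFunctionModel(ExecutionMode mode) {
            this.mode = mode;
        }

        void apply() {
            applyCalled = true;
        }

        void close() {
            // Only the RESTORE job runs to completion, so only there is a
            // missing apply() call a genuine test failure.
            if (mode == ExecutionMode.RESTORE && !applyCalled) {
                throw new AssertionError("apply() was not called in RESTORE mode");
            }
        }
    }

    /** Returns true if close() threw for the given mode and processing history. */
    static boolean closeFails(ExecutionMode mode, boolean elementProcessedBeforeClose) {
        WindowFunctionModel fn = new WindowFunctionModel(mode);
        if (elementProcessedBeforeClose) {
            fn.apply();
        }
        try {
            fn.close();
            return false;
        } catch (AssertionError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        for (ExecutionMode mode : ExecutionMode.values()) {
            System.out.println(mode + " closed before apply fails: " + closeFails(mode, false));
        }
    }
}
```

With the gate in place, closing a GENERATE or MIGRATE function before `apply()` ran no longer throws, while a RESTORE function that never saw its element still fails the assertion.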
## Verifying this change
This change is already covered by existing tests: `KeyedComplexChainTest`
ran 3x locally, 16/16 green each time. The original failure needs
cancel-with-savepoint to beat element delivery under CI load and is not
reproducible locally.
Coverage trade-off: MIGRATE loses the `applyCalled` fail-safe, but the
RESTORE run re-validates the migrated state end-to-end via the untouched
`apply()` state-comparison assertions, so migration correctness is still
verified.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? no
---
##### Was generative AI tooling used to co-author this PR?
- [X] Yes (Claude Opus 4.8, via Claude Code)
Generated-by: Claude Opus 4.8 (1M context)