prashantbh opened a new pull request, #28201: URL: https://github.com/apache/flink/pull/28201
## What is the purpose of the change Fixes [FLINK-39704](https://issues.apache.org/jira/browse/FLINK-39704): a race between a JobMaster's globally terminal job completion and a leadership revocation that arrives in the same window. The terminal result (`FAILED`/`CANCELED`/`FINISHED`) was silently masked by a synthetic `SUSPENDED`, leaving the JobGraph and HA metadata behind. On the next leadership grant, the same `JobID` was recovered and reanimated as a zombie — observed with Application Mode on Kubernetes HA; the same race is above the `LeaderElection` SPI and can also affect session clusters and ZooKeeper HA. The fix has two parts that close two distinct orderings of the same underlying race. ## Brief change log - **`DefaultJobMasterServiceProcess.closeAsync()`** — defer `resultFuture.completeExceptionally(JobNotFinishedException)` until after `jobMasterService.closeAsync()` settles. Previously it ran synchronously inside `closeAsync()`, so an in-flight terminal completion from the scheduler (`jobReachedGloballyTerminalState`) would race the eager failure and lose; the scheduler's later `resultFuture.complete(...)` then became a no-op. After the change, the scheduler has its full drain window to publish, and `JobNotFinishedException` is only declared if nothing else completes the future first. - **`JobMasterServiceLeadershipRunner`** — cache a globally terminal result the moment it is observed on the result-forwarding path (`rememberGloballyTerminalResultIfCurrentProcess`), and flush it from either `completeResultFutureAfterClose` (on runner close) or `grantLeadership` (on re-grant). Drain is chained inside `sequentialOperation` so it runs only after the prior process's stop and forwarding have fully settled. Revoke intentionally does **not** clear `currentJobMasterServiceProcessLeaderId`, so late terminal callbacks from the closing process can still satisfy the cache's leader-equality check. ## Verifying this change This change added tests and can be verified as follows: - **`DefaultJobMasterServiceProcessTest#testTerminalResultPublishedDuringCloseWinsOverJobNotFinished`** — holds the `JobMasterService` termination future open, calls `serviceProcess.closeAsync()`, asserts the result is not yet completed, calls `jobReachedGloballyTerminalState(FAILED)` mid-close, then releases the service termination. Asserts the runner result is `FAILED`, not `JobNotFinishedException`. Fails on the old code, passes on the new code. - **`JobMasterServiceLeadershipRunnerTest#testCloseAsyncFlushesCachedGloballyTerminalResultAfterRevoke`** and **`#testGrantLeadershipFlushesCachedTerminalResultObservedAfterRevoke`** — cover the runner-level cache + flush behavior for the close path and the re-grant path respectively, with the terminal result completing after `notLeader()`. Both fail on a runner that fences `currentJobMasterServiceProcessLeaderId` on revoke. - **`JobMasterServiceLeadershipRunnerTest#testCloseAsyncCompletesWithGloballyTerminalResultObservedBeforeLeadershipRevoke`** — parameterized over `FAILED`/`FINISHED`/`CANCELED`, covers the symmetric ordering (terminal observed before revoke). - **End-to-end repro** — the pre-fix end-to-end repro is documented in the JIRA description; the added regression tests cover the two code-level race orderings. Run locally with: ``` mvn -pl flink-runtime -am -Dtest='JobMasterServiceLeadershipRunnerTest,DefaultJobMasterServiceProcessTest' test ``` ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): **no** - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **no** - The serializers: **no** - The runtime per-record code paths (performance sensitive): **no** - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: **yes** — touches the JobManager leadership runner and the JobMaster process lifecycle; relevant for both Kubernetes HA and ZooKeeper HA since the race lives above the `LeaderElection` SPI - The S3 file system connector: **no** ## Documentation - Does this pull request introduce a new feature? **no** (bug fix) - If yes, how is the feature documented? **not applicable** --- ##### Was generative AI tooling used to co-author this PR? - [X] Yes — Claude and OpenAI Codex were used as pair-programming assistants for analysis and test drafting. The author reviewed and validated the final code and tests. Generated-by: Claude (Opus 4.7); Codex (GPT-5.5) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
