[PR] [FLINK-38049] Fix stateless job incorrectly using last-state restore when HA is enabled [flink-kubernetes-operator]

via GitHub Tue, 23 Jun 2026 05:57:59 -0700


lucasgameiroborges opened a new pull request, #1147:
URL: https://github.com/apache/flink-kubernetes-operator/pull/1147


   ## What is the purpose of the change
   
   When HA is enabled and a stateless job is resubmitted (e.g. due to an 
unhealthy cluster), `resubmitJob` unconditionally overrode the upgrade mode to 
`LAST_STATE` and passed `requireHaMetadata=true` to `restoreJob`, ignoring the 
user-configured `STATELESS` upgrade mode. This caused the job to attempt a 
last-state restore from HA metadata instead of starting fresh.
   
   ## Brief change log
   
   - `AbstractJobReconciler#resubmitJob`: skip the `LAST_STATE` mode override 
and the `requireHaMetadata` flag when the spec's upgrade mode is `STATELESS`
   - Add 
`ApplicationReconcilerTest#testRestartUnhealthyStatelessJobWithHaEnabled` to 
reproduce the scenario: a stateless job with HA enabled is resubmitted after an 
unhealthy event while no HA metadata is available. Previously this threw 
`UpgradeFailureException`; now the resubmit succeeds.
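
The decision described in the `resubmitJob` entry above can be pictured roughly as follows. This is a minimal, self-contained sketch and not the operator's actual code: `ResubmitSketch`, `RestorePlan`, and `planResubmit` are hypothetical names, the local `UpgradeMode` enum only stands in for the operator's type, and the behavior of the non-HA branch is an assumption added for completeness.

```java
// Hypothetical illustration of the resubmit decision; not taken from the operator source.
class ResubmitSketch {

    // Local stand-in for the operator's upgrade mode enum.
    enum UpgradeMode { STATELESS, LAST_STATE, SAVEPOINT }

    // Hypothetical holder for the two values the resubmit path has to choose.
    static final class RestorePlan {
        final UpgradeMode mode;
        final boolean requireHaMetadata;

        RestorePlan(UpgradeMode mode, boolean requireHaMetadata) {
            this.mode = mode;
            this.requireHaMetadata = requireHaMetadata;
        }

        @Override
        public String toString() {
            return "RestorePlan{mode=" + mode + ", requireHaMetadata=" + requireHaMetadata + "}";
        }
    }

    static RestorePlan planResubmit(UpgradeMode specMode, boolean haEnabled) {
        // The fix: a STATELESS spec keeps its mode and never requires HA metadata,
        // even when HA is enabled.
        if (specMode == UpgradeMode.STATELESS) {
            return new RestorePlan(UpgradeMode.STATELESS, false);
        }
        // Stateful specs with HA enabled keep the earlier behavior:
        // force LAST_STATE and require HA metadata.
        if (haEnabled) {
            return new RestorePlan(UpgradeMode.LAST_STATE, true);
        }
        // Assumption: without HA, the configured mode is used as-is.
        return new RestorePlan(specMode, false);
    }

    public static void main(String[] args) {
        System.out.println(planResubmit(UpgradeMode.STATELESS, true));
        System.out.println(planResubmit(UpgradeMode.SAVEPOINT, true));
    }
}
```

Under these assumptions, the first call in `main` corresponds to the scenario in the new test: the plan stays `STATELESS` and does not require HA metadata, so missing HA metadata no longer blocks the resubmit.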
   
   ## Verifying this change
   
   New unit test `testRestartUnhealthyStatelessJobWithHaEnabled` in 
`ApplicationReconcilerTest` covers the fix. Existing 
`ApplicationReconcilerTest` and `ApplicationReconcilerUpgradeModeTest` suites 
(109 tests total) continue to pass.
   
   ## Does this pull request potentially affect one of the following areas
   
   - Job lifecycle/upgrade: **yes**; this affects the resubmit path for 
stateless jobs when HA is active
   
   _This is a fix for 
[FLINK-38049](https://issues.apache.org/jira/browse/FLINK-38049)._


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

