Prashant Bhardwaj created FLINK-39704:
-----------------------------------------

             Summary: Kubernetes HA can recover a globally terminal FAILED 
application job after leadership revoke/reacquire
                 Key: FLINK-39704
                 URL: https://issues.apache.org/jira/browse/FLINK-39704
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 2.2.0, 2.4.0
            Reporter: Prashant Bhardwaj
         Attachments: jm-ha-reanimation-repro-current-2026-05-18.log, 
jm-ha-reanimation-repro-events-2026-05-18.txt, 
jm-ha-reanimation-repro-pod-describe-2026-05-18.txt, 
jm-ha-reanimation-repro-previous-2026-05-18.log

In a Kubernetes HA application cluster, a job that has already reached the 
globally terminal FAILED state can be recovered and restarted with the same 
JobID if Kubernetes leadership is revoked/reacquired immediately after the 
terminal transition.

Observed with apache/flink:2.2.0 and Kubernetes HA.

*Timeline from repro:*
{noformat}
20:52:51.075  Task failure after TaskManager deletion
20:52:51.119  Job e7ce38da0a5b4651ce64453d6ffaa25b switched RUNNING -> FAILING
20:52:51.122  Job e7ce38da0a5b4651ce64453d6ffaa25b switched FAILING -> FAILED
20:52:52.615  KubernetesLeaderElector observed empty leader holder
20:52:52.616  Leadership revoked
20:52:52.618  Dispatcher reported same job as terminal SUSPENDED
20:52:52.921  DefaultExecutionPlanStore released execution plan 
e7ce38da0a5b4651ce64453d6ffaa25b
20:52:52.926  Same job id was retrieved from KubernetesStateHandleStore
20:52:53.035  Same StreamGraph(jobId: e7ce38da0a5b4651ce64453d6ffaa25b) was 
recovered
20:53:11.340  Same job switched CREATED -> RUNNING
{noformat}

*Expected:*
Once a job reaches globally terminal FAILED, later leadership revocation/close 
should not overwrite or mask the globally terminal result as SUSPENDED. HA 
metadata should be cleaned up as a globally terminal job, and the same job 
should not be recovered.
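
The expected behaviour amounts to a "first globally terminal result wins" rule, with the cleanup decision following from the stored result. The following is a minimal sketch of that rule, not Flink's actual code: Status, TerminalResultGuard, report and cleanupFor are hypothetical names used only for illustration, and the enum models just the states named in this report.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model, not Flink's classes: only the states named in this report.
enum Status {
    RUNNING(false), FAILING(false), SUSPENDED(false), FAILED(true);

    final boolean globallyTerminal;

    Status(boolean globallyTerminal) {
        this.globallyTerminal = globallyTerminal;
    }
}

// Keeps the first globally terminal result and ignores anything reported later.
final class TerminalResultGuard {
    private final AtomicReference<Status> result = new AtomicReference<>();

    /** Records a result unless a globally terminal one is already stored. */
    Status report(Status reported) {
        return result.updateAndGet(
                current -> current != null && current.globallyTerminal ? current : reported);
    }

    /** Cleanup follows the stored result: remove for good, or release for recovery. */
    String cleanupFor() {
        Status current = result.get();
        return current != null && current.globallyTerminal ? "REMOVE" : "RELEASE";
    }

    public static void main(String[] args) {
        TerminalResultGuard guard = new TerminalResultGuard();
        guard.report(Status.FAILED);
        // Leadership revocation arrives afterwards and reports SUSPENDED.
        System.out.println(guard.report(Status.SUSPENDED) + " " + guard.cleanupFor());
    }
}
```

In this model the late SUSPENDED report is dropped, so the cleanup decision stays "REMOVE" instead of falling back to "RELEASE", which is the path that leaves the job recoverable.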

*Actual:*
Leadership revocation closes the running JobMaster/Dispatcher path with a 
synthetic SUSPENDED result after the real FAILED result has been reported. The 
execution plan is released rather than permanently removed, so the same JobID 
remains recoverable from Kubernetes HA storage and the job is started again.

A secondary issue is also visible in the same churn window:
DefaultLeaderElectionService receives a grant while issuedLeaderSessionID is 
already set and throws:

java.lang.IllegalStateException: The leadership should have been granted 
while not having the leadership acquired.

This crashes the JobManager entrypoint, but the reanimation has already 
happened before the fatal error: the failed job was released/recovered from HA 
metadata.
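
The ordering that trips this check can be illustrated with a simplified model. This is a sketch under assumptions, not Flink's DefaultLeaderElectionService: LeaderSessionModel and its methods are hypothetical, and the model only shows that a second grant arriving before the issued session ID has been cleared fails the precondition. It does not explain why the ID was still set in the repro.

```java
import java.util.UUID;

// Simplified, hypothetical model of the check described above; not Flink's class.
final class LeaderSessionModel {
    private UUID issuedLeaderSessionID; // null while leadership is not held

    /** A grant is only valid while no session ID is currently issued. */
    void onGrantLeadership(UUID sessionId) {
        if (issuedLeaderSessionID != null) {
            throw new IllegalStateException(
                    "The leadership should have been granted while not having the "
                            + "leadership acquired.");
        }
        issuedLeaderSessionID = sessionId;
    }

    /** Revocation clears the issued session ID so a later grant is accepted. */
    void onRevokeLeadership() {
        issuedLeaderSessionID = null;
    }

    boolean hasLeadership() {
        return issuedLeaderSessionID != null;
    }

    public static void main(String[] args) {
        LeaderSessionModel model = new LeaderSessionModel();
        model.onGrantLeadership(UUID.randomUUID());
        try {
            // Second grant without an intervening revoke: the precondition fails.
            model.onGrantLeadership(UUID.randomUUID());
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```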

*Reproduction outline:*
1. Run a Kubernetes HA application cluster with restart-strategy.type: none.
2. Use a persistent HA storage directory.
3. Delete the TaskManager so the job reaches FAILED.
4. Immediately after observing RUNNING -> FAILING, set holderIdentity in the 
cluster leader ConfigMap annotation to an empty value, forcing leadership 
loss and reacquisition.
5. Observe FAILED followed by SUSPENDED/release/recovery of the same JobID.
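
To check a repro run mechanically, the symptom in step 5 can be reduced to "any state transition for a JobID after that JobID reached FAILED". The sketch below assumes the transitions have already been extracted from the JobManager logs in time order. ReanimationCheck, transition and reanimated are hypothetical helper names, and the {jobId, from, to} event shape is an assumption for illustration, not a Flink log format.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for checking a repro run; the event shape is an assumption.
final class ReanimationCheck {
    /** One observed state transition, stored as {jobId, fromState, toState}. */
    static String[] transition(String jobId, String from, String to) {
        return new String[] {jobId, from, to};
    }

    /** Returns job IDs that show any transition after they reached FAILED. */
    static Set<String> reanimated(List<String[]> transitionsInTimeOrder) {
        Set<String> failed = new HashSet<>();
        Set<String> reanimated = new LinkedHashSet<>();
        for (String[] t : transitionsInTimeOrder) {
            if (failed.contains(t[0])) {
                reanimated.add(t[0]); // activity after the terminal state
            }
            if ("FAILED".equals(t[2])) {
                failed.add(t[0]);
            }
        }
        return reanimated;
    }

    public static void main(String[] args) {
        String jobId = "e7ce38da0a5b4651ce64453d6ffaa25b";
        // Transitions taken from the timeline in this report.
        List<String[]> observed = Arrays.asList(
                transition(jobId, "RUNNING", "FAILING"),
                transition(jobId, "FAILING", "FAILED"),
                transition(jobId, "CREATED", "RUNNING"));
        System.out.println(reanimated(observed));
    }
}
```

An empty result means no job showed activity after FAILED. A non-empty result lists the JobIDs that were recovered after their terminal transition.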



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
