Re: [PR] [Fix][Zeta] Remove zombie job entry in terminal state during master s… [seatunnel]

via GitHub Thu, 02 Apr 2026 08:09:58 -0700


ricky2129 commented on code in PR #10692:
URL: https://github.com/apache/seatunnel/pull/10692#discussion_r3028663207



##########
seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/CoordinatorService.java:
##########
@@ -721,6 +721,16 @@ private void restoreJobFromMasterActiveSwitch(@NonNull 
Long jobId, @NonNull JobI
             return;
         }
 
+        if (jobState instanceof JobStatus && ((JobStatus) 
jobState).isEndState()) {
+            logger.warning(
+                    String.format(
+                            "Job %s is in terminal state %s in 
runningJobInfoIMap, "
+                                    + "removing zombie entry to prevent 
incorrect restore",
+                            jobId, jobState));
+            runningJobInfoIMap.remove(jobId);
+            return;
+        }

Review Comment:
   @dybyte Thanks — you're right, our fix leaves orphan entries in 
runningJobStateIMap and
     runningJobStateTimestampsIMap (all pipeline/task keys that removeJobIMap() 
normally handles).
   
     neverNeedRestore() on its own is just this.needRestore = false — a flag 
setter used
     only by SubPlan.canRestorePipeline(). It doesn't trigger cleanup.
   
     Following the normal restore path (pending queue + 
updateJobState(PENDING)) would
     eventually reach cleanJob() but introduces a small risk: PENDING is not a 
terminal
     state, so if the coordinator crashes in that window, the zombie re-enters 
the restore
     loop with needRestore=true on a fresh instance — exactly the original bug, 
now with
     a stale checkpoint.
   
     Cleaner approach: call jobMaster.init() to build the PhysicalPlan 
structure, then
     call jobMaster.cleanJob() directly (it's public) — bypassing the pending 
queue
     entirely. No updateJobState(PENDING), no re-execution risk. Wrap in 
try-catch so if
     init() fails (e.g. S3 unavailable), we fall back to direct 
runningJobInfoIMap.remove().
   
   Happy to update the PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [Fix][Zeta] Remove zombie job entry in terminal state during master s… [seatunnel]

Reply via email to