ricky2129 commented on code in PR #10692:
URL: https://github.com/apache/seatunnel/pull/10692#discussion_r3028663207
##########
seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/CoordinatorService.java:
##########
@@ -721,6 +721,16 @@ private void restoreJobFromMasterActiveSwitch(@NonNull
Long jobId, @NonNull JobI
return;
}
+ if (jobState instanceof JobStatus && ((JobStatus)
jobState).isEndState()) {
+ logger.warning(
+ String.format(
+ "Job %s is in terminal state %s in
runningJobInfoIMap, "
+ + "removing zombie entry to prevent
incorrect restore",
+ jobId, jobState));
+ runningJobInfoIMap.remove(jobId);
+ return;
+ }
Review Comment:
@dybyte Thanks — you're right, our fix leaves orphan entries in
runningJobStateIMap and
runningJobStateTimestampsIMap (all pipeline/task keys that removeJobIMap()
normally handles).
neverNeedRestore() on its own is just this.needRestore = false — a flag
setter used
only by SubPlan.canRestorePipeline(). It doesn't trigger cleanup.
Following the normal restore path (pending queue +
updateJobState(PENDING)) would
eventually reach cleanJob() but introduces a small risk: PENDING is not a
terminal
state, so if the coordinator crashes in that window, the zombie re-enters
the restore
loop with needRestore=true on a fresh instance — exactly the original bug,
now with
a stale checkpoint.
Cleaner approach: call jobMaster.init() to build the PhysicalPlan
structure, then
call jobMaster.cleanJob() directly (it's public) — bypassing the pending
queue
entirely. No updateJobState(PENDING), no re-execution risk. Wrap in
try-catch so if
init() fails (e.g. S3 unavailable), we fall back to direct
runningJobInfoIMap.remove().
Happy to update the PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]