zhangshenghang opened a new pull request, #10686:
URL: https://github.com/apache/seatunnel/pull/10686

   ### Purpose of this pull request
   
   This PR fixes an engine-side terminal-state convergence bug after worker 
node failure.
   
   When a worker goes offline, the engine can start removing entries from the 
distributed running-job state maps before all asynchronous task/pipeline/job 
callbacks have finished. Before this change, `PhysicalVertex`, `SubPlan`, and 
`PhysicalPlan` all assumed that the current state still exists in the 
distributed map and called `current.equals(...)` / `current.isEndState()` 
directly on the value read from it. If the entry has already been cleaned, the 
read returns `null`, so those callbacks can throw `NullPointerException` and 
interrupt terminal-state convergence.
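
The failure mode can be illustrated with a minimal, self-contained sketch. It is not the SeaTunnel code: `CleanupRaceDemo`, its `State` enum, and the `task-1` key are hypothetical, and a plain `ConcurrentHashMap` stands in for the distributed state map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in classes for illustration only; not the SeaTunnel implementation.
class CleanupRaceDemo {
    enum State {
        RUNNING, FAILED;

        boolean isEndState() {
            return this == FAILED;
        }
    }

    // Mirrors the unsafe pattern: the caller assumes the map entry still exists.
    static boolean unsafeIsEndState(Map<String, State> stateMap, String key) {
        State current = stateMap.get(key); // null once cleanup has removed the entry
        return current.isEndState();       // NullPointerException when current is null
    }

    public static void main(String[] args) {
        Map<String, State> stateMap = new ConcurrentHashMap<>();
        stateMap.put("task-1", State.RUNNING);

        // Simulate cleanup winning the race against a late callback.
        stateMap.remove("task-1");

        try {
            unsafeIsEndState(stateMap, "task-1");
        } catch (NullPointerException e) {
            System.out.println("late callback failed: state entry already cleaned");
        }
    }
}
```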
   
   This PR makes those state transitions null-safe by:
   - falling back to the local in-memory state when the distributed state has 
already been cleaned
   - skipping timestamp persistence when the timestamp map entry has already 
been removed
   - avoiding re-writing already-cleaned state back into the distributed maps
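
The three rules above can be sketched roughly as follows. This is an illustrative approximation, not the code in this PR: `NullSafeTransitionSketch`, its fields, and `updateState` are hypothetical names, and `ConcurrentHashMap`-style maps stand in for the distributed state and timestamp maps.

```java
import java.util.Map;

// Hypothetical sketch of a null-safe state transition; not the SeaTunnel implementation.
class NullSafeTransitionSketch {
    enum State {
        RUNNING, FAILED, FINISHED;

        boolean isEndState() {
            return this != RUNNING;
        }
    }

    private final Map<String, State> stateMap;      // stand-in for the distributed state map
    private final Map<String, long[]> timestampMap; // stand-in for the distributed timestamp map
    private final String key;
    private volatile State localState = State.RUNNING; // local in-memory copy of the state

    NullSafeTransitionSketch(
            Map<String, State> stateMap, Map<String, long[]> timestampMap, String key) {
        this.stateMap = stateMap;
        this.timestampMap = timestampMap;
        this.key = key;
    }

    // Rule 1: fall back to the local state when the distributed entry is gone.
    State currentState() {
        State distributed = stateMap.get(key);
        return distributed != null ? distributed : localState;
    }

    boolean updateState(State target) {
        State distributed = stateMap.get(key);
        State current = distributed != null ? distributed : localState;
        if (current.isEndState() || current.equals(target)) {
            return false; // already terminal, or nothing to change
        }
        localState = target;

        // Rule 3: write back only if the entry still holds the value we read.
        // A conditional replace never re-creates an entry that cleanup removed.
        if (distributed != null) {
            stateMap.replace(key, distributed, target);
        }

        // Rule 2: record the timestamp only if the timestamp entry still exists.
        timestampMap.computeIfPresent(key, (k, stamps) -> {
            stamps[target.ordinal()] = System.currentTimeMillis();
            return stamps;
        });
        return true;
    }
}
```

With this shape, a callback that arrives after cleanup still moves the local state to a terminal value, while both maps stay empty.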
   
   It also adds targeted regression tests for the cleanup-race path and an 
engine E2E scenario for the `BATCH + no checkpoint + job.retry.times=0` failure 
path. The E2E scenario asserts that a job converges to a terminal state after a 
node shutdown instead of hanging in an intermediate state.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No user-facing API/config change. This changes failure handling so jobs are 
less likely to hang in an intermediate state after node failure.
   
   ### How was this patch tested?
   
   Verified locally (Spotless formatting checks only):
   - `./mvnw -nsu -pl seatunnel-engine/seatunnel-engine-server spotless:check`
   - `./mvnw -nsu -pl seatunnel-e2e/seatunnel-engine-e2e/connector-seatunnel-e2e-base spotless:check`
   
   Additional notes:
   - Added targeted regression test: `StateTransitionCleanupTest`
   - Added engine E2E coverage: `ClusterFailureNoRestoreIT`
   - Full Maven test/compile validation in this checkout is currently blocked 
by unrelated upstream build issues in other reactor modules (for example 
`seatunnel-config-shade`), so this PR is opened as draft.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

