dybyte commented on issue #10675:
URL: https://github.com/apache/seatunnel/issues/10675#issuecomment-4168870058

   > While tracing the full failure chain for this incident, we found a second related gap that explains
   > why jobs appeared RUNNING in the UI while producing zero checkpoints for days.
   
   I think this might be a separate issue.
   In a normal `RUNNING` state, the coordinator does not send cancel operations to tasks, so I’m not fully convinced that the missing `CancelTaskOperation` alone explains why the job stayed `RUNNING` in the UI.
   
   > When the coordinator pod is killed mid-cleanup (the race condition described in this issue), the
   > worker's Debezium reader never receives CancelTaskOperation. That operation is sent from the
   > coordinator thread via PhysicalVertex.noticeTaskExecutionServiceCancel() — when the pod dies,
   > the RPC is never delivered. The worker runs indefinitely as an orphan: data flows to the sink, but
   > with no checkpoint coordinator alive, no barriers are injected and no S3 writes happen. The
   > checkpoint position is frozen at the moment the coordinator died.
   
   #10506 seems to address a similar scenario, although it’s not merged yet. For now, the force-stop API can be used as a workaround in this situation.

