dybyte commented on issue #10675: URL: https://github.com/apache/seatunnel/issues/10675#issuecomment-4168870058
> While tracing the full failure chain for this incident, we found a second related gap that explains why jobs appeared RUNNING in the UI while producing zero checkpoints for days.

I think this might be a separate issue. In a normal `RUNNING` state, the coordinator does not send cancel operations to tasks, so I'm not fully convinced that the missing `CancelTaskOperation` alone explains why the job stayed `RUNNING` in the UI.

> When the coordinator pod is killed mid-cleanup (the race condition described in this issue), the worker's Debezium reader never receives CancelTaskOperation. That operation is sent from the coordinator thread via PhysicalVertex.noticeTaskExecutionServiceCancel(); when the pod dies, the RPC is never delivered. The worker runs indefinitely as an orphan: data flows to the sink, but with no checkpoint coordinator alive, no barriers are injected and no S3 writes happen. The checkpoint position is frozen at the moment the coordinator died.

#10506 seems to address a similar scenario, although it is not merged yet. For now, the force-stop API can be used as a workaround in this situation.
