dybyte commented on issue #10675:
URL: https://github.com/apache/seatunnel/issues/10675#issuecomment-4170102558

   > [@dybyte](https://github.com/dybyte) Thanks for the clarification — agreed 
   > this should be a separate issue.
   > 
   > To clarify the two symptoms separately:
   > 
   > * Job showing RUNNING in UI = zombie entry remaining in runningJobInfoIMap
   >   (the IMap cleanup race, covered by this fix)
   > * Worker running 14 days with no checkpoints = CancelTaskOperation never
   >   delivered because coordinator died before cleanJob() completed
   > 
   > You're right that our proposed fix direction has a gap — canceling all 
   > task groups when the deploying coordinator departs would also fire during 
   > normal master failover, incorrectly canceling healthy jobs. We hadn't 
   > considered that case fully.
   
   Got it, thanks for the clarification.
   
   It seems there are two separate issues involved here:
   
   1) The orphan task scenario, where the CancelTaskOperation is never delivered. This appears to be related 
to #10506 and can be mitigated with the force-stop API for now.
   
   2) The zombie entry in runningJobInfoIMap due to the cleanup race, which is 
the main issue being addressed in this discussion.
   
   So I think the current fix should focus on the cleanup logic described in 
the main issue.

