nateab commented on PR #27579:
URL: https://github.com/apache/flink/pull/27579#issuecomment-4849496026

   @peach12345 Thanks for confirming you're hitting this too. That helps make
   the case for the fix.
   
   A few options until it's merged:
   
   1) Recovering a job that's stuck in the loop right now: the stale state
   lives in the TaskManagers' cached classloaders (keyed per job), not in the
   JobManager. Recycling the TaskManagers (e.g. deleting the TM pods so they
   come back fresh) clears the cached classloader that still holds the old
   blob keys, so the next deployment resolves cleanly. A full stop + resubmit
   works too, but restarting just the TMs is usually enough to break the loop.
   
   2) Avoiding the trigger: the mismatch only happens when the job's JARs get
   re-uploaded and produce new PermanentBlobKeys. The key includes a random
   component, so identical content still yields a different key. That
   re-upload typically happens on a JobManager failover, so:
       - keep the JobManager off spot/preemptible nodes (fewer JM restarts), and
       - make sure JobManager HA is enabled, so a failover recovers the
         persisted JobGraph (with its original blob keys) rather than
         resubmitting the job with freshly uploaded JARs.
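
   To make the "random component" point in 2) concrete, here's a rough,
   self-contained sketch. It is not Flink's PermanentBlobKey code:
   `BlobKeySketch`, `Key`, and `upload` are hypothetical names, and the SHA-1
   hash and 16-byte random part are assumptions picked for the illustration.
   It only shows why uploading byte-identical JARs twice can still give two
   keys that don't match.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Arrays;

// Illustrative only: a miniature stand-in, not Flink's actual blob key class.
final class BlobKeySketch {

    /** Hypothetical key: a content hash plus a random part chosen at upload time. */
    static final class Key {
        final byte[] contentHash;
        final byte[] random;

        Key(byte[] contentHash, byte[] random) {
            this.contentHash = contentHash;
            this.random = random;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            // Both parts must match for two keys to be considered equal.
            return Arrays.equals(contentHash, other.contentHash)
                    && Arrays.equals(random, other.random);
        }

        @Override
        public int hashCode() {
            return 31 * Arrays.hashCode(contentHash) + Arrays.hashCode(random);
        }
    }

    private static final SecureRandom RNG = new SecureRandom();

    /** Simulates one upload: hash the bytes, then attach a fresh random component. */
    static Key upload(byte[] jarBytes) {
        try {
            // Assumed hash algorithm for the sketch.
            byte[] hash = MessageDigest.getInstance("SHA-1").digest(jarBytes);
            // Assumed size of the random part for the sketch.
            byte[] random = new byte[16];
            RNG.nextBytes(random);
            return new Key(hash, random);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] jar = "identical jar content".getBytes(StandardCharsets.UTF_8);
        Key first = upload(jar);   // e.g. the original submission
        Key second = upload(jar);  // e.g. a re-upload after a JM failover
        System.out.println("same content hash: "
                + Arrays.equals(first.contentHash, second.contentHash));
        System.out.println("same key: " + first.equals(second));
    }
}
```

   Running `main` should report matching content hashes but non-matching
   keys, which is the situation a classloader cached under the old keys
   ends up in after a re-upload.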


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
