[jira] [Commented] (FLINK-32212) Job restarting indefinitely after an IllegalStateException from BlobLibraryCacheManager
[ https://issues.apache.org/jira/browse/FLINK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843461#comment-17843461 ]

Ricky Saltzer commented on FLINK-32212:
---

Hey [~rmetzger] - I'm also having a hard time reproducing this at the moment, but from what I can tell the issue happens when the JobManager is lost and recreated while the TaskManager K8s pods remain online and are reused by the new JobManager. It also appears to be somewhat inconsistent: the fix for me is to manually delete the JobManager pod and hope the next startup works (it usually does).

> Job restarting indefinitely after an IllegalStateException from BlobLibraryCacheManager
> ---
>
>                 Key: FLINK-32212
>                 URL: https://issues.apache.org/jira/browse/FLINK-32212
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.16.1
>         Environment: Apache Flink Kubernetes Operator 1.4
>            Reporter: Matheus Felisberto
>            Priority: Major
>
> After running for a few hours the job starts to throw IllegalStateException
> and I can't figure out why. To restore the job, I need to manually delete the
> FlinkDeployment so that it is recreated, and then redeploy everything.
> The jar is built into the Docker image, so it is defined according to
> the Operator's documentation:
> {code:java}
> // jarURI: local:///opt/flink/usrlib/my-job.jar
> {code}
> I've tried to move it into /opt/flink/lib/my-job.jar but it didn't work
> either.
>
> {code:java}
> // Source: my-topic (1/2)#30587 (b82d2c7f9696449a2d9f4dc298c0a008_bc764cd8ddf7a0cff126f51c16239658_0_30587) switched from DEPLOYING to FAILED with failure cause:
> java.lang.IllegalStateException: The library registration references a different set of library BLOBs than previous registrations for this job:
> old:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-7237ecbb12b0b021934b0c81aef78396]
> new:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-943737c6790a3ec6870cecd652b956c2]
>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.verifyClassLoader(BlobLibraryCacheManager.java:419)
>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.access$500(BlobLibraryCacheManager.java:359)
>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:235)
>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:202)
>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:336)
>     at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:612)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
>     at java.base/java.lang.Thread.run(Unknown Source)
> {code}
> If there is any other information that can help to identify the problem,
> please let me know.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
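To illustrate the failure mode described in the comment and the quoted exception, here is a minimal sketch of a per-job cache that remembers the first set of library BLOB keys it sees and rejects any later registration with a different set. This is a simplified, hypothetical model (the class `LibraryCacheSketch`, its `register` method, and the job ID are invented for illustration); it is not Flink's actual `BlobLibraryCacheManager` code. The two keys are copied from the exception message; note that they share the same prefix and differ only in the trailing segment.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Simplified, hypothetical model of a per-job library cache; NOT Flink's actual code. */
public class LibraryCacheSketch {

    // Job ID -> set of library BLOB keys recorded at the first registration.
    private final Map<String, Set<String>> registered = new HashMap<>();

    /** Records the BLOB keys for a job; rejects a later set that differs from the first. */
    public void register(String jobId, Set<String> blobKeys) {
        Set<String> previous = registered.putIfAbsent(jobId, Set.copyOf(blobKeys));
        if (previous != null && !previous.equals(blobKeys)) {
            throw new IllegalStateException(
                    "The library registration references a different set of library BLOBs"
                            + " than previous registrations for this job: old:" + previous
                            + " new:" + blobKeys);
        }
    }

    public static void main(String[] args) {
        String job = "my-job"; // hypothetical job ID
        // Keys copied from the exception in the issue description.
        String oldKey = "p-5d91888083d38a3ff0b6c350f05a3013632137c6-7237ecbb12b0b021934b0c81aef78396";
        String newKey = "p-5d91888083d38a3ff0b6c350f05a3013632137c6-943737c6790a3ec6870cecd652b956c2";

        // One long-lived cache, standing in for a TaskManager that stays online.
        LibraryCacheSketch taskManagerCache = new LibraryCacheSketch();
        taskManagerCache.register(job, Set.of(oldKey)); // first registration succeeds
        taskManagerCache.register(job, Set.of(oldKey)); // same set again is accepted

        try {
            // A later registration for the same job with a different key is rejected.
            taskManagerCache.register(job, Set.of(newKey));
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under this model, a cache that outlives the component handing out the keys keeps rejecting the new set on every retry, which would be consistent with the endless restarts reported here. Whether that is what actually happens in the affected deployments is an interpretation that the thread does not confirm.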
[jira] [Commented] (FLINK-32212) Job restarting indefinitely after an IllegalStateException from BlobLibraryCacheManager
[ https://issues.apache.org/jira/browse/FLINK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837354#comment-17837354 ]

Ricky Saltzer commented on FLINK-32212:
---

Verified: we started hitting this pretty consistently when running on K8s with Karpenter and spot instances.
[jira] [Commented] (FLINK-32212) Job restarting indefinitely after an IllegalStateException from BlobLibraryCacheManager
[ https://issues.apache.org/jira/browse/FLINK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17797288#comment-17797288 ]

Ricky Saltzer commented on FLINK-32212:
---

We are hitting this same error after moving our jobs from being manually deployed in K8s to using ArgoCD. However, our failure is a bit different: the job does not fail to deploy, but starts restarting endlessly at a random time after running successfully (e.g. 24 hours later).