[jira] [Commented] (FLINK-32212) Job restarting indefinitely after an IllegalStateException from BlobLibraryCacheManager
[ https://issues.apache.org/jira/browse/FLINK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843461#comment-17843461 ] Ricky Saltzer commented on FLINK-32212: --- Hey [~rmetzger] - I'm also having a hard time reproducing at the moment but from what I can tell...the issue happens when the JobManager is lost and recreated, but the TaskManager K8s pods remain online (to be reused) by the new JobManager. It does appear to be somewhat inconsistent in that way too, because the fix for me is to manually delete the JobManager pod and hope the next startup works (usually does). > Job restarting indefinitely after an IllegalStateException from > BlobLibraryCacheManager > --- > > Key: FLINK-32212 > URL: https://issues.apache.org/jira/browse/FLINK-32212 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.16.1 > Environment: Apache Flink Kubernetes Operator 1.4 >Reporter: Matheus Felisberto >Priority: Major > > After running for a few hours the job starts to throw IllegalStateException > and I can't figure out why. To restore the job, I need to manually delete the > FlinkDeployment to be recreated and redeploy everything. > The jar is built-in into the docker image, hence is defined accordingly with > the Operator's documentation: > {code:java} > // jarURI: local:///opt/flink/usrlib/my-job.jar {code} > I've tried to move it into /opt/flink/lib/my-job.jar but it didn't work > either. > > {code:java} > // Source: my-topic (1/2)#30587 > (b82d2c7f9696449a2d9f4dc298c0a008_bc764cd8ddf7a0cff126f51c16239658_0_30587) > switched from DEPLOYING to FAILED with failure cause: > java.lang.IllegalStateException: The library registration references a > different set of library BLOBs than previous registrations for this job: > old:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-7237ecbb12b0b021934b0c81aef78396] > new:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-943737c6790a3ec6870cecd652b956c2] > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.verifyClassLoader(BlobLibraryCacheManager.java:419) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.access$500(BlobLibraryCacheManager.java:359) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:235) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:202) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:336) > at > org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:612) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) > at java.base/java.lang.Thread.run(Unknown Source) {code} > If there is any other information that can help to identify the problem, > please let me know. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32212) Job restarting indefinitely after an IllegalStateException from BlobLibraryCacheManager
[ https://issues.apache.org/jira/browse/FLINK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841985#comment-17841985 ] Robert Metzger commented on FLINK-32212: Thanks [~rickysaltzer]. Have you figured out what exactly is triggering this? If I remember correctly, I was able to trigger it by regenerating the JobGraph on the JobManager (while keeping the jobid the same) and resubmitting the job (e.g. job would restart) while keeping TaskManagers running. So probably some integrity check on the TaskManager noticed that the jars of a jobid have changed, causing this exception. For debugging this, I'd like to find an easy way of reproducing it. My approach seems somewhat "illegal", because messing with the JobGraph that not what Flink has been designed for (in my understanding). So I'm asking if somebody who has observed this issue has a reliable way (and a way of using Flink in an expected / documented manner) to reproduce this issue? > Job restarting indefinitely after an IllegalStateException from > BlobLibraryCacheManager > --- > > Key: FLINK-32212 > URL: https://issues.apache.org/jira/browse/FLINK-32212 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.16.1 > Environment: Apache Flink Kubernetes Operator 1.4 >Reporter: Matheus Felisberto >Priority: Major > > After running for a few hours the job starts to throw IllegalStateException > and I can't figure out why. To restore the job, I need to manually delete the > FlinkDeployment to be recreated and redeploy everything. > The jar is built-in into the docker image, hence is defined accordingly with > the Operator's documentation: > {code:java} > // jarURI: local:///opt/flink/usrlib/my-job.jar {code} > I've tried to move it into /opt/flink/lib/my-job.jar but it didn't work > either. > > {code:java} > // Source: my-topic (1/2)#30587 > (b82d2c7f9696449a2d9f4dc298c0a008_bc764cd8ddf7a0cff126f51c16239658_0_30587) > switched from DEPLOYING to FAILED with failure cause: > java.lang.IllegalStateException: The library registration references a > different set of library BLOBs than previous registrations for this job: > old:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-7237ecbb12b0b021934b0c81aef78396] > new:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-943737c6790a3ec6870cecd652b956c2] > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.verifyClassLoader(BlobLibraryCacheManager.java:419) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.access$500(BlobLibraryCacheManager.java:359) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:235) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:202) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:336) > at > org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:612) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) > at java.base/java.lang.Thread.run(Unknown Source) {code} > If there is any other information that can help to identify the problem, > please let me know. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32212) Job restarting indefinitely after an IllegalStateException from BlobLibraryCacheManager
[ https://issues.apache.org/jira/browse/FLINK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837354#comment-17837354 ] Ricky Saltzer commented on FLINK-32212: --- Verified we started hitting this pretty consistently when using K8s + Karptenter with spot instances. > Job restarting indefinitely after an IllegalStateException from > BlobLibraryCacheManager > --- > > Key: FLINK-32212 > URL: https://issues.apache.org/jira/browse/FLINK-32212 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.16.1 > Environment: Apache Flink Kubernetes Operator 1.4 >Reporter: Matheus Felisberto >Priority: Major > > After running for a few hours the job starts to throw IllegalStateException > and I can't figure out why. To restore the job, I need to manually delete the > FlinkDeployment to be recreated and redeploy everything. > The jar is built-in into the docker image, hence is defined accordingly with > the Operator's documentation: > {code:java} > // jarURI: local:///opt/flink/usrlib/my-job.jar {code} > I've tried to move it into /opt/flink/lib/my-job.jar but it didn't work > either. > > {code:java} > // Source: my-topic (1/2)#30587 > (b82d2c7f9696449a2d9f4dc298c0a008_bc764cd8ddf7a0cff126f51c16239658_0_30587) > switched from DEPLOYING to FAILED with failure cause: > java.lang.IllegalStateException: The library registration references a > different set of library BLOBs than previous registrations for this job: > old:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-7237ecbb12b0b021934b0c81aef78396] > new:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-943737c6790a3ec6870cecd652b956c2] > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.verifyClassLoader(BlobLibraryCacheManager.java:419) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.access$500(BlobLibraryCacheManager.java:359) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:235) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:202) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:336) > at > org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:612) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) > at java.base/java.lang.Thread.run(Unknown Source) {code} > If there is any other information that can help to identify the problem, > please let me know. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32212) Job restarting indefinitely after an IllegalStateException from BlobLibraryCacheManager
[ https://issues.apache.org/jira/browse/FLINK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817914#comment-17817914 ] Dheeraj Panangat commented on FLINK-32212: -- Facing the same issue. I see that of the 4 task managers one of them was with the naming version task-manager-2-1 while others where task-manager-1-2 task-manager-1-3 task-manager-1-4 Do we know when would such scenario arrive? Do we have a resolution ? > Job restarting indefinitely after an IllegalStateException from > BlobLibraryCacheManager > --- > > Key: FLINK-32212 > URL: https://issues.apache.org/jira/browse/FLINK-32212 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.16.1 > Environment: Apache Flink Kubernetes Operator 1.4 >Reporter: Matheus Felisberto >Priority: Major > > After running for a few hours the job starts to throw IllegalStateException > and I can't figure out why. To restore the job, I need to manually delete the > FlinkDeployment to be recreated and redeploy everything. > The jar is built-in into the docker image, hence is defined accordingly with > the Operator's documentation: > {code:java} > // jarURI: local:///opt/flink/usrlib/my-job.jar {code} > I've tried to move it into /opt/flink/lib/my-job.jar but it didn't work > either. > > {code:java} > // Source: my-topic (1/2)#30587 > (b82d2c7f9696449a2d9f4dc298c0a008_bc764cd8ddf7a0cff126f51c16239658_0_30587) > switched from DEPLOYING to FAILED with failure cause: > java.lang.IllegalStateException: The library registration references a > different set of library BLOBs than previous registrations for this job: > old:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-7237ecbb12b0b021934b0c81aef78396] > new:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-943737c6790a3ec6870cecd652b956c2] > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.verifyClassLoader(BlobLibraryCacheManager.java:419) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.access$500(BlobLibraryCacheManager.java:359) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:235) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:202) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:336) > at > org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:612) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) > at java.base/java.lang.Thread.run(Unknown Source) {code} > If there is any other information that can help to identify the problem, > please let me know. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32212) Job restarting indefinitely after an IllegalStateException from BlobLibraryCacheManager
[ https://issues.apache.org/jira/browse/FLINK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809152#comment-17809152 ] Anurag Kyal commented on FLINK-32212: - I am running into this issue as well after my job restarts (trying to figure out the reason for random restarts) but these errors show up. Most of the times it stops after 8-10 exceptions but occasionally it gets into an infinite loop. I am using a pubsub connector to read notifications as the first step in my job and seems like these exceptions are related to this step - at least that is what the UI says. Was anyone able to find other solutions rather than manual intervention which I want to avoid in the middle of night. > Job restarting indefinitely after an IllegalStateException from > BlobLibraryCacheManager > --- > > Key: FLINK-32212 > URL: https://issues.apache.org/jira/browse/FLINK-32212 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.16.1 > Environment: Apache Flink Kubernetes Operator 1.4 >Reporter: Matheus Felisberto >Priority: Major > > After running for a few hours the job starts to throw IllegalStateException > and I can't figure out why. To restore the job, I need to manually delete the > FlinkDeployment to be recreated and redeploy everything. > The jar is built-in into the docker image, hence is defined accordingly with > the Operator's documentation: > {code:java} > // jarURI: local:///opt/flink/usrlib/my-job.jar {code} > I've tried to move it into /opt/flink/lib/my-job.jar but it didn't work > either. > > {code:java} > // Source: my-topic (1/2)#30587 > (b82d2c7f9696449a2d9f4dc298c0a008_bc764cd8ddf7a0cff126f51c16239658_0_30587) > switched from DEPLOYING to FAILED with failure cause: > java.lang.IllegalStateException: The library registration references a > different set of library BLOBs than previous registrations for this job: > old:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-7237ecbb12b0b021934b0c81aef78396] > new:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-943737c6790a3ec6870cecd652b956c2] > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.verifyClassLoader(BlobLibraryCacheManager.java:419) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.access$500(BlobLibraryCacheManager.java:359) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:235) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:202) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:336) > at > org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:612) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) > at java.base/java.lang.Thread.run(Unknown Source) {code} > If there is any other information that can help to identify the problem, > please let me know. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32212) Job restarting indefinitely after an IllegalStateException from BlobLibraryCacheManager
[ https://issues.apache.org/jira/browse/FLINK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802295#comment-17802295 ] Samuel Brotherton commented on FLINK-32212: --- We are seeing this issue as well; it breaks our pipelines whenever the JobManager pod gets redeployed due to a spot instance scale down, etc. > Job restarting indefinitely after an IllegalStateException from > BlobLibraryCacheManager > --- > > Key: FLINK-32212 > URL: https://issues.apache.org/jira/browse/FLINK-32212 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.16.1 > Environment: Apache Flink Kubernetes Operator 1.4 >Reporter: Matheus Felisberto >Priority: Major > > After running for a few hours the job starts to throw IllegalStateException > and I can't figure out why. To restore the job, I need to manually delete the > FlinkDeployment to be recreated and redeploy everything. > The jar is built-in into the docker image, hence is defined accordingly with > the Operator's documentation: > {code:java} > // jarURI: local:///opt/flink/usrlib/my-job.jar {code} > I've tried to move it into /opt/flink/lib/my-job.jar but it didn't work > either. > > {code:java} > // Source: my-topic (1/2)#30587 > (b82d2c7f9696449a2d9f4dc298c0a008_bc764cd8ddf7a0cff126f51c16239658_0_30587) > switched from DEPLOYING to FAILED with failure cause: > java.lang.IllegalStateException: The library registration references a > different set of library BLOBs than previous registrations for this job: > old:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-7237ecbb12b0b021934b0c81aef78396] > new:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-943737c6790a3ec6870cecd652b956c2] > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.verifyClassLoader(BlobLibraryCacheManager.java:419) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.access$500(BlobLibraryCacheManager.java:359) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:235) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:202) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:336) > at > org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:612) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) > at java.base/java.lang.Thread.run(Unknown Source) {code} > If there is any other information that can help to identify the problem, > please let me know. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32212) Job restarting indefinitely after an IllegalStateException from BlobLibraryCacheManager
[ https://issues.apache.org/jira/browse/FLINK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797866#comment-17797866 ] David Christle commented on FLINK-32212: We also see this issue. In one case, the logs appear to show k8s scaling down the node containing the JobManager. When it restarts, it tries to redeploy the Flink application, but endlessly retries with "The library registration references a different set of library BLOBs" error. > Job restarting indefinitely after an IllegalStateException from > BlobLibraryCacheManager > --- > > Key: FLINK-32212 > URL: https://issues.apache.org/jira/browse/FLINK-32212 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.16.1 > Environment: Apache Flink Kubernetes Operator 1.4 >Reporter: Matheus Felisberto >Priority: Major > > After running for a few hours the job starts to throw IllegalStateException > and I can't figure out why. To restore the job, I need to manually delete the > FlinkDeployment to be recreated and redeploy everything. > The jar is built-in into the docker image, hence is defined accordingly with > the Operator's documentation: > {code:java} > // jarURI: local:///opt/flink/usrlib/my-job.jar {code} > I've tried to move it into /opt/flink/lib/my-job.jar but it didn't work > either. > > {code:java} > // Source: my-topic (1/2)#30587 > (b82d2c7f9696449a2d9f4dc298c0a008_bc764cd8ddf7a0cff126f51c16239658_0_30587) > switched from DEPLOYING to FAILED with failure cause: > java.lang.IllegalStateException: The library registration references a > different set of library BLOBs than previous registrations for this job: > old:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-7237ecbb12b0b021934b0c81aef78396] > new:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-943737c6790a3ec6870cecd652b956c2] > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.verifyClassLoader(BlobLibraryCacheManager.java:419) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.access$500(BlobLibraryCacheManager.java:359) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:235) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:202) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:336) > at > org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:612) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) > at java.base/java.lang.Thread.run(Unknown Source) {code} > If there is any other information that can help to identify the problem, > please let me know. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32212) Job restarting indefinitely after an IllegalStateException from BlobLibraryCacheManager
[ https://issues.apache.org/jira/browse/FLINK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797288#comment-17797288 ] Ricky Saltzer commented on FLINK-32212: --- Hitting this same error after moving our jobs from being manually deployed in K8s to using ArgoCD. However, our failure is a bit different, as its not failing to deploy, but endlessly restarting at a random time after successfully running (e.g. 24 hours later). > Job restarting indefinitely after an IllegalStateException from > BlobLibraryCacheManager > --- > > Key: FLINK-32212 > URL: https://issues.apache.org/jira/browse/FLINK-32212 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.16.1 > Environment: Apache Flink Kubernetes Operator 1.4 >Reporter: Matheus Felisberto >Priority: Major > > After running for a few hours the job starts to throw IllegalStateException > and I can't figure out why. To restore the job, I need to manually delete the > FlinkDeployment to be recreated and redeploy everything. > The jar is built-in into the docker image, hence is defined accordingly with > the Operator's documentation: > {code:java} > // jarURI: local:///opt/flink/usrlib/my-job.jar {code} > I've tried to move it into /opt/flink/lib/my-job.jar but it didn't work > either. > > {code:java} > // Source: my-topic (1/2)#30587 > (b82d2c7f9696449a2d9f4dc298c0a008_bc764cd8ddf7a0cff126f51c16239658_0_30587) > switched from DEPLOYING to FAILED with failure cause: > java.lang.IllegalStateException: The library registration references a > different set of library BLOBs than previous registrations for this job: > old:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-7237ecbb12b0b021934b0c81aef78396] > new:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-943737c6790a3ec6870cecd652b956c2] > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.verifyClassLoader(BlobLibraryCacheManager.java:419) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.access$500(BlobLibraryCacheManager.java:359) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:235) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:202) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:336) > at > org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:612) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) > at java.base/java.lang.Thread.run(Unknown Source) {code} > If there is any other information that can help to identify the problem, > please let me know. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32212) Job restarting indefinitely after an IllegalStateException from BlobLibraryCacheManager
[ https://issues.apache.org/jira/browse/FLINK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760842#comment-17760842 ] Jan Drazil commented on FLINK-32212: The problem typically arises for me when Kubernetes decides to terminate the pod with the task manager (such as during node draining). The exception appears once the new task manager start up. I attempted to terminate other "outdated" task managers, and this resulted in the job starting. However, for an unknown reason, it is not recovered from the checkpoint. In my next attempt, I will explore Robert Metzger's approach. > Job restarting indefinitely after an IllegalStateException from > BlobLibraryCacheManager > --- > > Key: FLINK-32212 > URL: https://issues.apache.org/jira/browse/FLINK-32212 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.16.1 > Environment: Apache Flink Kubernetes Operator 1.4 >Reporter: Matheus Felisberto >Priority: Major > > After running for a few hours the job starts to throw IllegalStateException > and I can't figure out why. To restore the job, I need to manually delete the > FlinkDeployment to be recreated and redeploy everything. > The jar is built-in into the docker image, hence is defined accordingly with > the Operator's documentation: > {code:java} > // jarURI: local:///opt/flink/usrlib/my-job.jar {code} > I've tried to move it into /opt/flink/lib/my-job.jar but it didn't work > either. > > {code:java} > // Source: my-topic (1/2)#30587 > (b82d2c7f9696449a2d9f4dc298c0a008_bc764cd8ddf7a0cff126f51c16239658_0_30587) > switched from DEPLOYING to FAILED with failure cause: > java.lang.IllegalStateException: The library registration references a > different set of library BLOBs than previous registrations for this job: > old:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-7237ecbb12b0b021934b0c81aef78396] > new:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-943737c6790a3ec6870cecd652b956c2] > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.verifyClassLoader(BlobLibraryCacheManager.java:419) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.access$500(BlobLibraryCacheManager.java:359) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:235) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:202) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:336) > at > org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:612) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) > at java.base/java.lang.Thread.run(Unknown Source) {code} > If there is any other information that can help to identify the problem, > please let me know. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32212) Job restarting indefinitely after an IllegalStateException from BlobLibraryCacheManager
[ https://issues.apache.org/jira/browse/FLINK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17739982#comment-17739982 ] Robert Metzger commented on FLINK-32212: I also just ran into this situation with the Flink K8s Operator and resolved it by regenerating the JobGraph. To do so, I first scaled the Flink JobManager deployment to 0. Then, I removed the "jobGraph-yyy" key from the "xxx-cluster-config-map" ConfigMap. Next, I scaled the deployment up to 1 again, and watched the job successfully (and with state) recover from the last checkpoint. > Job restarting indefinitely after an IllegalStateException from > BlobLibraryCacheManager > --- > > Key: FLINK-32212 > URL: https://issues.apache.org/jira/browse/FLINK-32212 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.16.1 > Environment: Apache Flink Kubernetes Operator 1.4 >Reporter: Matheus Felisberto >Priority: Major > > After running for a few hours the job starts to throw IllegalStateException > and I can't figure out why. To restore the job, I need to manually delete the > FlinkDeployment to be recreated and redeploy everything. > The jar is built-in into the docker image, hence is defined accordingly with > the Operator's documentation: > {code:java} > // jarURI: local:///opt/flink/usrlib/my-job.jar {code} > I've tried to move it into /opt/flink/lib/my-job.jar but it didn't work > either. > > {code:java} > // Source: my-topic (1/2)#30587 > (b82d2c7f9696449a2d9f4dc298c0a008_bc764cd8ddf7a0cff126f51c16239658_0_30587) > switched from DEPLOYING to FAILED with failure cause: > java.lang.IllegalStateException: The library registration references a > different set of library BLOBs than previous registrations for this job: > old:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-7237ecbb12b0b021934b0c81aef78396] > new:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-943737c6790a3ec6870cecd652b956c2] > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.verifyClassLoader(BlobLibraryCacheManager.java:419) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.access$500(BlobLibraryCacheManager.java:359) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:235) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:202) > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:336) > at > org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:612) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) > at java.base/java.lang.Thread.run(Unknown Source) {code} > If there is any other information that can help to identify the problem, > please let me know. > -- This message was sent by Atlassian Jira (v8.20.10#820010)