Hi ChangZhuo, Thanks for reporting, it looks like a bug. I've opened a ticket for that [1].
[1] https://issues.apache.org/jira/browse/FLINK-22966 Regards, Roman On Wed, Jun 9, 2021 at 4:07 PM ChangZhuo Chen (陳昌倬) <czc...@czchen.org> wrote: > > Hi, > > We have NullPointerException when trying to restore from savepoint for > the same jar, or different jar, or different parallelism. We have > workaround this issue by changing UIDs in almost all operators. We want > to know if there is anyway to avoid this problem so that it will not > cause service maintence problem, thanks. > > > The following is redacted stack trace we can provide for now: > > 2021-06-09 13:08:59,849 WARN > org.apache.flink.client.deployment.application.DetachedApplicationRunner [] - > Could not execute application: > org.apache.flink.client.program.ProgramInvocationException: The main > method caused an error: Failed to execute job '<censored>'. > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:84) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:70) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$0(JarRunHandler.java:102) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) > [?:?] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?] > at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) > [?:?] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > Caused by: org.apache.flink.util.FlinkException: Failed to execute job > '<censored>'. > at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1970) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:135) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1834) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:801) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at <censored> ~[?:?] > at <censored> ~[?:?] > at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) ~[?:?] > at > jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > ~[?:?] > at > jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > ~[?:?] > at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?] > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:355) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > ... 12 more > Caused by: java.lang.RuntimeException: > org.apache.flink.runtime.client.JobInitializationException: Could not start > the JobMaster. > at > org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:316) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:642) > ~[?:?] > at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) > ~[?:?] > ... 1 more > Caused by: org.apache.flink.runtime.client.JobInitializationException: > Could not start the JobMaster. > at > org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) > ~[?:?] > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) > ~[?:?] > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) > ~[?:?] > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705) > ~[?:?] > ... 6 more > Caused by: java.util.concurrent.CompletionException: > java.lang.NullPointerException > at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) > ~[?:?] > at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) > ~[?:?] > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702) > ~[?:?] > ... 6 more > Caused by: java.lang.NullPointerException > at > org.apache.flink.runtime.checkpoint.StateAssignmentOperation.reAssignSubKeyedStates(StateAssignmentOperation.java:300) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.checkpoint.StateAssignmentOperation.lambda$reDistributeKeyedStates$0(StateAssignmentOperation.java:260) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at java.util.HashMap.forEach(HashMap.java:1336) ~[?:?] > at > org.apache.flink.runtime.checkpoint.StateAssignmentOperation.reDistributeKeyedStates(StateAssignmentOperation.java:252) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.checkpoint.StateAssignmentOperation.assignAttemptState(StateAssignmentOperation.java:196) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.checkpoint.StateAssignmentOperation.assignStates(StateAssignmentOperation.java:139) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedStateInternal(CheckpointCoordinator.java:1562) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1642) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.tryRestoreExecutionGraphFromSavepoint(DefaultExecutionGraphFactory.java:163) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:138) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:342) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:190) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:120) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:132) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:340) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:317) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:107) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:95) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > at > org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112) > ~[flink-dist_2.12-1.13.1.jar:1.13.1] > ... 7 more > 2021-06-09 13:08:59,852 ERROR > org.apache.flink.runtime.webmonitor.handlers.JarRunHandler [] - Exception > occurred in REST handler: Could not execute application. > > > -- > ChangZhuo Chen (陳昌倬) czchen@{czchen,debian}.org > http://czchen.info/ > Key fingerprint = BA04 346D C2E1 FE63 C790 8793 CC65 B0CD EC27 5D5B