Re: Flink failed to resume from checkpoint stored on S3
Hi Xiaolong From the log, seems there is no `_metadata` file in the checkpoint directory s3:///flink/checkpoint_dir/65786c3307a10e79a52b4de478cfe996/chk-7853. Do you configurate the retain checkpoint configuration[1] ever? If we do not configuration it, the checkpoint will be deleted if job stopped. [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/state/checkpoints.html#retained-checkpoints Best, Congxian Xiaolong Wang 于2020年7月23日周四 上午10:03写道: > Deare community, > One of my Flink job failed yesterday, and when I tried to resume from > the latest checkpoint, following exceptions happen: > > > ``` > Log Type: jobmanager.err > > Log Upload Time: Wed Jul 22 09:04:24 + 2020 > > Log Length: 506 > > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/mnt/yarn/usercache/ec2-user/appcache/application_1591011685424_1054/filecache/10/slf4j-log4j12-1.7.15.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > > Log Type: jobmanager.log > > Log Upload Time: Wed Jul 22 09:04:24 + 2020 > > Log Length: 65177 > > Showing 4096 bytes of 65177 total. Click here for the full log. > > SchedulerBase.(SchedulerBase.java:215) > at > org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:120) > at > org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:105) > at > org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278) > at org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:266) > at > org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98) > at > org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40) > at > org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.(JobManagerRunnerImpl.java:146) > ... 10 more > 2020-07-22 09:04:22,766 ERROR > org.apache.flink.runtime.rest.handler.job.JobExecutionResultHandler - > Unhandled exception. > java.lang.RuntimeException: > org.apache.flink.runtime.client.JobExecutionException: Could not set up > JobManager > at > org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36) > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) > at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: org.apache.flink.runtime.client.JobExecutionException: Could > not set up JobManager > at > org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.(JobManagerRunnerImpl.java:152) > at > org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:84) > at > org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:379) > at > org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34) > ... 7 more > Caused by: java.io.FileNotFoundException: Cannot find meta data file > '_metadata' in directory > 's3:///flink/checkpoint_dir/65786c3307a10e79a52b4de478cfe996/chk-7853'. > Please try to load the checkpoint/savepoint directly from the metadata file > instead of the directory. > at > org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:258) > at > org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpoint(AbstractFsCheckpointStorage.java:110) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1152) > at > org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:306) > at > org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:239) > at > org.apache.flink.runtime.scheduler.SchedulerBase.(SchedulerBase.java:215) > at > org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:120) > at > org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:105) > at > org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278) > at
Flink failed to resume from checkpoint stored on S3
Deare community, One of my Flink job failed yesterday, and when I tried to resume from the latest checkpoint, following exceptions happen: ``` Log Type: jobmanager.err Log Upload Time: Wed Jul 22 09:04:24 + 2020 Log Length: 506 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/mnt/yarn/usercache/ec2-user/appcache/application_1591011685424_1054/filecache/10/slf4j-log4j12-1.7.15.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] Log Type: jobmanager.log Log Upload Time: Wed Jul 22 09:04:24 + 2020 Log Length: 65177 Showing 4096 bytes of 65177 total. Click here for the full log. SchedulerBase.(SchedulerBase.java:215) at org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:120) at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:105) at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278) at org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:266) at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98) at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40) at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.(JobManagerRunnerImpl.java:146) ... 10 more 2020-07-22 09:04:22,766 ERROR org.apache.flink.runtime.rest.handler.job.JobExecutionResultHandler - Unhandled exception. java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36) at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.(JobManagerRunnerImpl.java:152) at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:84) at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:379) at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34) ... 7 more Caused by: java.io.FileNotFoundException: Cannot find meta data file '_metadata' in directory 's3:///flink/checkpoint_dir/65786c3307a10e79a52b4de478cfe996/chk-7853'. Please try to load the checkpoint/savepoint directly from the metadata file instead of the directory. at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:258) at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpoint(AbstractFsCheckpointStorage.java:110) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1152) at org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:306) at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:239) at org.apache.flink.runtime.scheduler.SchedulerBase.(SchedulerBase.java:215) at org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:120) at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:105) at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278) at org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:266) at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98) at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40) at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.(JobManagerRunnerImpl.java:146) ... 10 more 2020-07-22 09:04:22,771 INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:34683 Log Type: jobmanager.out Log Upload Time: Wed Jul 22 09:04:24 + 2020 Log Length: 0 ```