[ https://issues.apache.org/jira/browse/FLINK-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bowen Li closed FLINK-7566.
---------------------------
    Resolution: Won't Fix

> if there's only one checkpointing metadata file in <dir>, `flink run -s <dir>` should successfully resume from that metadata file
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-7566
>                 URL: https://issues.apache.org/jira/browse/FLINK-7566
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.3.2
>            Reporter: Bowen Li
>            Assignee: Bowen Li
>            Priority: Major
>
> Currently, if we want to start a Flink job from a checkpoint, we have to run `flink run -s <dir>/checkpoint_metadata-xxxxx`, explicitly specifying the checkpoint metadata file name 'checkpoint_metadata-xxxxx'. Since the metadata file name changes on every checkpoint, it's not easy to programmatically restart a failed Flink job. The error in jobmanager.log looks like:
> {code:java}
> 2017-08-30 07:25:04,907 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job xxxx (22defcf962ff2ac2e7fe99354f5ab168) switched from state FAILING to FAILED.
> org.apache.flink.runtime.execution.SuppressRestartsException: Unrecoverable failure. This suppresses job restarts. Please check the stack trace for the root cause.
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:1396)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1372)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1372)
> 	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> 	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> 	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
> 	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.io.IOException: Cannot find meta data file in directory s3://xxxx/checkpoints. Please try to load the savepoint directly from the meta data file instead of the directory.
> 	at org.apache.flink.runtime.checkpoint.savepoint.SavepointStore.loadSavepointWithHandle(SavepointStore.java:262)
> 	at org.apache.flink.runtime.checkpoint.savepoint.SavepointLoader.loadAndValidateSavepoint(SavepointLoader.java:69)
> 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1140)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:1386)
> 	... 10 more
> {code}
> What I'd like instead: users should be able to start a Flink job by running `flink run -s <dir>` when there is exactly one checkpoint metadata file in <dir>. If there are zero or multiple metadata files, the command can fail just as it does today. That way, we can programmatically restart a failed Flink job by hardcoding <dir>.
> To achieve that, I think there are two approaches:
> 1) modify {{CheckpointCoordinator.restoreSavepoint}} to check how many metadata files are in <dir>
> 2) add another command-line option like '-sd' / '--savepointdirectory' to explicitly load a dir



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
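Approach (1) above boils down to a directory scan that succeeds only when exactly one metadata file is present. A minimal, self-contained sketch of that check follows; note that the `MetadataResolver` class, its method name, and the `checkpoint_metadata-` file-name prefix (taken from the issue description) are illustrative assumptions, not Flink's actual internal API:

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper sketching approach (1): given a checkpoint directory,
// resolve the single metadata file in it, or fail the way the CLI fails today.
public class MetadataResolver {

    public static File resolveSingleMetadataFile(File dir) throws IOException {
        // List only files matching the assumed metadata naming scheme.
        File[] candidates = dir.listFiles(
                (d, name) -> name.startsWith("checkpoint_metadata-"));
        if (candidates == null) {
            throw new IOException(dir + " is not a readable directory.");
        }
        if (candidates.length == 0) {
            throw new IOException(
                    "Cannot find meta data file in directory " + dir + ".");
        }
        if (candidates.length > 1) {
            throw new IOException("Found " + candidates.length
                    + " meta data files in " + dir
                    + "; please point -s at one of them directly.");
        }
        return candidates[0];
    }
}
```

With a check like this in place, `flink run -s <dir>` could resolve the metadata file itself when the answer is unambiguous, while the zero- and many-file cases keep today's fail-fast behavior.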