[ https://issues.apache.org/jira/browse/FLINK-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477362#comment-16477362 ]
Stephan Ewen commented on FLINK-9043:
-------------------------------------

Thank you for the discussion. To me, there are two essential things to figure out:
1. What you raised before: finding checkpoints and avoiding collisions.
2. How to handle S3 consistency/visibility issues when trying to find the latest completed checkpoint.

Concerning (1): I would like to avoid letting users configure the JobID as much as possible. These Job IDs internally help to ensure consistency between requests for the same job, and are used to avoid collisions on paths between different jobs (in ZK, HDFS, custom sinks, ...).
- If we go with the "fail if more than one JobID directory is there" option, the automatic resume will work once, and not again, because after that there may be more than one JobID directory.
- We could have an option to skip adding the JobID to the checkpoint directory, but then we need to think about preventing multiple jobs from overwriting each other's checkpoints. Is there a way we can "lock" a directory for one job, so that no other job accidentally overwrites the checkpoints in the same directory? Maybe "best effort" locking via a marker file?

Concerning (2): I actually don't have a good idea yet. The general problem is the following:
- If you list the paths in S3 (to see which directory has the highest checkpoint number), you might not see all of them yet.
- Let's say we have chk-177 and chk-179 (chk-178 was aborted/failed and hence does not exist).
- Listing the S3 bucket shows only chk-177 (listing in S3 is only eventually consistent).
- We can try to be clever and probe for successors (which might not yet be visible) by trying to open them (GET chk-178), which does not exist, and we still miss chk-179.

We could try to write a "lastCompleted" file to the checkpoint dir root, but that raises the question of what happens when the file gets written but the ZK update (latest completed checkpoint) does not happen.
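The "best effort" locking via a marker file mentioned above could be sketched roughly as follows. This is only an illustration of the idea, not an existing Flink API: the marker-file name is made up, plain {{java.nio}} stands in for Flink's FileSystem abstraction, and (as noted in the comment) atomic create-if-absent gives no real guarantee on S3.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CheckpointDirLock {

    // Hypothetical marker-file name; not part of any existing checkpoint layout.
    private static final String MARKER = ".flink-job-lock";

    /**
     * Best-effort lock: atomically create a marker file containing the JobID.
     * Returns false if another job already placed a marker in this directory.
     */
    public static boolean tryLock(Path checkpointDir, String jobId) throws IOException {
        Files.createDirectories(checkpointDir);
        try {
            // CREATE_NEW fails atomically if the file already exists
            // (reliable on local FS / HDFS, not on eventually consistent stores).
            Files.write(checkpointDir.resolve(MARKER), jobId.getBytes(),
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            return true;
        } catch (FileAlreadyExistsException e) {
            // Another job owns the directory; refuse to reuse it.
            return false;
        }
    }
}
```

A second job attempting to lock the same directory would get {{false}} and could fail fast instead of silently overwriting checkpoints.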
Also, S3 will cause problems again, because overwriting files is only eventually consistent. S3 is just a real mess for these kinds of operations.

We could say that we initially exclude S3 from this effort. In that case, "find latest checkpoint" would be something that is FS-dependent, so we need to
- either push it into the FS classes (maybe an {{ExtendedFileSystem}} interface that only some FileSystems implement),
- or keep a "whitelist" of supported FS types in the resume-checkpoint logic.

Both variants are tricky, because there are "delegating file systems" (like viewfs://) which may hide the fact that the actual underlying file system is HDFS or S3. We could again say that we do not support "viewfs://", but then I wonder whether this feature would work only for very few people. To my knowledge, S3 is used by very many Flink users.

> Introduce a friendly way to resume the job from externalized checkpoints
> automatically
> --------------------------------------------------------------------------------------
>
> Key: FLINK-9043
> URL: https://issues.apache.org/jira/browse/FLINK-9043
> Project: Flink
> Issue Type: New Feature
> Reporter: godfrey johnson
> Assignee: Sihua Zhou
> Priority: Major
>
> I know a Flink job can recover from a checkpoint with a restart strategy, but it
> cannot recover on startup the way Spark Streaming jobs can. Every time, the
> submitted Flink job is regarded as a new job, whereas a Spark Streaming job can
> detect the checkpoint directory first and then recover from the latest successful
> one. Flink can only recover after the job has failed first, and then retries with
> the strategy.
>
> So, would Flink support recovering from the checkpoint directly in a new job?
> h2. New description by [~sihuazhou]
> Currently, it is not very friendly for users to recover a job from an
> externalized checkpoint: the user needs to find the dedicated dir for the job,
> which is not an easy thing when there are too many jobs.
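The {{ExtendedFileSystem}} idea floated above could look roughly like this. The interface name comes from the comment, but everything else here is a hypothetical sketch: a capability interface that only file systems with consistent listing would implement, plus a local-FS implementation that picks the highest chk-<N> directory. Real Flink checkpoint resolution would go through Flink's own FileSystem classes, not {{java.nio}}.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

/**
 * Hypothetical capability interface: only file systems with consistent
 * listing (HDFS, local FS, ...) would implement it; S3 would not.
 */
interface ExtendedFileSystem {
    /** Latest completed chk-<N> directory under the checkpoint root, if any. */
    Optional<URI> findLatestCheckpoint(URI checkpointRoot) throws IOException;
}

/** Sketch of an implementation for a consistent, listable file system. */
class LocalExtendedFileSystem implements ExtendedFileSystem {
    @Override
    public Optional<URI> findLatestCheckpoint(URI checkpointRoot) throws IOException {
        try (Stream<Path> entries = Files.list(Paths.get(checkpointRoot))) {
            return entries
                    .filter(p -> p.getFileName().toString().startsWith("chk-"))
                    // highest checkpoint number wins; directories are named chk-<N>
                    .max(Comparator.comparingLong(p ->
                            Long.parseLong(p.getFileName().toString().substring(4))))
                    .map(Path::toUri);
        }
    }
}
```

The resume logic would then check whether the file system implements the interface and fail fast otherwise, which is exactly where viewfs://-style wrappers become a problem: the wrapper hides whether the underlying file system actually has consistent listing.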
> This ticket intends to
> introduce a more friendly way to allow the user to use the externalized
> checkpoint to do recovery.
> The implementation steps are copied from the comments of [~StephanEwen]:
> - We could make this an option where you pass a flag (-r) to automatically
> look for the latest checkpoint in a given directory.
> - If more than one job checkpointed there before, this operation would fail.
> - We might also need a way to have jobs not create the UUID subdirectory,
> otherwise scanning for the latest checkpoint would not easily work.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)