[ https://issues.apache.org/jira/browse/FLINK-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477362#comment-16477362 ]
Stephan Ewen commented on FLINK-9043:
-------------------------------------

Thank you for the discussion. To me, there are two essential things to figure out:
1. What you raised before: finding checkpoints and avoiding collisions.
2. How to handle S3 consistency/visibility issues when trying to find the latest completed checkpoint.

Concerning (1): I would like to avoid letting users configure the JobID as much as possible. These Job IDs internally help to ensure consistency between requests for the same job, and are used to avoid collisions on paths between different jobs (in ZK, HDFS, custom sinks, ...).
- If we go with the "fail if more than one JobID directory is there" option, the automatic resume will work once, and not again, because after that there may be more than one JobID directory.
- We could have an option to skip adding the JobID to the checkpoint directory, but then we need to think about preventing multiple jobs from overwriting each other's checkpoints. Is there a way we can "lock" a directory for one job, so that no other job accidentally overwrites the checkpoints in the same directory? Maybe "best effort" locking via a marker file?

Concerning (2): I actually don't have a good idea yet. The general problem is the following:
- If you list the paths in S3 (to see which directory has the highest checkpoint number), you might not see all of them yet.
- Let's say we have chk-177 and chk-179 (chk-178 was aborted/failed and hence does not exist).
- Listing the S3 bucket shows only chk-177 (listing in S3 is only eventually consistent).
- We can try to be clever and probe for successors (which might not yet be visible) by trying to open them (GET chk-178), which does not exist, and we still miss chk-179.

We could try to write a "lastCompleted" file to the checkpoint dir root, but that raises the question of what happens when the file gets written but the ZK update (latest completed checkpoint) does not happen.
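The "best effort" locking via a marker file mentioned above could be sketched roughly as follows. This is only an illustration of the idea, not an existing Flink API: the marker-file name is made up, plain {{java.nio}} stands in for Flink's FileSystem abstraction, and (as noted in the comment) atomic create-if-absent gives no real guarantee on S3.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CheckpointDirLock {

    // Hypothetical marker-file name; not part of any existing checkpoint layout.
    private static final String MARKER = ".flink-job-lock";

    /**
     * Best-effort lock: atomically create a marker file containing the JobID.
     * Returns false if another job already placed a marker in this directory.
     */
    public static boolean tryLock(Path checkpointDir, String jobId) throws IOException {
        Files.createDirectories(checkpointDir);
        try {
            // CREATE_NEW fails atomically if the file already exists
            // (reliable on local FS / HDFS, not on eventually consistent stores).
            Files.write(checkpointDir.resolve(MARKER), jobId.getBytes(),
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            return true;
        } catch (FileAlreadyExistsException e) {
            // Another job owns the directory; refuse to reuse it.
            return false;
        }
    }
}
```

A second job attempting to lock the same directory would get {{false}} and could fail fast instead of silently overwriting checkpoints.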
Also, S3 will cause problems again, because overwriting files is only eventually consistent. S3 is just a real mess for these kinds of operations.

We could say that we initially exclude S3 from this effort. In that case, "find latest checkpoint" would be something that is FS-dependent, so we need to
- either push it into the FS classes (maybe an {{ExtendedFileSystem}} interface that only some FileSystems implement),
- or keep a "whitelist" of supported FS types in the resume-checkpoint logic.

Both variants are tricky, because there are "delegating file systems" (like viewfs://) which may hide the fact that the actual underlying file system is HDFS or S3. We could again say that we do not support "viewfs://", but then I wonder whether this feature would work only for very few people. To my knowledge, S3 is used by very many Flink users.

> Introduce a friendly way to resume the job from externalized checkpoints
> automatically
> --------------------------------------------------------------------------------------
>
> Key: FLINK-9043
> URL: https://issues.apache.org/jira/browse/FLINK-9043
> Project: Flink
> Issue Type: New Feature
> Reporter: godfrey johnson
> Assignee: Sihua Zhou
> Priority: Major
>
> I know a Flink job can recover from a checkpoint with a restart strategy, but it
> cannot recover on startup the way Spark Streaming jobs can. Every time, the
> submitted Flink job is regarded as a new job, whereas a Spark Streaming job can
> detect the checkpoint directory first and then recover from the latest successful
> one. Flink can only recover after the job has failed first, and then retries with
> the strategy.
>
> So, would Flink support recovering from the checkpoint directly in a new job?
> h2. New description by [~sihuazhou]
> Currently, it is not very friendly for users to recover a job from an
> externalized checkpoint: the user needs to find the dedicated dir for the job,
> which is not an easy thing when there are too many jobs.
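The {{ExtendedFileSystem}} idea floated above could look roughly like this. The interface name comes from the comment, but everything else here is a hypothetical sketch: a capability interface that only file systems with consistent listing would implement, plus a local-FS implementation that picks the highest chk-<N> directory. Real Flink checkpoint resolution would go through Flink's own FileSystem classes, not {{java.nio}}.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

/**
 * Hypothetical capability interface: only file systems with consistent
 * listing (HDFS, local FS, ...) would implement it; S3 would not.
 */
interface ExtendedFileSystem {
    /** Latest completed chk-<N> directory under the checkpoint root, if any. */
    Optional<URI> findLatestCheckpoint(URI checkpointRoot) throws IOException;
}

/** Sketch of an implementation for a consistent, listable file system. */
class LocalExtendedFileSystem implements ExtendedFileSystem {
    @Override
    public Optional<URI> findLatestCheckpoint(URI checkpointRoot) throws IOException {
        try (Stream<Path> entries = Files.list(Paths.get(checkpointRoot))) {
            return entries
                    .filter(p -> p.getFileName().toString().startsWith("chk-"))
                    // highest checkpoint number wins; directories are named chk-<N>
                    .max(Comparator.comparingLong(p ->
                            Long.parseLong(p.getFileName().toString().substring(4))))
                    .map(Path::toUri);
        }
    }
}
```

The resume logic would then check whether the file system implements the interface and fail fast otherwise, which is exactly where viewfs://-style wrappers become a problem: the wrapper hides whether the underlying file system actually has consistent listing.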
> This ticket intends to
> introduce a more friendly way to allow the user to use the externalized
> checkpoint to do recovery.
> The implementation steps are copied from the comments of [~StephanEwen]:
> - We could make this an option where you pass a flag (-r) to automatically
> look for the latest checkpoint in a given directory.
> - If more than one job checkpointed there before, this operation would fail.
> - We might also need a way to have jobs not create the UUID subdirectory,
> otherwise scanning for the latest checkpoint would not easily work.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)