[ https://issues.apache.org/jira/browse/FLINK-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565708#comment-15565708 ]

ASF GitHub Bot commented on FLINK-4717:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2609#discussion_r82814608
  
    --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
    @@ -581,6 +581,62 @@ class JobManager(
               )
           }
     
    +    case CancelJobWithSavepoint(jobId, savepointDirectory) =>
    +      try {
    +        val targetDirectory = if (savepointDirectory != null) {
    +          savepointDirectory
    +        } else {
    +          defaultSavepointDir
    +        }
    +
    +        log.info(s"Trying to cancel job $jobId with savepoint to $targetDirectory")
    +
    +        currentJobs.get(jobId) match {
    +          case Some((executionGraph, _)) =>
    +            // We don't want any checkpoint between the savepoint and cancellation
    +            val coord = executionGraph.getCheckpointCoordinator
    +            coord.stopCheckpointScheduler()
    --- End diff --
    
    I think simply calling `stopCheckpointScheduler` is not enough. If I'm not
    mistaken, the following could happen: `stopCheckpointScheduler` tries to
    `cancel` the `currentPeriodicTrigger`. Assume that the `TimerTask` for the
    next checkpoint has already been fired by the timer, just before the
    cancel, but has not yet run. `stopCheckpointScheduler` then returns without
    that `TimerTask` having completed, so the `TimerTask` can still trigger a
    checkpoint even though we've stopped the checkpoint scheduler.
    
    The way to fix this (admittedly academic) corner case is to filter out
    outdated `TimerTask` calls in the `CheckpointCoordinator` by using some
    kind of fencing token for the trigger checkpoint calls.
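
    A minimal sketch of the fencing-token idea, written in Java because
    `TimerTask` and the `CheckpointCoordinator` are Java. The class
    `FencedCheckpointScheduler` and the helpers `currentToken` and
    `triggeredCheckpoints` are hypothetical names made up for illustration;
    this is not Flink's actual implementation. Each scheduled task captures
    the token that was valid when it was scheduled, stopping the scheduler
    bumps the token, and a trigger call carrying an old token is ignored:

```java
import java.util.Timer;
import java.util.TimerTask;

/** Hypothetical stand-in for the coordinator; not Flink's CheckpointCoordinator. */
class FencedCheckpointScheduler {
    private final Object lock = new Object();
    private final Timer timer = new Timer("checkpoint-timer", true); // daemon thread

    // All fields below are guarded by `lock`.
    private long token = 0L;                  // the fencing token
    private TimerTask currentPeriodicTrigger; // null while the scheduler is stopped
    private int triggeredCheckpoints = 0;     // stands in for real checkpoint work

    void startCheckpointScheduler(long intervalMillis) {
        synchronized (lock) {
            stopCheckpointScheduler();        // invalidates any earlier task
            final long scheduledWith = token; // token captured by the new task
            currentPeriodicTrigger = new TimerTask() {
                @Override
                public void run() {
                    triggerCheckpoint(scheduledWith);
                }
            };
            timer.scheduleAtFixedRate(currentPeriodicTrigger, intervalMillis, intervalMillis);
        }
    }

    void stopCheckpointScheduler() {
        synchronized (lock) {
            token++; // fence off tasks that already fired but have not run yet
            if (currentPeriodicTrigger != null) {
                currentPeriodicTrigger.cancel();
                currentPeriodicTrigger = null;
            }
        }
    }

    /** Returns true if a checkpoint was triggered, false if the call was stale. */
    boolean triggerCheckpoint(long callerToken) {
        synchronized (lock) {
            if (callerToken != token) {
                return false; // outdated TimerTask: scheduler was stopped meanwhile
            }
            triggeredCheckpoints++;
            return true;
        }
    }

    long currentToken() {
        synchronized (lock) {
            return token;
        }
    }

    int triggeredCheckpoints() {
        synchronized (lock) {
            return triggeredCheckpoints;
        }
    }
}
```

    Because the token check and the token increment happen under the same
    lock, a task that slipped past `cancel` either runs completely before
    `stopCheckpointScheduler` returns or is rejected afterwards.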


> Naive version of atomic stop signal with savepoint
> --------------------------------------------------
>
>                 Key: FLINK-4717
>                 URL: https://issues.apache.org/jira/browse/FLINK-4717
>             Project: Flink
>          Issue Type: New Feature
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.2.0
>            Reporter: Till Rohrmann
>            Priority: Minor
>             Fix For: 1.2.0
>
>
> As a first step towards atomic stopping with savepoints we should implement a 
> cancel command which prior to cancelling takes a savepoint. Additionally, it 
> should turn off the periodic checkpointing so that there won't be checkpoints 
> executed between the savepoint and the cancel command.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
