Just a small addition. A concurrent cancel call will interfere with the cancel-with-savepoint command and cancel the job directly. So it is better to use the cancel-with-savepoint call, which takes a savepoint and then cancels the job automatically.

Cheers,
Till

On Thu, Aug 9, 2018 at 9:53 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi Juho,
>
> We use the REST client API triggerSavepoint(). This API returns a
> CompletableFuture, and we then call its get() method.
>
> In other words, I wait synchronously for it to complete, because
> cancelWithSavepoint itself waits synchronously for the savepoint to
> complete and then executes the cancel command.
>
> We do not use the CLI. Since you are going through the CLI, I think you can
> check whether the savepoint has completed by looking at the logs or the
> web UI.
>
> Thanks, vino.
>
> Juho Autio <juho.au...@rovio.com> wrote on Thu, Aug 9, 2018 at 3:07 PM:
>
>> Thanks for the suggestion. Is the separate savepoint triggering async?
>> Would you then separately poll for the savepoint's completion before
>> executing cancel? If additional polling is needed, then I would say that
>> for my purpose it's still easier to call cancel with savepoint and simply
>> ignore the result of the call. I would assume that it won't do any harm if
>> I keep retrying cancel with savepoint until the job stops; I expect that
>> an overlapping cancel request is ignored if the job is already creating a
>> savepoint. Please correct me if my assumption is wrong.
>>
>> On Thu, Aug 9, 2018 at 5:04 AM vino yang <yanghua1...@gmail.com> wrote:
>>
>>> Hi Juho,
>>>
>>> This problem does exist. As a temporary workaround, I suggest you
>>> separate the two steps:
>>> 1) Trigger the savepoint separately;
>>> 2) Execute the cancel command.
>>>
>>> Hi Till, Chesnay:
>>>
>>> Our internal environment and multiple users on the mailing list have
>>> encountered similar problems.
>>>
>>> In our environment, it seems that the JM shows that the savepoint is
>>> complete and the JM has stopped itself, but the client still tries to
>>> connect to the old JM and reports a timeout exception.
>>>
>>> Thanks, vino.
>>>
>>> Juho Autio <juho.au...@rovio.com> wrote on Wed, Aug 8, 2018 at 9:18 PM:
>>>
>>>> I was trying to cancel a job with savepoint, but the CLI command failed
>>>> with "akka.pattern.AskTimeoutException: Ask timed out".
>>>>
>>>> The stack trace reveals that the ask timeout is 10 seconds:
>>>>
>>>> Caused by: akka.pattern.AskTimeoutException: Ask timed out on
>>>> [Actor[akka://flink/user/jobmanager_0#106635280]] after [10000 ms].
>>>> Sender[null] sent message of type
>>>> "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
>>>>
>>>> Indeed, it's documented that the default value is
>>>> akka.ask.timeout="10 s" in
>>>>
>>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#distributed-coordination-via-akka
>>>>
>>>> Behind the scenes the savepoint creation & job cancellation succeeded,
>>>> that was to be expected, kind of. So my problem is just getting a proper
>>>> response back from the CLI call instead of timing out so eagerly.
>>>>
>>>> To be exact, what I ran was:
>>>>
>>>> flink-1.5.2/bin/flink cancel b7c7d19d25e16a952d3afa32841024e5 -m yarn-cluster -yid application_1533676784032_0001 --withSavepoint
>>>>
>>>> Should I change akka.ask.timeout to have a longer timeout? If yes,
>>>> can I override it just for the CLI call somehow? Might it have
>>>> undesired side effects if set globally for the actual Flink jobs to use?
>>>>
>>>> What about akka.client.timeout? The default for it is also rather
>>>> low: "60 s". Should it also be increased accordingly if I want to accept
>>>> longer than 60 s for savepoint creation?
>>>>
>>>> Finally, that default timeout is so low that I would expect this to be
>>>> a common problem. I would say that the Flink CLI should have a higher
>>>> default timeout for cancel and savepoint creation ops.
>>>>
>>>> Thanks!
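
For reference, the two-step workaround vino describes in this thread (trigger the savepoint, block on the returned CompletableFuture, and only then cancel) could be sketched roughly as below. This is only an illustration under stated assumptions: JobControl is a hypothetical stand-in interface, not Flink's actual client class, and RecordingClient is an in-memory fake so the sketch runs without a cluster. The timeout parameter is likewise an assumption of the sketch. Note the caveat from the top of the thread: a cancel issued while a cancel-with-savepoint is in flight will cancel the job directly, so the two approaches should not be mixed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class SavepointThenCancel {

    /** Hypothetical stand-in for a cluster client; NOT Flink's real interface. */
    public interface JobControl {
        CompletableFuture<String> triggerSavepoint(String jobId, String targetDirectory);
        void cancel(String jobId) throws Exception;
    }

    /**
     * Triggers a savepoint, waits synchronously for it to finish, then cancels.
     * If the savepoint fails or times out, get() throws and cancel is never called.
     */
    public static String savepointThenCancel(
            JobControl client, String jobId, String targetDirectory, long timeoutSeconds)
            throws Exception {
        // Block until the savepoint future completes (or the timeout expires).
        String savepointPath =
                client.triggerSavepoint(jobId, targetDirectory).get(timeoutSeconds, TimeUnit.SECONDS);
        // Only cancel once the savepoint is known to be complete.
        client.cancel(jobId);
        return savepointPath;
    }

    /** In-memory fake that records call order; for illustration only. */
    public static class RecordingClient implements JobControl {
        public final List<String> calls = new ArrayList<>();

        @Override
        public CompletableFuture<String> triggerSavepoint(String jobId, String targetDirectory) {
            calls.add("savepoint");
            return CompletableFuture.completedFuture(targetDirectory + "/savepoint-" + jobId);
        }

        @Override
        public void cancel(String jobId) {
            calls.add("cancel");
        }
    }

    public static void main(String[] args) throws Exception {
        RecordingClient client = new RecordingClient();
        String path = savepointThenCancel(client, "job-1", "/tmp/savepoints", 120);
        System.out.println(path + " " + client.calls);
    }
}
```

Because the wait happens on the caller's side, the caller chooses how long to wait for the savepoint rather than depending on a single request's timeout.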