[ https://issues.apache.org/jira/browse/FLINK-10671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674975#comment-16674975 ]
Gary Yao edited comment on FLINK-10671 at 11/5/18 11:05 AM: ------------------------------------------------------------ [~cre...@gmail.com], could it be that you are using 1.6.0? This looks like a duplicate of FLINK-10193, which got fixed in 1.6.1. I am unable to reproduce the problem in 1.6.1 using the {{SlowToCheckpoint}} job, which is attached to FLINK-10193. The job artificially slows down the {{snapshotState}} callback by a configurable amount of time. flink-conf.yaml {noformat} akka.ask.timeout: 20 s {noformat} Commands to submit and take savepoint: {code} bin/flink run -d -c com.github.uce.SlowToCheckpoint testjob.jar --checkpointDuration 45000 bin/flink cancel -s file:///tmp/flink-savepoints <job-id> {code} If your version is indeed 1.6.1, please add more information on how to reproduce the problem (preferably using the {{SlowToCheckpoint}} job). was (Author: gjy): [~cre...@gmail.com], could it be that you are using 1.6.0? This looks like a duplicate of FLINK-10193, which got fixed in 1.6.1. I am unable to reproduce the problem in 1.6.1 using the {{SlowToCheckpoint}} job that is attached to FLINK-10193. The job artificially slows down the {{snapshotState}} callback by a configurable amount of time. flink-conf.yaml {noformat} akka.ask.timeout: 20 s {noformat} Commands to submit and take savepoint: {code} bin/flink run -d -c com.github.uce.SlowToCheckpoint testjob.jar --checkpointDuration 45000 bin/flink cancel -s file:///tmp/flink-savepoints <job-id> {code} If your version is indeed 1.6.1, please add more information on how to reproduce the problem (preferably using the {{SlowToCheckpoint}} job). > rest monitoring api Savepoint status call fails if akka.ask.timeout < > checkpoint duration > ----------------------------------------------------------------------------------------- > > Key: FLINK-10671 > URL: https://issues.apache.org/jira/browse/FLINK-10671 > Project: Flink > Issue Type: Bug > Components: REST > Affects Versions: 1.6.1 > Reporter: Cliff Resnick > Assignee: Gary Yao > Priority: Minor > > Hi, > > There seems to be a problem with REST monitoring API: > |/jobs/:jobid/savepoints/:triggerid| > > The problem is that when the Savepoint represented by :triggerid is called > with `cancel=true` the above status call seems to fail if the savepoint > duration exceeds `akka.ask.timeout` value. > > Below is a log in which I invoke "cancel with savepoint" then poll the above > endpoint for status at 2 second intervals. akka.ask.timout is set for twenty > seconds. The error is repeatable at various values of akka.ask.timeout. > > 2018/10/24 19:42:25 savepoint id 925964b35b2d501f4a45b714eca0a2ca is > IN_PROGRESS > 2018/10/24 19:42:27 savepoint id 925964b35b2d501f4a45b714eca0a2ca is > IN_PROGRESS > 2018/10/24 19:42:29 savepoint id 925964b35b2d501f4a45b714eca0a2ca is > IN_PROGRESS > 2018/10/24 19:42:31 savepoint id 925964b35b2d501f4a45b714eca0a2ca is > IN_PROGRESS > 2018/10/24 19:42:33 savepoint id 925964b35b2d501f4a45b714eca0a2ca is > IN_PROGRESS > 2018/10/24 19:42:35 savepoint id 925964b35b2d501f4a45b714eca0a2ca is > IN_PROGRESS > 2018/10/24 19:42:37 savepoint id 925964b35b2d501f4a45b714eca0a2ca is > IN_PROGRESS > 2018/10/24 19:42:39 savepoint id 925964b35b2d501f4a45b714eca0a2ca is > IN_PROGRESS > 2018/10/24 19:42:41 savepoint id 925964b35b2d501f4a45b714eca0a2ca is > IN_PROGRESS > 2018/10/24 19:42:43 savepoint id 925964b35b2d501f4a45b714eca0a2ca is > IN_PROGRESS > 2018/10/24 19:42:45 Cancel with Savepoint may have failed: > java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: > Ask timed out on [Actor[akka://flink/user/jobmanager_0#-234856817]] after > [20000 ms]. Sender[null] sent message of type > "org.apache.flink.runtime.rpc.messages.LocalFencedMessage". > at > java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326) > at > java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338) > at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911) > at > java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) > at > org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770) > at akka.dispatch.OnComplete.internal(Future.scala:258) > at akka.dispatch.OnComplete.internal(Future.scala:256) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) > at > org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83) > at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) > at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) > at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603) > at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126) > at > scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) > at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) > at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) > at > akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329) > at > akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280) > at > akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284) > at > akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236) > at java.lang.Thread.run(Thread.java:748) > Caused by: akka.pattern.AskTimeoutException: Ask timed out on > [Actor[akka://flink/user/jobmanager_0#-234856817]] after [20000 ms]. > Sender[null] sent message of type > "org.apache.flink.runtime.rpc.messages.LocalFencedMessage". > at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604) > ... 9 more -- This message was sent by Atlassian JIRA (v7.6.3#76005)