[ 
https://issues.apache.org/jira/browse/FLINK-17824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136679#comment-17136679
 ] 

Piotr Nowojski commented on FLINK-17824:
----------------------------------------

I still don't see why it is so slow, or why it has to be. But let's wait 
and see whether the increased sleep time helped, and revisit the problem only 
if it didn't.

> "Resuming Savepoint" e2e stalls indefinitely 
> ---------------------------------------------
>
>                 Key: FLINK-17824
>                 URL: https://issues.apache.org/jira/browse/FLINK-17824
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Tests
>    Affects Versions: 1.10.1, 1.11.0
>            Reporter: Robert Metzger
>            Assignee: Roman Khachatryan
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>             Fix For: 1.12.0
>
>
> CI: 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1887&view=logs&j=91bf6583-3fb2-592f-e4d4-d79d79c3230a&t=94459a52-42b6-5bfc-5d74-690b5d3c6de8
> {code}
> 2020-05-19T21:05:52.9696236Z 
> ==============================================================================
> 2020-05-19T21:05:52.9696860Z Running 'Resuming Savepoint (file, async, scale 
> down) end-to-end test'
> 2020-05-19T21:05:52.9697243Z 
> ==============================================================================
> 2020-05-19T21:05:52.9713094Z TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-52970362751
> 2020-05-19T21:05:53.1194478Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.12-SNAPSHOT-bin/flink-1.12-SNAPSHOT
> 2020-05-19T21:05:53.2180375Z Starting cluster.
> 2020-05-19T21:05:53.9986167Z Starting standalonesession daemon on host 
> fv-az558.
> 2020-05-19T21:05:55.5997224Z Starting taskexecutor daemon on host fv-az558.
> 2020-05-19T21:05:55.6223837Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-05-19T21:05:57.0552482Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-05-19T21:05:57.9446865Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-05-19T21:05:59.0098434Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-05-19T21:06:00.0569710Z Dispatcher REST endpoint is up.
> 2020-05-19T21:06:07.7099937Z Job (a92a74de8446a80403798bb4806b73f3) is 
> running.
> 2020-05-19T21:06:07.7855906Z Waiting for job to process up to 200 records, 
> current progress: 114 records ...
> 2020-05-19T21:06:55.5755111Z 
> 2020-05-19T21:06:55.5756550Z 
> ------------------------------------------------------------
> 2020-05-19T21:06:55.5757225Z  The program finished with the following 
> exception:
> 2020-05-19T21:06:55.5757566Z 
> 2020-05-19T21:06:55.5765453Z org.apache.flink.util.FlinkException: Could not 
> stop with a savepoint job "a92a74de8446a80403798bb4806b73f3".
> 2020-05-19T21:06:55.5766873Z  at 
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:485)
> 2020-05-19T21:06:55.5767980Z  at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:854)
> 2020-05-19T21:06:55.5769014Z  at 
> org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:477)
> 2020-05-19T21:06:55.5770052Z  at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:921)
> 2020-05-19T21:06:55.5771107Z  at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:982)
> 2020-05-19T21:06:55.5772223Z  at 
> org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
> 2020-05-19T21:06:55.5773325Z  at 
> org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:982)
> 2020-05-19T21:06:55.5774871Z Caused by: 
> java.util.concurrent.ExecutionException: 
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint 
> Coordinator is suspending.
> 2020-05-19T21:06:55.5777183Z  at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> 2020-05-19T21:06:55.5778884Z  at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
> 2020-05-19T21:06:55.5779920Z  at 
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:483)
> 2020-05-19T21:06:55.5781175Z  ... 6 more
> 2020-05-19T21:06:55.5782391Z Caused by: 
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint 
> Coordinator is suspending.
> 2020-05-19T21:06:55.5783885Z  at 
> org.apache.flink.runtime.scheduler.SchedulerBase.lambda$stopWithSavepoint$9(SchedulerBase.java:890)
> 2020-05-19T21:06:55.5784992Z  at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> 2020-05-19T21:06:55.5786492Z  at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> 2020-05-19T21:06:55.5787601Z  at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
> 2020-05-19T21:06:55.5788682Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
> 2020-05-19T21:06:55.5790308Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
> 2020-05-19T21:06:55.5791664Z  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> 2020-05-19T21:06:55.5792767Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> 2020-05-19T21:06:55.5793756Z  at 
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> 2020-05-19T21:06:55.5794652Z  at 
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> 2020-05-19T21:06:55.5795605Z  at 
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> 2020-05-19T21:06:55.5796551Z  at 
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> 2020-05-19T21:06:55.5797459Z  at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> 2020-05-19T21:06:55.5798390Z  at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-05-19T21:06:55.5799311Z  at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-05-19T21:06:55.5800175Z  at 
> akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> 2020-05-19T21:06:55.5801078Z  at 
> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> 2020-05-19T21:06:55.5802741Z  at 
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> 2020-05-19T21:06:55.5803579Z  at 
> akka.actor.ActorCell.invoke(ActorCell.scala:561)
> 2020-05-19T21:06:55.5804628Z  at 
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> 2020-05-19T21:06:55.5805435Z  at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> 2020-05-19T21:06:55.5806194Z  at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> 2020-05-19T21:06:55.5807037Z  at 
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 2020-05-19T21:06:55.5808001Z  at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2020-05-19T21:06:55.5808984Z  at 
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 2020-05-19T21:06:55.5809970Z  at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-05-19T21:06:55.5811188Z Caused by: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint 
> Coordinator is suspending.
> 2020-05-19T21:06:55.5813260Z  at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> 2020-05-19T21:06:55.5814556Z  at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> 2020-05-19T21:06:55.5815578Z  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
> 2020-05-19T21:06:55.5816604Z  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 2020-05-19T21:06:55.5817663Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-05-19T21:06:55.5822918Z  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-05-19T21:06:55.5824096Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$null$0(CheckpointCoordinator.java:464)
> 2020-05-19T21:06:55.5825220Z  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> 2020-05-19T21:06:55.5826274Z  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> 2020-05-19T21:06:55.5827334Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-05-19T21:06:55.5828369Z  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-05-19T21:06:55.5830735Z  at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:493)
> 2020-05-19T21:06:55.5831962Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1565)
> 2020-05-19T21:06:55.5833475Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1552)
> 2020-05-19T21:06:55.5834742Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoints(CheckpointCoordinator.java:1440)
> 2020-05-19T21:06:55.5836006Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoints(CheckpointCoordinator.java:1422)
> 2020-05-19T21:06:55.5837431Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingAndQueuedCheckpoints(CheckpointCoordinator.java:1660)
> 2020-05-19T21:06:55.5838737Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1410)
> 2020-05-19T21:06:55.5840060Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:46)
> 2020-05-19T21:06:55.5841361Z  at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatusChange(ExecutionGraph.java:1668)
> 2020-05-19T21:06:55.5842509Z  at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1250)
> 2020-05-19T21:06:55.5843916Z  at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1228)
> 2020-05-19T21:06:55.5845083Z  at 
> org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:432)
> 2020-05-19T21:06:55.5846293Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:240)
> 2020-05-19T21:06:55.5847351Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:227)
> 2020-05-19T21:06:55.5847998Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:214)
> 2020-05-19T21:06:55.5848654Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:193)
> 2020-05-19T21:06:55.5849327Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185)
> 2020-05-19T21:06:55.5850012Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179)
> 2020-05-19T21:06:55.5850701Z  at 
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503)
> 2020-05-19T21:06:55.5851473Z  at 
> org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
> 2020-05-19T21:06:55.5852381Z  at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1717)
> 2020-05-19T21:06:55.5853059Z  at 
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1268)
> 2020-05-19T21:06:55.5853663Z  at 
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1236)
> 2020-05-19T21:06:55.5854297Z  at 
> org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:954)
> 2020-05-19T21:06:55.5854938Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:173)
> 2020-05-19T21:06:55.5855620Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:165)
> 2020-05-19T21:06:55.5856296Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:732)
> 2020-05-19T21:06:55.5857025Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
> 2020-05-19T21:06:55.5857747Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149)
> 2020-05-19T21:06:55.5858408Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManagerInternal(SlotPoolImpl.java:818)
> 2020-05-19T21:06:55.5859085Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManager(SlotPoolImpl.java:777)
> 2020-05-19T21:06:55.5859806Z  at 
> org.apache.flink.runtime.jobmaster.JobMaster.disconnectTaskManager(JobMaster.java:435)
> 2020-05-19T21:06:55.5860469Z  at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1193)
> 2020-05-19T21:06:55.5861152Z  at 
> org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:109)
> 2020-05-19T21:06:55.5861751Z  at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 2020-05-19T21:06:55.5862340Z  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2020-05-19T21:06:55.5862732Z  ... 22 more
> 2020-05-19T21:06:55.5863134Z Caused by: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint 
> Coordinator is suspending.
> 2020-05-19T21:06:55.5863754Z  at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:492)
> 2020-05-19T21:06:55.5864204Z  ... 57 more
> 2020-05-19T21:06:55.5864528Z Waiting for job 
> (a92a74de8446a80403798bb4806b73f3) to reach terminal state FINISHED ...
> 2020-05-20T00:30:52.9000401Z ##[error]The operation was canceled.
> 2020-05-20T00:30:52.9019065Z ##[section]Finishing: Run e2e tests
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
