[ 
https://issues.apache.org/jira/browse/FLINK-17824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135572#comment-17135572
 ] 

Roman Khachatryan commented on FLINK-17824:
-------------------------------------------

> Is it taking couple of minutes to process the remaining data? 
Yes. It also takes that long just to stop the job without a savepoint.

I didn't find anything abnormal, just too many records generated and buffered 
by the sources whenever there is any delay downstream.
Without the delay, I guess the source and the downstream operators compete for 
memory buffers.

> Can not we speed up the test? 
Increasing the job parameter sequence_generator_source.sleep_time from 15 to 30 
fixes the issue (locally).

Decreasing it to 5 makes the test fail every time (each source generates up to 
seven 32K buffers).

Segment size and the number of segments have a similar effect.
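
As a rough illustration of why the sleep time matters, here is a 
back-of-envelope sketch. It assumes that sleep_time is in milliseconds and 
that a source sleeps once per emitted record; neither is confirmed above, and 
the function names are made up for this example. The buffer figures are the 
ones observed with sleep_time set to 5.

```python
# Back-of-envelope only. Assumptions (not confirmed in this thread):
# sleep_time is in milliseconds and the source sleeps once per record.
BUFFER_BYTES = 32 * 1024      # "32K" buffer size mentioned above
MAX_BUFFERS_PER_SOURCE = 7    # "up to seven" buffers seen at sleep_time 5


def max_records_per_second(sleep_time_ms: int) -> float:
    """Upper bound on the emission rate of a single source."""
    return 1000.0 / sleep_time_ms


def max_buffered_bytes_per_source() -> int:
    """Data one source may have queued ahead of a slow downstream."""
    return BUFFER_BYTES * MAX_BUFFERS_PER_SOURCE


if __name__ == "__main__":
    for sleep_time_ms in (5, 15, 30):
        rate = max_records_per_second(sleep_time_ms)
        print(f"sleep_time={sleep_time_ms}: at most {rate:.1f} records/s per source")
    print(f"buffered per source: up to {max_buffered_bytes_per_source()} bytes")
```

Under these assumptions, going from 15 to 30 halves the rate at which each 
source can fill buffers, which would fit the local observation above.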

 

I suggest increasing sequence_generator_source.sleep_time and considering a 
proper fix in the future.

> "Resuming Savepoint" e2e stalls indefinitely 
> ---------------------------------------------
>
>                 Key: FLINK-17824
>                 URL: https://issues.apache.org/jira/browse/FLINK-17824
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Tests
>    Affects Versions: 1.10.1, 1.11.0
>            Reporter: Robert Metzger
>            Assignee: Roman Khachatryan
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>             Fix For: 1.12.0
>
>
> CI: 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1887&view=logs&j=91bf6583-3fb2-592f-e4d4-d79d79c3230a&t=94459a52-42b6-5bfc-5d74-690b5d3c6de8
> {code}
> 2020-05-19T21:05:52.9696236Z 
> ==============================================================================
> 2020-05-19T21:05:52.9696860Z Running 'Resuming Savepoint (file, async, scale 
> down) end-to-end test'
> 2020-05-19T21:05:52.9697243Z 
> ==============================================================================
> 2020-05-19T21:05:52.9713094Z TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-52970362751
> 2020-05-19T21:05:53.1194478Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.12-SNAPSHOT-bin/flink-1.12-SNAPSHOT
> 2020-05-19T21:05:53.2180375Z Starting cluster.
> 2020-05-19T21:05:53.9986167Z Starting standalonesession daemon on host 
> fv-az558.
> 2020-05-19T21:05:55.5997224Z Starting taskexecutor daemon on host fv-az558.
> 2020-05-19T21:05:55.6223837Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-05-19T21:05:57.0552482Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-05-19T21:05:57.9446865Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-05-19T21:05:59.0098434Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-05-19T21:06:00.0569710Z Dispatcher REST endpoint is up.
> 2020-05-19T21:06:07.7099937Z Job (a92a74de8446a80403798bb4806b73f3) is 
> running.
> 2020-05-19T21:06:07.7855906Z Waiting for job to process up to 200 records, 
> current progress: 114 records ...
> 2020-05-19T21:06:55.5755111Z 
> 2020-05-19T21:06:55.5756550Z 
> ------------------------------------------------------------
> 2020-05-19T21:06:55.5757225Z  The program finished with the following 
> exception:
> 2020-05-19T21:06:55.5757566Z 
> 2020-05-19T21:06:55.5765453Z org.apache.flink.util.FlinkException: Could not 
> stop with a savepoint job "a92a74de8446a80403798bb4806b73f3".
> 2020-05-19T21:06:55.5766873Z  at 
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:485)
> 2020-05-19T21:06:55.5767980Z  at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:854)
> 2020-05-19T21:06:55.5769014Z  at 
> org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:477)
> 2020-05-19T21:06:55.5770052Z  at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:921)
> 2020-05-19T21:06:55.5771107Z  at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:982)
> 2020-05-19T21:06:55.5772223Z  at 
> org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
> 2020-05-19T21:06:55.5773325Z  at 
> org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:982)
> 2020-05-19T21:06:55.5774871Z Caused by: 
> java.util.concurrent.ExecutionException: 
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint 
> Coordinator is suspending.
> 2020-05-19T21:06:55.5777183Z  at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> 2020-05-19T21:06:55.5778884Z  at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
> 2020-05-19T21:06:55.5779920Z  at 
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:483)
> 2020-05-19T21:06:55.5781175Z  ... 6 more
> 2020-05-19T21:06:55.5782391Z Caused by: 
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint 
> Coordinator is suspending.
> 2020-05-19T21:06:55.5783885Z  at 
> org.apache.flink.runtime.scheduler.SchedulerBase.lambda$stopWithSavepoint$9(SchedulerBase.java:890)
> 2020-05-19T21:06:55.5784992Z  at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> 2020-05-19T21:06:55.5786492Z  at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> 2020-05-19T21:06:55.5787601Z  at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
> 2020-05-19T21:06:55.5788682Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
> 2020-05-19T21:06:55.5790308Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
> 2020-05-19T21:06:55.5791664Z  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> 2020-05-19T21:06:55.5792767Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> 2020-05-19T21:06:55.5793756Z  at 
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> 2020-05-19T21:06:55.5794652Z  at 
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> 2020-05-19T21:06:55.5795605Z  at 
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> 2020-05-19T21:06:55.5796551Z  at 
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> 2020-05-19T21:06:55.5797459Z  at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> 2020-05-19T21:06:55.5798390Z  at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-05-19T21:06:55.5799311Z  at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-05-19T21:06:55.5800175Z  at 
> akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> 2020-05-19T21:06:55.5801078Z  at 
> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> 2020-05-19T21:06:55.5802741Z  at 
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> 2020-05-19T21:06:55.5803579Z  at 
> akka.actor.ActorCell.invoke(ActorCell.scala:561)
> 2020-05-19T21:06:55.5804628Z  at 
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> 2020-05-19T21:06:55.5805435Z  at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> 2020-05-19T21:06:55.5806194Z  at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> 2020-05-19T21:06:55.5807037Z  at 
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 2020-05-19T21:06:55.5808001Z  at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2020-05-19T21:06:55.5808984Z  at 
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 2020-05-19T21:06:55.5809970Z  at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-05-19T21:06:55.5811188Z Caused by: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint 
> Coordinator is suspending.
> 2020-05-19T21:06:55.5813260Z  at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> 2020-05-19T21:06:55.5814556Z  at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> 2020-05-19T21:06:55.5815578Z  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
> 2020-05-19T21:06:55.5816604Z  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 2020-05-19T21:06:55.5817663Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-05-19T21:06:55.5822918Z  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-05-19T21:06:55.5824096Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$null$0(CheckpointCoordinator.java:464)
> 2020-05-19T21:06:55.5825220Z  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> 2020-05-19T21:06:55.5826274Z  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> 2020-05-19T21:06:55.5827334Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-05-19T21:06:55.5828369Z  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-05-19T21:06:55.5830735Z  at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:493)
> 2020-05-19T21:06:55.5831962Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1565)
> 2020-05-19T21:06:55.5833475Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1552)
> 2020-05-19T21:06:55.5834742Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoints(CheckpointCoordinator.java:1440)
> 2020-05-19T21:06:55.5836006Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoints(CheckpointCoordinator.java:1422)
> 2020-05-19T21:06:55.5837431Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingAndQueuedCheckpoints(CheckpointCoordinator.java:1660)
> 2020-05-19T21:06:55.5838737Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1410)
> 2020-05-19T21:06:55.5840060Z  at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:46)
> 2020-05-19T21:06:55.5841361Z  at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatusChange(ExecutionGraph.java:1668)
> 2020-05-19T21:06:55.5842509Z  at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1250)
> 2020-05-19T21:06:55.5843916Z  at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1228)
> 2020-05-19T21:06:55.5845083Z  at 
> org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:432)
> 2020-05-19T21:06:55.5846293Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:240)
> 2020-05-19T21:06:55.5847351Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:227)
> 2020-05-19T21:06:55.5847998Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:214)
> 2020-05-19T21:06:55.5848654Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:193)
> 2020-05-19T21:06:55.5849327Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185)
> 2020-05-19T21:06:55.5850012Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179)
> 2020-05-19T21:06:55.5850701Z  at 
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503)
> 2020-05-19T21:06:55.5851473Z  at 
> org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
> 2020-05-19T21:06:55.5852381Z  at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1717)
> 2020-05-19T21:06:55.5853059Z  at 
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1268)
> 2020-05-19T21:06:55.5853663Z  at 
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1236)
> 2020-05-19T21:06:55.5854297Z  at 
> org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:954)
> 2020-05-19T21:06:55.5854938Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:173)
> 2020-05-19T21:06:55.5855620Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:165)
> 2020-05-19T21:06:55.5856296Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:732)
> 2020-05-19T21:06:55.5857025Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
> 2020-05-19T21:06:55.5857747Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149)
> 2020-05-19T21:06:55.5858408Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManagerInternal(SlotPoolImpl.java:818)
> 2020-05-19T21:06:55.5859085Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManager(SlotPoolImpl.java:777)
> 2020-05-19T21:06:55.5859806Z  at 
> org.apache.flink.runtime.jobmaster.JobMaster.disconnectTaskManager(JobMaster.java:435)
> 2020-05-19T21:06:55.5860469Z  at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1193)
> 2020-05-19T21:06:55.5861152Z  at 
> org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:109)
> 2020-05-19T21:06:55.5861751Z  at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 2020-05-19T21:06:55.5862340Z  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2020-05-19T21:06:55.5862732Z  ... 22 more
> 2020-05-19T21:06:55.5863134Z Caused by: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint 
> Coordinator is suspending.
> 2020-05-19T21:06:55.5863754Z  at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:492)
> 2020-05-19T21:06:55.5864204Z  ... 57 more
> 2020-05-19T21:06:55.5864528Z Waiting for job 
> (a92a74de8446a80403798bb4806b73f3) to reach terminal state FINISHED ...
> 2020-05-20T00:30:52.9000401Z ##[error]The operation was canceled.
> 2020-05-20T00:30:52.9019065Z ##[section]Finishing: Run e2e tests
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
