[ https://issues.apache.org/jira/browse/FLINK-17824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135897#comment-17135897 ]
Piotr Nowojski commented on FLINK-17824:
----------------------------------------

Ok, let's try to increase the sleep_time. But this parameter doesn't answer this question:
{quote}
> 6. Further "network tasks" do NOT consume EndOfPartition event for several minutes - delaying finishing job and causing test failure

Why is it taking so long? Can't we speed up the test? Is it taking a couple of minutes to process the remaining data? It sounds excessive.
{quote}
If I understand your previous message, it takes several minutes for downstream tasks to consume buffered data - data that has already been produced by the source, so this sleep time should have no effect on that. So my question remains open: why is it taking a couple of minutes to process the remaining data?

> "Resuming Savepoint" e2e stalls indefinitely
> ---------------------------------------------
>
> Key: FLINK-17824
> URL: https://issues.apache.org/jira/browse/FLINK-17824
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Tests
> Affects Versions: 1.10.1, 1.11.0
> Reporter: Robert Metzger
> Assignee: Roman Khachatryan
> Priority: Critical
> Labels: pull-request-available, test-stability
> Fix For: 1.12.0
>
>
> CI;
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1887&view=logs&j=91bf6583-3fb2-592f-e4d4-d79d79c3230a&t=94459a52-42b6-5bfc-5d74-690b5d3c6de8
> {code} > 2020-05-19T21:05:52.9696236Z > ============================================================================== > 2020-05-19T21:05:52.9696860Z Running 'Resuming Savepoint (file, async, scale > down) end-to-end test' > 2020-05-19T21:05:52.9697243Z > ============================================================================== > 2020-05-19T21:05:52.9713094Z TEST_DATA_DIR: > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-52970362751 > 2020-05-19T21:05:53.1194478Z Flink dist directory: > /home/vsts/work/1/s/flink-dist/target/flink-1.12-SNAPSHOT-bin/flink-1.12-SNAPSHOT > 
2020-05-19T21:05:53.2180375Z Starting cluster. > 2020-05-19T21:05:53.9986167Z Starting standalonesession daemon on host > fv-az558. > 2020-05-19T21:05:55.5997224Z Starting taskexecutor daemon on host fv-az558. > 2020-05-19T21:05:55.6223837Z Waiting for Dispatcher REST endpoint to come > up... > 2020-05-19T21:05:57.0552482Z Waiting for Dispatcher REST endpoint to come > up... > 2020-05-19T21:05:57.9446865Z Waiting for Dispatcher REST endpoint to come > up... > 2020-05-19T21:05:59.0098434Z Waiting for Dispatcher REST endpoint to come > up... > 2020-05-19T21:06:00.0569710Z Dispatcher REST endpoint is up. > 2020-05-19T21:06:07.7099937Z Job (a92a74de8446a80403798bb4806b73f3) is > running. > 2020-05-19T21:06:07.7855906Z Waiting for job to process up to 200 records, > current progress: 114 records ... > 2020-05-19T21:06:55.5755111Z > 2020-05-19T21:06:55.5756550Z > ------------------------------------------------------------ > 2020-05-19T21:06:55.5757225Z The program finished with the following > exception: > 2020-05-19T21:06:55.5757566Z > 2020-05-19T21:06:55.5765453Z org.apache.flink.util.FlinkException: Could not > stop with a savepoint job "a92a74de8446a80403798bb4806b73f3". 
> 2020-05-19T21:06:55.5766873Z at > org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:485) > 2020-05-19T21:06:55.5767980Z at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:854) > 2020-05-19T21:06:55.5769014Z at > org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:477) > 2020-05-19T21:06:55.5770052Z at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:921) > 2020-05-19T21:06:55.5771107Z at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:982) > 2020-05-19T21:06:55.5772223Z at > org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) > 2020-05-19T21:06:55.5773325Z at > org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:982) > 2020-05-19T21:06:55.5774871Z Caused by: > java.util.concurrent.ExecutionException: > java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: > org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint > Coordinator is suspending. > 2020-05-19T21:06:55.5777183Z at > java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) > 2020-05-19T21:06:55.5778884Z at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) > 2020-05-19T21:06:55.5779920Z at > org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:483) > 2020-05-19T21:06:55.5781175Z ... 6 more > 2020-05-19T21:06:55.5782391Z Caused by: > java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: > org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint > Coordinator is suspending. 
> 2020-05-19T21:06:55.5783885Z at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$stopWithSavepoint$9(SchedulerBase.java:890) > 2020-05-19T21:06:55.5784992Z at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) > 2020-05-19T21:06:55.5786492Z at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) > 2020-05-19T21:06:55.5787601Z at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) > 2020-05-19T21:06:55.5788682Z at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) > 2020-05-19T21:06:55.5790308Z at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) > 2020-05-19T21:06:55.5791664Z at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) > 2020-05-19T21:06:55.5792767Z at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) > 2020-05-19T21:06:55.5793756Z at > akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) > 2020-05-19T21:06:55.5794652Z at > akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) > 2020-05-19T21:06:55.5795605Z at > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) > 2020-05-19T21:06:55.5796551Z at > akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) > 2020-05-19T21:06:55.5797459Z at > scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) > 2020-05-19T21:06:55.5798390Z at > scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) > 2020-05-19T21:06:55.5799311Z at > scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) > 2020-05-19T21:06:55.5800175Z at > akka.actor.Actor$class.aroundReceive(Actor.scala:517) > 2020-05-19T21:06:55.5801078Z at > akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) > 2020-05-19T21:06:55.5802741Z at > akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) > 
2020-05-19T21:06:55.5803579Z at > akka.actor.ActorCell.invoke(ActorCell.scala:561) > 2020-05-19T21:06:55.5804628Z at > akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) > 2020-05-19T21:06:55.5805435Z at akka.dispatch.Mailbox.run(Mailbox.scala:225) > 2020-05-19T21:06:55.5806194Z at akka.dispatch.Mailbox.exec(Mailbox.scala:235) > 2020-05-19T21:06:55.5807037Z at > akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > 2020-05-19T21:06:55.5808001Z at > akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > 2020-05-19T21:06:55.5808984Z at > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > 2020-05-19T21:06:55.5809970Z at > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 2020-05-19T21:06:55.5811188Z Caused by: > java.util.concurrent.CompletionException: > org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint > Coordinator is suspending. > 2020-05-19T21:06:55.5813260Z at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) > 2020-05-19T21:06:55.5814556Z at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) > 2020-05-19T21:06:55.5815578Z at > java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) > 2020-05-19T21:06:55.5816604Z at > java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) > 2020-05-19T21:06:55.5817663Z at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > 2020-05-19T21:06:55.5822918Z at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > 2020-05-19T21:06:55.5824096Z at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$null$0(CheckpointCoordinator.java:464) > 2020-05-19T21:06:55.5825220Z at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > 2020-05-19T21:06:55.5826274Z at > 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > 2020-05-19T21:06:55.5827334Z at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > 2020-05-19T21:06:55.5828369Z at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > 2020-05-19T21:06:55.5830735Z at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:493) > 2020-05-19T21:06:55.5831962Z at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1565) > 2020-05-19T21:06:55.5833475Z at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1552) > 2020-05-19T21:06:55.5834742Z at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoints(CheckpointCoordinator.java:1440) > 2020-05-19T21:06:55.5836006Z at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoints(CheckpointCoordinator.java:1422) > 2020-05-19T21:06:55.5837431Z at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingAndQueuedCheckpoints(CheckpointCoordinator.java:1660) > 2020-05-19T21:06:55.5838737Z at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1410) > 2020-05-19T21:06:55.5840060Z at > org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:46) > 2020-05-19T21:06:55.5841361Z at > org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatusChange(ExecutionGraph.java:1668) > 2020-05-19T21:06:55.5842509Z at > org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1250) > 2020-05-19T21:06:55.5843916Z at > org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1228) > 2020-05-19T21:06:55.5845083Z at > 
org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:432) > 2020-05-19T21:06:55.5846293Z at > org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:240) > 2020-05-19T21:06:55.5847351Z at > org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:227) > 2020-05-19T21:06:55.5847998Z at > org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:214) > 2020-05-19T21:06:55.5848654Z at > org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:193) > 2020-05-19T21:06:55.5849327Z at > org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185) > 2020-05-19T21:06:55.5850012Z at > org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179) > 2020-05-19T21:06:55.5850701Z at > org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503) > 2020-05-19T21:06:55.5851473Z at > org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49) > 2020-05-19T21:06:55.5852381Z at > org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1717) > 2020-05-19T21:06:55.5853059Z at > org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1268) > 2020-05-19T21:06:55.5853663Z at > org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1236) > 2020-05-19T21:06:55.5854297Z at > org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:954) > 2020-05-19T21:06:55.5854938Z at > org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:173) > 2020-05-19T21:06:55.5855620Z at > 
org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:165) > 2020-05-19T21:06:55.5856296Z at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:732) > 2020-05-19T21:06:55.5857025Z at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537) > 2020-05-19T21:06:55.5857747Z at > org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149) > 2020-05-19T21:06:55.5858408Z at > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManagerInternal(SlotPoolImpl.java:818) > 2020-05-19T21:06:55.5859085Z at > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManager(SlotPoolImpl.java:777) > 2020-05-19T21:06:55.5859806Z at > org.apache.flink.runtime.jobmaster.JobMaster.disconnectTaskManager(JobMaster.java:435) > 2020-05-19T21:06:55.5860469Z at > org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1193) > 2020-05-19T21:06:55.5861152Z at > org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:109) > 2020-05-19T21:06:55.5861751Z at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > 2020-05-19T21:06:55.5862340Z at > java.util.concurrent.FutureTask.run(FutureTask.java:266) > 2020-05-19T21:06:55.5862732Z ... 22 more > 2020-05-19T21:06:55.5863134Z Caused by: > org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint > Coordinator is suspending. > 2020-05-19T21:06:55.5863754Z at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:492) > 2020-05-19T21:06:55.5864204Z ... 57 more > 2020-05-19T21:06:55.5864528Z Waiting for job > (a92a74de8446a80403798bb4806b73f3) to reach terminal state FINISHED ... > 2020-05-20T00:30:52.9000401Z ##[error]The operation was canceled. 
> 2020-05-20T00:30:52.9019065Z ##[section]Finishing: Run e2e tests > {code}