[ https://issues.apache.org/jira/browse/FLINK-32348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738852#comment-17738852 ]
Jiabao Sun commented on FLINK-32348: ------------------------------------ Hi [~martijnvisser] I tried to investigate and reproduce the issue, and found that when the `AsyncCheckpointRunnable` meets `CancellationException`, the task never stops as expected. I think this problem may relate to [FLINK-25902|https://issues.apache.org/jira/browse/FLINK-25902]. The root cause of this problem remains to be further investigated. {code:sh} 00:30:26,533 [flink-akka.actor.default-dispatcher-10] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering Checkpoint 4 for job 79765d8c304b804a1adbd3677bc39708 failed due to org.apache.flink.runtime.checkpoint.CheckpointException: TaskManager received a checkpoint request for unknown task 6585a08f46e2d380ebe0ac7fde3739a7_cbc357ccb763df2852fee8c4fc7d55f2_1_1. Failure reason: Task local checkpoint failure. 00:30:26,533 [ Checkpoint Timer] WARN org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 4 for job 79765d8c304b804a1adbd3677bc39708. (0 consecutive failed attempts so far) org.apache.flink.runtime.checkpoint.CheckpointException: TaskManager received a checkpoint request for unknown task 6585a08f46e2d380ebe0ac7fde3739a7_cbc357ccb763df2852fee8c4fc7d55f2_1_1. Failure reason: Task local checkpoint failure. at org.apache.flink.runtime.taskexecutor.TaskExecutor.triggerCheckpoint(TaskExecutor.java:1025) ~[flink-runtime-1.17-SNAPSHOT.jar:1.17-SNAPSHOT] at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source) ~[?:?] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_372] at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_372] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:309) ~[?:?] at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83) ~[?:?] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:307) ~[?:?] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:222) ~[?:?] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:168) ~[?:?] at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) ~[?:?] at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) ~[?:?] at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) ~[?:?] at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) ~[?:?] at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) ~[?:?] at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175) ~[?:?] at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) ~[?:?] at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) ~[?:?] at akka.actor.Actor.aroundReceive(Actor.scala:537) ~[?:?] at akka.actor.Actor.aroundReceive$(Actor.scala:535) ~[?:?] at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) ~[?:?] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:579) ~[?:?] at akka.actor.ActorCell.invoke(ActorCell.scala:547) ~[?:?] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) ~[?:?] at akka.dispatch.Mailbox.run(Mailbox.scala:231) ~[?:?] at akka.dispatch.Mailbox.exec(Mailbox.scala:243) ~[?:?] at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[?:1.8.0_372] at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) ~[?:1.8.0_372] at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[?:1.8.0_372] at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) ~[?:1.8.0_372] 00:30:26,554 [AsyncOperations-thread-1] INFO org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - Sink: Data stream collect sink (1/1)#1 - asynchronous part of checkpoint 4 could not be completed. java.util.concurrent.CancellationException: null at java.util.concurrent.FutureTask.report(FutureTask.java:121) ~[?:1.8.0_372] at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[?:1.8.0_372] at org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:544) ~[flink-core-1.17-SNAPSHOT.jar:1.17-SNAPSHOT] at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:60) ~[flink-streaming-java-1.17-SNAPSHOT.jar:1.17-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191) ~[flink-streaming-java-1.17-SNAPSHOT.jar:1.17-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124) ~[flink-streaming-java-1.17-SNAPSHOT.jar:1.17-SNAPSHOT] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_372] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_372] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_372] 00:30:26,559 [jobmanager-io-thread-2] WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received late message for now expired checkpoint attempt 4 from task 6585a08f46e2d380ebe0ac7fde3739a7_cbc357ccb763df2852fee8c4fc7d55f2_0_1 of job 79765d8c304b804a1adbd3677bc39708 at c5c049a8-08be-4625-9431-5e9d1f75ba01 @ localhost (dataPort=41367). 00:30:26,609 [ Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 5 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1686443426609 for job 79765d8c304b804a1adbd3677bc39708. 00:30:26,899 [jobmanager-io-thread-2] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 5 for job 79765d8c304b804a1adbd3677bc39708 (2327692 bytes, checkpointDuration=290 ms, finalizationTime=0 ms). 00:30:26,906 [SourceCoordinator-Source: MongoDB-Source] INFO org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Marking checkpoint 5 as completed for source Source: MongoDB-Source. 00:30:26,906 [ Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 6 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1686443426906 for job 79765d8c304b804a1adbd3677bc39708. 00:30:27,170 [jobmanager-io-thread-1] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 6 for job 79765d8c304b804a1adbd3677bc39708 (2327692 bytes, checkpointDuration=263 ms, finalizationTime=1 ms). 00:30:27,170 [ Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointRequestDecider [] - checkpoint request time in queue: 161 00:30:27,171 [SourceCoordinator-Source: MongoDB-Source] INFO org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Marking checkpoint 6 as completed for source Source: MongoDB-Source. 00:30:27,171 [ Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 7 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1686443427170 for job 79765d8c304b804a1adbd3677bc39708. 00:30:27,424 [jobmanager-io-thread-1] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 7 for job 79765d8c304b804a1adbd3677bc39708 (2327692 bytes, checkpointDuration=254 ms, finalizationTime=0 ms). 00:30:27,427 [SourceCoordinator-Source: MongoDB-Source] INFO org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Marking checkpoint 7 as completed for source Source: MongoDB-Source. {code} > MongoDB tests are flaky and time out > ------------------------------------ > > Key: FLINK-32348 > URL: https://issues.apache.org/jira/browse/FLINK-32348 > Project: Flink > Issue Type: Bug > Components: Connectors / MongoDB > Reporter: Martijn Visser > Priority: Critical > Labels: test-stability > > https://github.com/apache/flink-connector-mongodb/actions/runs/5232649632/jobs/9447519651#step:13:39307 -- This message was sent by Atlassian Jira (v8.20.10#820010)