Ryan Skraba created FLINK-35499: ----------------------------------- Summary: EventTimeWindowCheckpointingITCase times out due to Checkpoint expired before completing Key: FLINK-35499 URL: https://issues.apache.org/jira/browse/FLINK-35499 Project: Flink Issue Type: Bug Affects Versions: 1.20.0 Reporter: Ryan Skraba
* 1.20 AdaptiveScheduler / Test (module: tests) https://github.com/apache/flink/actions/runs/9311892945/job/25632037990#step:10:8702 * 1.20 Default (Java 8) / Test (module: tests) https://github.com/apache/flink/actions/runs/9275522134/job/25520829730#step:10:8264 Going into the logs, we see the following error occurs: {code:java} ================================================================================ Test testTumblingTimeWindow[statebackend type =ROCKSDB_INCREMENTAL, buffersPerChannel = 2](org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase) is running. -------------------------------------------------------------------------------- <...> 20:24:23,562 [ Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 22 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1716927863562 for job 15d0a663cb415b09b9a68ccc40640c6d. 20:24:23,609 [jobmanager-io-thread-2] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 22 for job 15d0a663cb415b09b9a68ccc40640c6d (2349132 bytes, checkpointDuration=43 ms, finalizationTime=4 ms). 20:24:23,610 [ Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 23 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1716927863610 for job 15d0a663cb415b09b9a68ccc40640c6d. 20:24:23,620 [jobmanager-io-thread-2] WARN org.apache.flink.runtime.jobmaster.JobMaster [] - Error while processing AcknowledgeCheckpoint message java.lang.IllegalStateException: Attempt to reference unknown state: a9a90973-4ee5-384f-acef-58a7c7560920 at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.state.SharedStateRegistryImpl.registerReference(SharedStateRegistryImpl.java:97) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.state.SharedStateRegistry.registerReference(SharedStateRegistry.java:53) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.state.IncrementalRemoteKeyedStateHandle.registerSharedStates(IncrementalRemoteKeyedStateHandle.java:289) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedState(OperatorSubtaskState.java:243) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedStates(OperatorSubtaskState.java:226) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.checkpoint.TaskStateSnapshot.registerSharedStates(TaskStateSnapshot.java:193) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1245) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$2(ExecutionGraphHandler.java:109) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$4(ExecutionGraphHandler.java:139) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64) ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_392] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_392] at java.lang.Thread.run(Thread.java:750) [?:1.8.0_392] 20:24:23,663 [Source: Custom Source (1/1)#1] INFO org.apache.flink.runtime.taskmanager.Task [] - Source: Custom Source (1/1)#1 (bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1) switched from RUNNING to FINISHED. 20:24:23,663 [Source: Custom Source (1/1)#1] INFO org.apache.flink.runtime.taskmanager.Task [] - Freeing task resources for Source: Custom Source (1/1)#1 (bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1). 20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Un-registering task and sending final execution state FINISHED to JobManager for task Source: Custom Source (1/1)#1 bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1. 20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source (1/1) (bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1) switched from RUNNING to FINISHED. 20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Triggering a manual checkpoint for job 15d0a663cb415b09b9a68ccc40640c6d. 20:24:23,664 [TumblingEventTimeWindows (3/4)#1] INFO org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend [] - Closed RocksDB State Backend. Cleaning up RocksDB working directory /tmp/junit4600599408056781950/junit4118530363439384731/job_15d0a663cb415b09b9a68ccc40640c6d_op_WindowOperator_0a448493b4782967b150582570326227__3_4__uuid_f9b935a1-38cc-4cb3-92f0-97bf257d499d. 20:24:23,665 [TumblingEventTimeWindows (3/4)#1] INFO org.apache.flink.runtime.taskmanager.Task [] - TumblingEventTimeWindows (3/4)#1 (bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1) switched from RUNNING to FINISHED. 20:24:23,665 [TumblingEventTimeWindows (3/4)#1] INFO org.apache.flink.runtime.taskmanager.Task [] - Freeing task resources for TumblingEventTimeWindows (3/4)#1 (bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1). 20:24:23,665 [flink-pekko.actor.default-dispatcher-8] INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Un-registering task and sending final execution state FINISHED to JobManager for task TumblingEventTimeWindows (3/4)#1 bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1. 20:24:23,665 [flink-pekko.actor.default-dispatcher-8] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - TumblingEventTimeWindows (3/4) (bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1) switched from RUNNING to FINISHED. 20:24:23,665 [flink-pekko.actor.default-dispatcher-10] INFO org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] - Received resource requirements from job 15d0a663cb415b09b9a68ccc40640c6d: [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, numberOfRequiredSlots=3}] 20:26:01,189 [flink-pekko.actor.default-dispatcher-16] INFO org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [27b0503b2930b17041ffd3635dfb819f]. 20:26:01,190 [flink-pekko.actor.default-dispatcher-16] INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:3, state:ACTIVE, resource profile: ResourceProfile{taskHeapMemory=512.000gb (549755813888 bytes), taskOffHeapMemory=512.000gb (549755813888 bytes), managedMemory=64.000mb (67108864 bytes), networkMemory=32.000mb (33554432 bytes)}, allocationId: 27b0503b2930b17041ffd3635dfb819f, jobId: 15d0a663cb415b09b9a68ccc40640c6d). 20:26:01,205 [flink-pekko.actor.default-dispatcher-16] INFO org.apache.flink.runtime.resourcemanager.slotmanager.DefaultSlotStatusSyncer [] - Freeing slot 27b0503b2930b17041ffd3635dfb819f. 20:34:23,610 [ Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint 23 of job 15d0a663cb415b09b9a68ccc40640c6d expired before completing. 20:34:23,611 [ Checkpoint Timer] WARN org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 23 for job 15d0a663cb415b09b9a68ccc40640c6d. (0 consecutive failed attempts so far) org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint expired before completing. at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2346) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64) ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_392] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_392] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_392] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_392] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_392] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_392] at java.lang.Thread.run(Thread.java:750) [?:1.8.0_392] 20:34:23,611 [ Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointRequestDecider [] - checkpoint request time in queue: 599948 20:34:23,612 [flink-pekko.actor.default-dispatcher-130] INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Trying to recover from a global failure. org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold. The latest checkpoint failed due to Checkpoint expired before completing., view the Checkpoint History tab or the Job Manager log to find out why continuous checkpoints failed. at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.checkFailureAgainstCounter(CheckpointFailureManager.java:212) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:169) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:122) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2281) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2260) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$1200(CheckpointCoordinator.java:102) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2346) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64) ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_392] at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_392] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_392] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_392] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_392] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_392] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392] 20:34:23,613 [flink-pekko.actor.default-dispatcher-130] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job Tumbling Window Test (15d0a663cb415b09b9a68ccc40640c6d) switched from state RUNNING to FAILING. org.apache.flink.runtime.JobException: Recovery is suppressed by FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=1, backoffTimeMS=0) at org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:219) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailureAndReport(ExecutionFailureHandler.java:166) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.getGlobalFailureHandlingResult(ExecutionFailureHandler.java:140) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.scheduler.DefaultScheduler.handleGlobalFailure(DefaultScheduler.java:324) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyGlobalFailure(UpdateSchedulerNgOnInternalFailuresListener.java:57) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.failGlobal(DefaultExecutionGraph.java:1092) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph$1.lambda$failJob$0(DefaultExecutionGraph.java:477) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:460) ~[flink-rpc-akka44fe192f-1cb4-4526-9c09-a5d6b8d70d87.jar:1.20-SNAPSHOT] at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-rpc-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] <...> {code} Notably: {{org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold. The latest checkpoint failed due to Checkpoint expired before completing., view the Checkpoint History tab or the Job Manager log to find out why continuous checkpoints failed.}} -- This message was sent by Atlassian Jira (v8.20.10#820010)