Ryan Skraba created FLINK-35499:
-----------------------------------

             Summary: EventTimeWindowCheckpointingITCase times out due to 
Checkpoint expired before completing
                 Key: FLINK-35499
                 URL: https://issues.apache.org/jira/browse/FLINK-35499
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.20.0
            Reporter: Ryan Skraba


* 1.20 AdaptiveScheduler / Test (module: tests) 
https://github.com/apache/flink/actions/runs/9311892945/job/25632037990#step:10:8702
* 1.20 Default (Java 8) / Test (module: tests) 
https://github.com/apache/flink/actions/runs/9275522134/job/25520829730#step:10:8264

Going into the logs, we see the following error occurs:
{code:java}
================================================================================
Test testTumblingTimeWindow[statebackend type =ROCKSDB_INCREMENTAL, 
buffersPerChannel = 
2](org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase) is 
running.
--------------------------------------------------------------------------------
<...>
20:24:23,562 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 22 (type=CheckpointType{name='Checkpoint', 
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1716927863562 for job 
15d0a663cb415b09b9a68ccc40640c6d.
20:24:23,609 [jobmanager-io-thread-2] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
checkpoint 22 for job 15d0a663cb415b09b9a68ccc40640c6d (2349132 bytes, 
checkpointDuration=43 ms, finalizationTime=4 ms).
20:24:23,610 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 23 (type=CheckpointType{name='Checkpoint', 
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1716927863610 for job 
15d0a663cb415b09b9a68ccc40640c6d.
20:24:23,620 [jobmanager-io-thread-2] WARN  
org.apache.flink.runtime.jobmaster.JobMaster                 [] - Error while 
processing AcknowledgeCheckpoint message
java.lang.IllegalStateException: Attempt to reference unknown state: 
a9a90973-4ee5-384f-acef-58a7c7560920
        at 
org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) 
~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.state.SharedStateRegistryImpl.registerReference(SharedStateRegistryImpl.java:97)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.state.SharedStateRegistry.registerReference(SharedStateRegistry.java:53)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.state.IncrementalRemoteKeyedStateHandle.registerSharedStates(IncrementalRemoteKeyedStateHandle.java:289)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedState(OperatorSubtaskState.java:243)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedStates(OperatorSubtaskState.java:226)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.TaskStateSnapshot.registerSharedStates(TaskStateSnapshot.java:193)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1245)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$2(ExecutionGraphHandler.java:109)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$4(ExecutionGraphHandler.java:139)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64) 
~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_392]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_392]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_392]
20:24:23,663 [Source: Custom Source (1/1)#1] INFO  
org.apache.flink.runtime.taskmanager.Task                    [] - Source: 
Custom Source (1/1)#1 
(bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1) 
switched from RUNNING to FINISHED.
20:24:23,663 [Source: Custom Source (1/1)#1] INFO  
org.apache.flink.runtime.taskmanager.Task                    [] - Freeing task 
resources for Source: Custom Source (1/1)#1 
(bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1).
20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - 
Un-registering task and sending final execution state FINISHED to JobManager 
for task Source: Custom Source (1/1)#1 
bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1.
20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
Custom Source (1/1) 
(bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1) 
switched from RUNNING to FINISHED.
20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO  
org.apache.flink.runtime.jobmaster.JobMaster                 [] - Triggering a 
manual checkpoint for job 15d0a663cb415b09b9a68ccc40640c6d.
20:24:23,664 [TumblingEventTimeWindows (3/4)#1] INFO  
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend [] - Closed 
RocksDB State Backend. Cleaning up RocksDB working directory 
/tmp/junit4600599408056781950/junit4118530363439384731/job_15d0a663cb415b09b9a68ccc40640c6d_op_WindowOperator_0a448493b4782967b150582570326227__3_4__uuid_f9b935a1-38cc-4cb3-92f0-97bf257d499d.
20:24:23,665 [TumblingEventTimeWindows (3/4)#1] INFO  
org.apache.flink.runtime.taskmanager.Task                    [] - 
TumblingEventTimeWindows (3/4)#1 
(bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1) 
switched from RUNNING to FINISHED.
20:24:23,665 [TumblingEventTimeWindows (3/4)#1] INFO  
org.apache.flink.runtime.taskmanager.Task                    [] - Freeing task 
resources for TumblingEventTimeWindows (3/4)#1 
(bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1).
20:24:23,665 [flink-pekko.actor.default-dispatcher-8] INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - 
Un-registering task and sending final execution state FINISHED to JobManager 
for task TumblingEventTimeWindows (3/4)#1 
bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1.
20:24:23,665 [flink-pekko.actor.default-dispatcher-8] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - 
TumblingEventTimeWindows (3/4) 
(bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1) 
switched from RUNNING to FINISHED.
20:24:23,665 [flink-pekko.actor.default-dispatcher-10] INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] 
- Received resource requirements from job 15d0a663cb415b09b9a68ccc40640c6d: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=3}]
20:26:01,189 [flink-pekko.actor.default-dispatcher-16] INFO  
org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - 
Releasing slot [27b0503b2930b17041ffd3635dfb819f].
20:26:01,190 [flink-pekko.actor.default-dispatcher-16] INFO  
org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot 
TaskSlot(index:3, state:ACTIVE, resource profile: 
ResourceProfile{taskHeapMemory=512.000gb (549755813888 bytes), 
taskOffHeapMemory=512.000gb (549755813888 bytes), managedMemory=64.000mb 
(67108864 bytes), networkMemory=32.000mb (33554432 bytes)}, allocationId: 
27b0503b2930b17041ffd3635dfb819f, jobId: 15d0a663cb415b09b9a68ccc40640c6d).
20:26:01,205 [flink-pekko.actor.default-dispatcher-16] INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DefaultSlotStatusSyncer [] 
- Freeing slot 27b0503b2930b17041ffd3635dfb819f.
20:34:23,610 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 23 
of job 15d0a663cb415b09b9a68ccc40640c6d expired before completing.
20:34:23,611 [    Checkpoint Timer] WARN  
org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to 
trigger or complete checkpoint 23 for job 15d0a663cb415b09b9a68ccc40640c6d. (0 
consecutive failed attempts so far)
org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint expired 
before completing.
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2346)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64) 
~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
[?:1.8.0_392]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
[?:1.8.0_392]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
 [?:1.8.0_392]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 [?:1.8.0_392]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_392]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_392]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_392]
20:34:23,611 [    Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointRequestDecider [] - checkpoint 
request time in queue: 599948
20:34:23,612 [flink-pekko.actor.default-dispatcher-130] INFO  
org.apache.flink.runtime.jobmaster.JobMaster                 [] - Trying to 
recover from a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable 
failure threshold. The latest checkpoint failed due to Checkpoint expired 
before completing., view the Checkpoint History tab or the Job Manager log to 
find out why continuous checkpoints failed.
        at 
org.apache.flink.runtime.checkpoint.CheckpointFailureManager.checkFailureAgainstCounter(CheckpointFailureManager.java:212)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:169)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:122)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2281)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2260)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$1200(CheckpointCoordinator.java:102)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2346)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64) 
~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
~[?:1.8.0_392]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
~[?:1.8.0_392]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
 ~[?:1.8.0_392]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 ~[?:1.8.0_392]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
~[?:1.8.0_392]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_392]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392]
20:34:23,613 [flink-pekko.actor.default-dispatcher-130] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job Tumbling 
Window Test (15d0a663cb415b09b9a68ccc40640c6d) switched from state RUNNING to 
FAILING.
org.apache.flink.runtime.JobException: Recovery is suppressed by 
FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=1, 
backoffTimeMS=0)
        at 
org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:219)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailureAndReport(ExecutionFailureHandler.java:166)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.getGlobalFailureHandlingResult(ExecutionFailureHandler.java:140)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.DefaultScheduler.handleGlobalFailure(DefaultScheduler.java:324)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyGlobalFailure(UpdateSchedulerNgOnInternalFailuresListener.java:57)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.failGlobal(DefaultExecutionGraph.java:1092)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.executiongraph.DefaultExecutionGraph$1.lambda$failJob$0(DefaultExecutionGraph.java:477)
 ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:460)
 ~[flink-rpc-akka44fe192f-1cb4-4526-9c09-a5d6b8d70d87.jar:1.20-SNAPSHOT]
        at 
org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
 ~[flink-rpc-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
<...> 
{code}
Notably: {{org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint 
tolerable failure threshold. The latest checkpoint failed due to Checkpoint 
expired before completing., view the Checkpoint History tab or the Job Manager 
log to find out why continuous checkpoints failed.}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to