Ryan Skraba created FLINK-35499:
-----------------------------------
Summary: EventTimeWindowCheckpointingITCase times out due to
Checkpoint expired before completing
Key: FLINK-35499
URL: https://issues.apache.org/jira/browse/FLINK-35499
Project: Flink
Issue Type: Bug
Affects Versions: 1.20.0
Reporter: Ryan Skraba
* 1.20 AdaptiveScheduler / Test (module: tests)
https://github.com/apache/flink/actions/runs/9311892945/job/25632037990#step:10:8702
* 1.20 Default (Java 8) / Test (module: tests)
https://github.com/apache/flink/actions/runs/9275522134/job/25520829730#step:10:8264
Going into the logs, we see the following error occurs:
{code:java}
================================================================================
Test testTumblingTimeWindow[statebackend type =ROCKSDB_INCREMENTAL,
buffersPerChannel =
2](org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase) is
running.
--------------------------------------------------------------------------------
<...>
20:24:23,562 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 22 (type=CheckpointType{name='Checkpoint',
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1716927863562 for job
15d0a663cb415b09b9a68ccc40640c6d.
20:24:23,609 [jobmanager-io-thread-2] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
checkpoint 22 for job 15d0a663cb415b09b9a68ccc40640c6d (2349132 bytes,
checkpointDuration=43 ms, finalizationTime=4 ms).
20:24:23,610 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 23 (type=CheckpointType{name='Checkpoint',
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1716927863610 for job
15d0a663cb415b09b9a68ccc40640c6d.
20:24:23,620 [jobmanager-io-thread-2] WARN
org.apache.flink.runtime.jobmaster.JobMaster [] - Error while
processing AcknowledgeCheckpoint message
java.lang.IllegalStateException: Attempt to reference unknown state:
a9a90973-4ee5-384f-acef-58a7c7560920
at
org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.state.SharedStateRegistryImpl.registerReference(SharedStateRegistryImpl.java:97)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.state.SharedStateRegistry.registerReference(SharedStateRegistry.java:53)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.state.IncrementalRemoteKeyedStateHandle.registerSharedStates(IncrementalRemoteKeyedStateHandle.java:289)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedState(OperatorSubtaskState.java:243)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedStates(OperatorSubtaskState.java:226)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.checkpoint.TaskStateSnapshot.registerSharedStates(TaskStateSnapshot.java:193)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1245)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$2(ExecutionGraphHandler.java:109)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$4(ExecutionGraphHandler.java:139)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64)
~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_392]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_392]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_392]
20:24:23,663 [Source: Custom Source (1/1)#1] INFO
org.apache.flink.runtime.taskmanager.Task [] - Source:
Custom Source (1/1)#1
(bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1)
switched from RUNNING to FINISHED.
20:24:23,663 [Source: Custom Source (1/1)#1] INFO
org.apache.flink.runtime.taskmanager.Task [] - Freeing task
resources for Source: Custom Source (1/1)#1
(bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1).
20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] -
Un-registering task and sending final execution state FINISHED to JobManager
for task Source: Custom Source (1/1)#1
bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1.
20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source:
Custom Source (1/1)
(bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1)
switched from RUNNING to FINISHED.
20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - Triggering a
manual checkpoint for job 15d0a663cb415b09b9a68ccc40640c6d.
20:24:23,664 [TumblingEventTimeWindows (3/4)#1] INFO
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend [] - Closed
RocksDB State Backend. Cleaning up RocksDB working directory
/tmp/junit4600599408056781950/junit4118530363439384731/job_15d0a663cb415b09b9a68ccc40640c6d_op_WindowOperator_0a448493b4782967b150582570326227__3_4__uuid_f9b935a1-38cc-4cb3-92f0-97bf257d499d.
20:24:23,665 [TumblingEventTimeWindows (3/4)#1] INFO
org.apache.flink.runtime.taskmanager.Task [] -
TumblingEventTimeWindows (3/4)#1
(bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1)
switched from RUNNING to FINISHED.
20:24:23,665 [TumblingEventTimeWindows (3/4)#1] INFO
org.apache.flink.runtime.taskmanager.Task [] - Freeing task
resources for TumblingEventTimeWindows (3/4)#1
(bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1).
20:24:23,665 [flink-pekko.actor.default-dispatcher-8] INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] -
Un-registering task and sending final execution state FINISHED to JobManager
for task TumblingEventTimeWindows (3/4)#1
bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1.
20:24:23,665 [flink-pekko.actor.default-dispatcher-8] INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
TumblingEventTimeWindows (3/4)
(bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1)
switched from RUNNING to FINISHED.
20:24:23,665 [flink-pekko.actor.default-dispatcher-10] INFO
org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager []
- Received resource requirements from job 15d0a663cb415b09b9a68ccc40640c6d:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=3}]
20:26:01,189 [flink-pekko.actor.default-dispatcher-16] INFO
org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] -
Releasing slot [27b0503b2930b17041ffd3635dfb819f].
20:26:01,190 [flink-pekko.actor.default-dispatcher-16] INFO
org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot
TaskSlot(index:3, state:ACTIVE, resource profile:
ResourceProfile{taskHeapMemory=512.000gb (549755813888 bytes),
taskOffHeapMemory=512.000gb (549755813888 bytes), managedMemory=64.000mb
(67108864 bytes), networkMemory=32.000mb (33554432 bytes)}, allocationId:
27b0503b2930b17041ffd3635dfb819f, jobId: 15d0a663cb415b09b9a68ccc40640c6d).
20:26:01,205 [flink-pekko.actor.default-dispatcher-16] INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DefaultSlotStatusSyncer []
- Freeing slot 27b0503b2930b17041ffd3635dfb819f.
20:34:23,610 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint 23
of job 15d0a663cb415b09b9a68ccc40640c6d expired before completing.
20:34:23,611 [ Checkpoint Timer] WARN
org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to
trigger or complete checkpoint 23 for job 15d0a663cb415b09b9a68ccc40640c6d. (0
consecutive failed attempts so far)
org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint expired
before completing.
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2346)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64)
~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[?:1.8.0_392]
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[?:1.8.0_392]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
[?:1.8.0_392]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
[?:1.8.0_392]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_392]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_392]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_392]
20:34:23,611 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointRequestDecider [] - checkpoint
request time in queue: 599948
20:34:23,612 [flink-pekko.actor.default-dispatcher-130] INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - Trying to
recover from a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable
failure threshold. The latest checkpoint failed due to Checkpoint expired
before completing., view the Checkpoint History tab or the Job Manager log to
find out why continuous checkpoints failed.
at
org.apache.flink.runtime.checkpoint.CheckpointFailureManager.checkFailureAgainstCounter(CheckpointFailureManager.java:212)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:169)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:122)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2281)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2260)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$1200(CheckpointCoordinator.java:102)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2346)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64)
~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
~[?:1.8.0_392]
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
~[?:1.8.0_392]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
~[?:1.8.0_392]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
~[?:1.8.0_392]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
~[?:1.8.0_392]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
~[?:1.8.0_392]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392]
20:34:23,613 [flink-pekko.actor.default-dispatcher-130] INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job Tumbling
Window Test (15d0a663cb415b09b9a68ccc40640c6d) switched from state RUNNING to
FAILING.
org.apache.flink.runtime.JobException: Recovery is suppressed by
FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=1,
backoffTimeMS=0)
at
org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:219)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailureAndReport(ExecutionFailureHandler.java:166)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.getGlobalFailureHandlingResult(ExecutionFailureHandler.java:140)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.scheduler.DefaultScheduler.handleGlobalFailure(DefaultScheduler.java:324)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyGlobalFailure(UpdateSchedulerNgOnInternalFailuresListener.java:57)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.failGlobal(DefaultExecutionGraph.java:1092)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.executiongraph.DefaultExecutionGraph$1.lambda$failJob$0(DefaultExecutionGraph.java:477)
~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:460)
~[flink-rpc-akka44fe192f-1cb4-4526-9c09-a5d6b8d70d87.jar:1.20-SNAPSHOT]
at
org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
~[flink-rpc-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
<...>
{code}
Notably: {{org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint
tolerable failure threshold. The latest checkpoint failed due to Checkpoint
expired before completing., view the Checkpoint History tab or the Job Manager
log to find out why continuous checkpoints failed.}}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)