Hey folks,
We experienced a pipeline failure in which our JobManager restarted and, for some reason, we were unable to restore from our last successful checkpoint. Checkpoints had been completing regularly every 10 minutes up to this failure, with 0 failed checkpoints logged. We are using Flink version 1.17.1. Can anyone shed light on what might have happened?

Here's the error from our logs:

Message: FATAL: Thread 'Checkpoint Timer' produced an uncaught exception. Stopping the process...
extendedStackTrace: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.NullPointerException
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:669) ~[a-pipeline-name.jar:1.0]
    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986) ~[?:?]
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) ~[?:?]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?]
    at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) [?:?]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910) [?:?]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) [?:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.CompletionException: java.lang.NullPointerException
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) ~[?:?]
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) ~[?:?]
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932) ~[?:?]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[?:?]
    ... 7 more
Caused by: java.lang.NullPointerException
    at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:399) ~[a-pipeline-name.jar:1.0]
    at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
    at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085) ~[?:?]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:947) ~[a-pipeline-name.jar:1.0]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:923) ~[a-pipeline-name.jar:1.0]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:655) ~[a-pipeline-name.jar:1.0]
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) ~[?:?]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[?:?]
    ... 7 more