[
https://issues.apache.org/jira/browse/FLINK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17936547#comment-17936547
]
Max Feng commented on FLINK-37483:
----------------------------------
Here's the trace of the failure. I'll need to reproduce it again to get full
logs from the attempt.
{code:java}
Job 00000000000000000000000000000000 reached terminal state FAILED.
org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
	at org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97)
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1773)
	at org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:67)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: There is no operator for the state ad8761465be643c10db5fae153b87f68
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1770)
	... 4 more
Caused by: java.lang.IllegalStateException: There is no operator for the state ad8761465be643c10db5fae153b87f68
	at org.apache.flink.runtime.checkpoint.StateAssignmentOperation.checkStateMappingCompleteness(StateAssignmentOperation.java:769)
	at org.apache.flink.runtime.checkpoint.StateAssignmentOperation.assignStates(StateAssignmentOperation.java:101)
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedStateInternal(CheckpointCoordinator.java:1829)
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreInitialCheckpointIfPresent(CheckpointCoordinator.java:1749)
	at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:210)
	at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:382)
	at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:225)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:142)
	at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:162)
	at org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:121)
	at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:406)
	at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:383)
	at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:128)
	at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:100)
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
	... 4 more
{code}
> Native kubernetes clusters losing checkpoint state on FAILED
> ------------------------------------------------------------
>
> Key: FLINK-37483
> URL: https://issues.apache.org/jira/browse/FLINK-37483
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.20.1
> Reporter: Max Feng
> Priority: Major
>
> We're running Flink 1.20 application-mode clusters on native Kubernetes, and
> we're running into an issue where clusters restart without restoring the
> checkpoints referenced in their HA ConfigMaps.
> To the best of our understanding, here's what's happening:
> 1) We're running application-mode clusters on native Kubernetes with
> externalized checkpoints, retained on cancellation. We're attempting to
> restore a job from a checkpoint; the checkpoint reference is held in the
> Kubernetes HA ConfigMap.
> 2) The jobmanager encounters an issue during startup, and the job goes to
> state FAILED.
> 3) The HA ConfigMap containing the checkpoint reference is cleaned up.
> 4) The jobmanager pod exits. Because it is managed by a Kubernetes
> Deployment, the pod is immediately restarted.
> 5) Upon restart, the new jobmanager finds no checkpoint to restore from.
> We think this is a bad combination of the following behaviors:
> * FAILED triggers cleanup, which removes the HA ConfigMaps in native
> Kubernetes mode
> * FAILED does not actually stop the job in native Kubernetes mode; instead,
> the jobmanager pod is immediately restarted and the job is retried
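> For context, here is a minimal sketch of the configuration involved. The key
> names come from the Flink 1.20 documentation; the storage paths are
> illustrative placeholders, not our actual values:
> {code:yaml}
> # Keep externalized checkpoints around when the job is cancelled
> execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
> state.checkpoints.dir: s3://example-bucket/checkpoints
> # Kubernetes HA: the latest checkpoint reference is tracked in a ConfigMap
> high-availability.type: kubernetes
> high-availability.storageDir: s3://example-bucket/ha
> {code}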
--
This message was sent by Atlassian Jira
(v8.20.10#820010)