Kubernetes HA having issues when restarting job

Alex Craig Fri, 13 Oct 2023 11:33:13 -0700

My job in Kubernetes periodically fails with the following error:

2023-10-13 18:22:32,153 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error occurred in the cluster entrypoint.
java.util.concurrent.CompletionException: org.apache.flink.util.FlinkRuntimeException: Could not retrieve JobResults of globally-terminated jobs from JobResultStore
    at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) [?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: org.apache.flink.util.FlinkRuntimeException: Could not retrieve JobResults of globally-terminated jobs from JobResultStore
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResults(SessionDispatcherLeaderProcess.java:196) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResultsIfRunning(SessionDispatcherLeaderProcess.java:188) ~[flink-dist-1.17.1.jar:1.17.1]
    ... 4 more
Caused by: org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.exc.MismatchedInputException: No content to map due to end-of-input
 at [Source: (org.apache.flink.fs.azure.common.hadoop.HadoopDataInputStream); line: 1, column: 0]
    at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:59) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:4765) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4667) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3666) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.highavailability.FileSystemJobResultStore.getDirtyResultsInternal(FileSystemJobResultStore.java:208) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.withReadLock(AbstractThreadsafeJobResultStore.java:118) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.getDirtyResults(AbstractThreadsafeJobResultStore.java:100) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResults(SessionDispatcherLeaderProcess.java:194) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResultsIfRunning(SessionDispatcherLeaderProcess.java:188) ~[flink-dist-1.17.1.jar:1.17.1]
    ... 4 more


This is similar to another error reported on the mailing list, but it is 
distinct: that one was about a specific file or directory not being found, 
whereas here the read fails with "No content to map due to end-of-input" 
(https://lists.apache.org/[email protected]:2022-2:Could%20not%20retrieve%20JobResults%20of%20globally-terminated%20jobs%20from%20JobResultStore).

I have HA configured and pointed at ADLS2. The problem occurs only 
intermittently. The job starts successfully if I change anything in its 
configuration. If I change nothing, it stays down, repeatedly attempting to 
restart the job but seemingly recovering some sort of bad job state each time. 
I am on Flink 1.17.1 using native FlinkDeployments with the Kubernetes Operator.
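
The innermost exception ("No content to map due to end-of-input" at line 1, 
column 0) is what Jackson reports when it is asked to parse an empty stream, 
so one thing worth checking is whether the job result store contains a 
zero-byte entry. Below is a minimal sketch of such a check, not a confirmed 
diagnosis. It assumes the job result store directory has been copied or 
mounted to a local path and that its entries end in .json; the function name 
find_empty_entries is hypothetical and not part of Flink.

```python
from pathlib import Path


def find_empty_entries(store_dir):
    """Return zero-byte .json files under store_dir, sorted by path.

    Assumption: store_dir is a local copy (or mount) of the job result
    store directory, and its entries use a .json suffix.
    """
    root = Path(store_dir)
    return sorted(
        p for p in root.rglob("*.json")
        if p.is_file() and p.stat().st_size == 0
    )
```

Calling find_empty_entries on the local copy returns the empty files, if 
any. An empty list would suggest the failure comes from somewhere other than 
a zero-length entry.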
