My job in Kubernetes periodically fails with the following error:

2023-10-13 18:22:32,153 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
java.util.concurrent.CompletionException: org.apache.flink.util.FlinkRuntimeException: Could not retrieve JobResults of globally-terminated jobs from JobResultStore
    at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) [?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: org.apache.flink.util.FlinkRuntimeException: Could not retrieve JobResults of globally-terminated jobs from JobResultStore
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResults(SessionDispatcherLeaderProcess.java:196) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResultsIfRunning(SessionDispatcherLeaderProcess.java:188) ~[flink-dist-1.17.1.jar:1.17.1]
    ... 4 more
Caused by: org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.exc.MismatchedInputException: No content to map due to end-of-input
 at [Source: (org.apache.flink.fs.azure.common.hadoop.HadoopDataInputStream); line: 1, column: 0]
    at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:59) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:4765) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4667) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3666) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.highavailability.FileSystemJobResultStore.getDirtyResultsInternal(FileSystemJobResultStore.java:208) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.withReadLock(AbstractThreadsafeJobResultStore.java:118) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.getDirtyResults(AbstractThreadsafeJobResultStore.java:100) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResults(SessionDispatcherLeaderProcess.java:194) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResultsIfRunning(SessionDispatcherLeaderProcess.java:188) ~[flink-dist-1.17.1.jar:1.17.1]
    ... 4 more

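This is similar to another error reported on the mailing list
(https://lists.apache.org/[email protected]:2022-2:Could%20not%20retrieve%20JobResults%20of%20globally-terminated%20jobs%20from%20JobResultStore),
but it is distinct because it is not about a missing file or directory. Here
the innermost exception is a MismatchedInputException ("No content to map due
to end-of-input") raised while reading from the JobResultStore.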
I have HA configured and pointed at ADLS2. The problem arises only
occasionally. Once it occurs, the job can succeed if I change anything in its
configuration; if I change nothing, it stays stuck, repeatedly attempting to
restart the job while seemingly recovering some sort of bad job state. I am on
Flink 1.17.1, using native FlinkDeployments with the Kubernetes Operator.
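
Since the innermost exception reports end-of-input at line 1, column 0, the
entry being read appears to be empty. One way to look for such an entry is the
minimal sketch below. It is not part of Flink: the function name
find_empty_files is hypothetical, it assumes the job result store directory
has first been copied from ADLS2 to a local path, and it assumes the entries
are files ending in ".json".

```python
import os
from typing import List


def find_empty_files(root: str, suffix: str = ".json") -> List[str]:
    """Return sorted paths of zero-byte files under root whose names end with suffix.

    Assumption: root is a local copy of the job result store directory,
    and store entries use the given suffix.
    """
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # A zero-byte file is what a JSON parser sees as "no content".
            if name.endswith(suffix) and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```

Calling find_empty_files on the local copy returns the paths of any zero-byte
entries, which could then be compared against the job that fails to recover.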