[jira] [Updated] (FLINK-22014) Flink JobManager failed to restart after failure in kubernetes HA setup

Till Rohrmann (Jira) Tue, 30 Mar 2021 03:07:04 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Till Rohrmann updated FLINK-22014:
----------------------------------
    Description: 
After the JobManager pod failed and a new one started, the new pod could not 
recover the jobs because the recovery data was missing from storage: the 
ConfigMap pointed to a file that did not exist.
  
 As a result, the JobManager pod entered the `CrashLoopBackOff` state and 
could not recover: each attempt failed with the same error, so the whole 
cluster became unrecoverable and stopped operating.
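
The condition described above (a stored pointer whose target file is gone) can be illustrated with a minimal sketch. It is not Flink code: the class `DanglingPointerCheck`, the method `findDangling`, and the map from job id to file path are hypothetical stand-ins for the ConfigMap entries and the files they reference, and local paths are used in place of S3 objects.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of an HA store: job id -> location of the stored job graph. */
class DanglingPointerCheck {

    /** Returns the job ids whose pointer refers to a file that does not exist. */
    static List<String> findDangling(Map<String, Path> pointers) {
        List<String> dangling = new ArrayList<>();
        for (Map.Entry<String, Path> entry : pointers.entrySet()) {
            if (!Files.exists(entry.getValue())) {
                dangling.add(entry.getKey());
            }
        }
        return dangling;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("recovery");
        Path present = Files.createFile(dir.resolve("jobGraph-present"));
        Path missing = dir.resolve("jobGraph-missing"); // never created on purpose

        Map<String, Path> pointers = new LinkedHashMap<>();
        pointers.put("job-a", present);
        pointers.put("job-b", missing);

        // Retrying cannot help: every attempt reads the same stale pointer.
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + ": dangling = " + findDangling(pointers));
        }
    }
}
```

Because the pointer itself is never repaired between attempts, each restart in this sketch sees the same missing file, which matches the repeated failure reported here.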
  
 I had to manually delete the ConfigMap and start the jobs again without a 
savepoint.
  
 When I later tried to reproduce the failure by manually deleting the 
JobManager pod, the new pod recovered correctly every time, so I could no 
longer reproduce the issue artificially.
  
 Below is the failure log:
{code:java}
2021-03-26 08:22:57,925 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Starting the SlotManager.
2021-03-26 08:22:57,928 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='stellar-flink-cluster-dispatcher-leader'}.
2021-03-26 08:22:57,931 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Retrieved job ids [198c46bac791e73ebcc565a550fa4ff6, 344f5ebc1b5c3a566b4b2837813e4940, 96c4603a0822d10884f7fe536703d811, d9ded24224aab7c7041420b3efc1b6ba] from KubernetesStateHandleStore{configMapName='stellar-flink-cluster-dispatcher-leader'}
2021-03-26 08:22:57,933 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Trying to recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
2021-03-26 08:22:58,029 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
2021-03-26 08:28:22,677 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Stopping DefaultJobGraphStore.
2021-03-26 08:28:22,681 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
java.util.concurrent.CompletionException: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
   at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
   at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) [?:?]
   at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) [?:?]
   at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
   at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
   at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:144) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   ... 4 more
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under jobGraph-198c46bac791e73ebcc565a550fa4ff6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
   at org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:171) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:141) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   ... 4 more
Caused by: java.io.FileNotFoundException: No such file or directory: s3a://XXX-flink-state-eu-central-1-live/recovery/YYY-flink-cluster/submittedJobGraph6797768d0737
   at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255) ~[?:?]
   at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149) ~[?:?]
   at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088) ~[?:?]
   at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:699) ~[?:?]
   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950) ~[?:?]
   at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:131) ~[?:?]
   at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:37) ~[?:?]
   at org.apache.flink.core.fs.PluginFileSystemFactory$ClassLoaderFixingFileSystem.open(PluginFileSystemFactory.java:125) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:66) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:58) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   at org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:162) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:141) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
   ... 4 more
{code}
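
The log above nests three "Caused by" levels, and the innermost one (the `FileNotFoundException`) is the actual trigger. As a reading aid, the following sketch walks such a cause chain with the standard `Throwable.getCause()` method. The class `RootCause` and the method `rootCauseOf` are hypothetical names, and plain `RuntimeException` and `Exception` stand in for the Flink exception types so that the example needs only the JDK.

```java
import java.io.FileNotFoundException;
import java.util.concurrent.CompletionException;

/** Hypothetical helper for locating the innermost cause of a nested exception. */
class RootCause {

    /** Follows getCause() until the innermost throwable is reached. */
    static Throwable rootCauseOf(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins for the chain in the log: the Flink exception types are
        // replaced by plain JDK exceptions, and the messages are illustrative.
        Throwable missingFile = new FileNotFoundException("No such file or directory");
        Throwable brokenHandle = new Exception("Could not retrieve submitted JobGraph", missingFile);
        Throwable recoveryFailed = new RuntimeException("Could not recover job", brokenHandle);
        Throwable fatal = new CompletionException(recoveryFailed);

        Throwable root = rootCauseOf(fatal);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```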



> Flink JobManager failed to restart after failure in kubernetes HA setup
> -----------------------------------------------------------------------
>
>                 Key: FLINK-22014
>                 URL: https://issues.apache.org/jira/browse/FLINK-22014
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.12.2
>            Reporter: Mikalai Lushchytski
>            Priority: Major
>              Labels: k8s-ha



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
