Good to know that you were able to fix the issue! I definitely agree that it would be good to know why this situation occurred.
On Tue., 23 July 2019 at 14:38, Richard Deurwaarder <rich...@xeli.eu> wrote:

> Hi Fabian,
>
> I followed the advice of another Flink user who emailed me directly; he has
> the same problem and told me to use something like:
>
>   rmr zgrep /flink/hunch/jobgraphs/1dccee15d84e1d2cededf89758ac2482
>
> which allowed us to start the job again.
>
> It might be nice to investigate what went wrong, as it didn't feel good to
> have our production cluster crippled like this.
>
> Richard
>
> On Tue, Jul 23, 2019 at 12:47 PM Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Richard,
>>
>> I hope you could resolve the problem in the meantime.
>>
>> Nonetheless, maybe Till (in CC) has an idea what could have gone wrong.
>>
>> Best, Fabian
>>
>> On Wed., 17 July 2019 at 19:50, Richard Deurwaarder <rich...@xeli.eu> wrote:
>>
>>> Hello,
>>>
>>> I've got a problem with our Flink cluster: the jobmanager is no longer
>>> starting up because it tries to download a non-existent (blob) file
>>> from the ZooKeeper storage dir.
>>>
>>> We're running Flink 1.8.0 on a Kubernetes cluster and use the Google
>>> Storage connector [1] to store checkpoints, savepoints, and ZooKeeper data.
>>>
>>> When I noticed the jobmanager was having problems, it was in a crash loop
>>> throwing file-not-found exceptions [2]:
>>>
>>> Caused by: java.io.FileNotFoundException: Item not found:
>>> some-project-flink-state/recovery/hunch/blob/job_e6ad857af7f09b56594e95fe273e9eff/blob_p-486d68fa98fa05665f341d79302c40566b81034e-306d493f5aa810b5f4f7d8d63f5b18b5.
>>> If you enabled STRICT generation consistency, it is possible that the live
>>> version is still available but the intended generation is deleted.
>>>
>>> I looked in the blob directory and I can only find:
>>> /recovery/hunch/blob/job_1dccee15d84e1d2cededf89758ac2482
>>>
>>> I've tried to fiddle around in ZooKeeper to see if I could find
>>> anything [3], but I do not really know what to look for.
>>>
>>> How could this have happened, and how should I recover the job from this
>>> situation?
>>>
>>> Thanks,
>>>
>>> Richard
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/connectors.html#using-hadoop-file-system-implementations
>>> [2] https://gist.github.com/Xeli/0321031655e47006f00d38fc4bc08e16
>>> [3] https://gist.github.com/Xeli/04f6d861c5478071521ac6d2c582832a
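For anyone who lands on this thread with the same symptom, the cleanup Richard describes can be sketched as a zkCli.sh session. This is a sketch, not an official recovery procedure: it assumes the default Flink HA root of `/flink` and the cluster-id `hunch` from this thread, and that the offending jobgraph znode is the one whose blobs are missing. Always inspect before deleting, and note that removing the jobgraph znode discards the recovered job, so you will need to resubmit it (ideally from a recent savepoint/checkpoint).

```shell
# Connect to the ZooKeeper ensemble used by Flink HA
# (host/port are placeholders for your cluster).
./zkCli.sh -server zookeeper:2181

# Inspect what Flink has stored before touching anything.
# Paths assume high-availability.zookeeper.path.root = /flink
# and a cluster-id of "hunch".
ls /flink/hunch/jobgraphs
get /flink/hunch/jobgraphs/1dccee15d84e1d2cededf89758ac2482

# Recursively delete the stale jobgraph node so the jobmanager
# stops trying to recover a job whose blobs no longer exist.
# On ZooKeeper 3.5+ the equivalent command is "deleteall".
rmr /flink/hunch/jobgraphs/1dccee15d84e1d2cededf89758ac2482
```

After the znode is gone, the jobmanager should come up cleanly and the job can be resubmitted; the underlying question of why the blob files disappeared from the storage directory (raised above) is still worth investigating separately.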