Good to know that you were able to fix the issue!

I definitely agree that it would be good to know why this situation
occurred.
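For anyone who hits this later: the ZooKeeper-side cleanup Richard describes below can be sketched roughly like this. The HA root path (/flink/hunch) and the job ID are specific to his setup, and the zkCli host/port are placeholders; adjust both for your own cluster.

```shell
# Open a ZooKeeper CLI session against the HA quorum
# (host and port are placeholders for your own ensemble).
./bin/zkCli.sh -server zookeeper:2181

# Inspect the job graphs Flink has registered under its HA root.
ls /flink/hunch/jobgraphs

# Recursively delete the stale job graph node so the jobmanager stops
# trying to recover the job whose blobs are gone. (rmr is deprecated in
# newer ZooKeeper releases in favor of deleteall, but it works on the
# ZooKeeper versions typically paired with Flink 1.8.)
rmr /flink/hunch/jobgraphs/1dccee15d84e1d2cededf89758ac2482
```

Be aware this discards the HA recovery state for that job, so you would then resubmit it yourself, ideally from the latest savepoint or retained checkpoint.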

On Tue, Jul 23, 2019 at 14:38, Richard Deurwaarder <
rich...@xeli.eu> wrote:

> Hi Fabian,
>
> I followed the advice of another Flink user who mailed me directly; he had
> the same problem and told me to use something like: rmr zgrep 
> /flink/hunch/jobgraphs/1dccee15d84e1d2cededf89758ac2482
> which allowed us to start the job again.
>
> It might be nice to investigate what went wrong, as it didn't feel good to
> have our production cluster crippled like this.
>
> Richard
>
> On Tue, Jul 23, 2019 at 12:47 PM Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Richard,
>>
>> I hope you could resolve the problem in the meantime.
>>
>> Nonetheless, maybe Till (in CC) has an idea what could have gone wrong.
>>
>> Best, Fabian
>>
>> On Wed, Jul 17, 2019 at 19:50, Richard Deurwaarder <
>> rich...@xeli.eu> wrote:
>>
>>> Hello,
>>>
>>> I've got a problem with our Flink cluster: the jobmanager no longer
>>> starts up because it tries to download a non-existent (blob) file
>>> from the ZooKeeper storage dir.
>>>
>>> We're running Flink 1.8.0 on a Kubernetes cluster and use the Google
>>> storage connector [1] to store checkpoints, savepoints, and ZooKeeper data.
>>>
>>> When I noticed the jobmanager was having problems, it was in a crash loop
>>> throwing FileNotFoundExceptions [2]:
>>> Caused by: java.io.FileNotFoundException: Item not found:
>>> some-project-flink-state/recovery/hunch/blob/job_e6ad857af7f09b56594e95fe273e9eff/blob_p-486d68fa98fa05665f341d79302c40566b81034e-306d493f5aa810b5f4f7d8d63f5b18b5.
>>> If you enabled STRICT generation consistency, it is possible that the live
>>> version is still available but the intended generation is deleted.
>>>
>>> I looked in the blob directory and I can only find
>>> /recovery/hunch/blob/job_1dccee15d84e1d2cededf89758ac2482. I've tried to
>>> fiddle around in ZooKeeper to see if I could find anything [3], but I do
>>> not really know what to look for.
>>>
>>> How could this have happened and how should I recover the job from this
>>> situation?
>>>
>>> Thanks,
>>>
>>> Richard
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/connectors.html#using-hadoop-file-system-implementations
>>> [2] https://gist.github.com/Xeli/0321031655e47006f00d38fc4bc08e16
>>> [3] https://gist.github.com/Xeli/04f6d861c5478071521ac6d2c582832a
>>>
>>
