Re: Jobmanager fails to come up if the job has an issue

Matthias Pohl via user Mon, 26 Sep 2022 05:26:17 -0700

Yes, the JobManager will fail over in HA mode and all jobs will be
recovered.


On Mon, Sep 26, 2022 at 2:06 PM ramkrishna vasudevan <
ramvasu.fl...@gmail.com> wrote:

> Thanks @Matthias Pohl <matthias.p...@aiven.io>. This is informative. So
> in a session cluster with more than one job, if only one of them has this
> issue, will we still face the same problem?
>
> Regards
> Ram
>
> On Mon, Sep 26, 2022 at 4:32 PM Matthias Pohl <matthias.p...@aiven.io>
> wrote:
>
>> I see. Thanks for sharing the logs. It's related to FLINK-9097 [1]. In
>> order for the job to not be cleaned up entirely after a failure while
>> submitting the job, the JobManager is failed fatally resulting in a
>> failover. That's what you're experiencing.
>>
>> One solution is to fix the permission issue to make the job recover
>> without problems. If that's not what you want to do, you could delete the
>> entry with the key 'jobGraph-04ae99777ee2ed34c13fe8120e68436e' from the
>> JobGraphStore ConfigMap (based on your logs it should
>> be flink-972ac3d8028e45fcafa9b8b7b7f1dafb-cluster-config-map). This will
>> prevent the JobManager from recovering this specific job. Keep in mind that
>> you have to clean up any job-related data yourself in that case.
>>
>> I hope that helps.
>> Matthias
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-9097
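
The ConfigMap cleanup described above can be sketched as follows. This is a
minimal sketch, not an official Flink procedure. It assumes the job graph
entry is stored as a key under the ConfigMap's 'data' field, that kubectl is
available, and that the cluster runs in the 'default' namespace (the thread
does not state the namespace). The helper names are hypothetical. The script
only builds and prints the kubectl command; it does not run it.

```python
import json
import shlex


def escape_json_pointer(token: str) -> str:
    """Escape a key for use in a JSON Pointer (RFC 6901): '~' -> '~0', '/' -> '~1'."""
    return token.replace("~", "~0").replace("/", "~1")


def build_remove_entry_patch(key: str) -> list:
    """Build a JSON Patch (RFC 6902) that removes one key from a ConfigMap's data."""
    return [{"op": "remove", "path": "/data/" + escape_json_pointer(key)}]


def build_kubectl_command(config_map: str, key: str, namespace: str) -> list:
    """Build the 'kubectl patch' argument list that applies the patch."""
    patch = json.dumps(build_remove_entry_patch(key))
    return [
        "kubectl", "patch", "configmap", config_map,
        "--namespace", namespace,
        "--type=json",
        "-p", patch,
    ]


# Values taken from the thread. Verify the exact ConfigMap name first with
# 'kubectl get configmaps' before changing anything.
CONFIG_MAP = "flink-972ac3d8028e45fcafa9b8b7b7f1dafb-cluster-config-map"
JOB_GRAPH_KEY = "jobGraph-04ae99777ee2ed34c13fe8120e68436e"
NAMESPACE = "default"  # assumption: not stated in the thread

command = build_kubectl_command(CONFIG_MAP, JOB_GRAPH_KEY, NAMESPACE)
# Print a shell-safe command line for review before running it manually.
print(shlex.join(command))
```

Printing the command instead of executing it leaves room to check the
ConfigMap contents first. Removing the entry only stops the recovery of that
job; any remaining job-related data still has to be cleaned up separately.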
>>
>> On Mon, Sep 26, 2022 at 12:26 PM ramkrishna vasudevan <
>> ramvasu.fl...@gmail.com> wrote:
>>
>>> I got some logs and stack traces from our backend storage. This is not
>>> the entire log though. Could this be useful? With this set of log messages
>>> the job manager kept restarting.
>>>
>>> Regards
>>> Ram
>>>
>>> On Mon, Sep 26, 2022 at 3:11 PM ramkrishna vasudevan <
>>> ramvasu.fl...@gmail.com> wrote:
>>>
>>>> Thank you very much for the reply. I have lost the k8s cluster in this
>>>> case before I could capture the logs. I will try to repro this and get back
>>>> to you.
>>>>
>>>> Regards
>>>> Ram
>>>>
>>>> On Mon, Sep 26, 2022 at 12:42 PM Matthias Pohl <matthias.p...@aiven.io>
>>>> wrote:
>>>>
>>>>> Hi Ramkrishna,
>>>>> thanks for reaching out to the Flink community. Could you share the
>>>>> JobManager logs to get a better understanding of what's going on? I'm
>>>>> wondering why the JobManager is failing when the actual problem is that
>>>>> the job is struggling to access a folder. It sounds like there are multiple
>>>>> problems here.
>>>>>
>>>>> Best,
>>>>> Matthias
>>>>>
>>>>> On Mon, Sep 26, 2022 at 6:25 AM ramkrishna vasudevan <
>>>>> ramvasu.fl...@gmail.com> wrote:
>>>>>
>>>>>> Hi all
>>>>>>
>>>>>> I have a simple job that reads from a given path in cloud storage
>>>>>> to watch for new files in a given folder. While I set up my job there was
>>>>>> some permission issue on the folder. The job is a STREAMING job.
>>>>>> The cluster is set up in session mode and is running on Kubernetes.
>>>>>> Since then the job manager has been failing to come back up, and every
>>>>>> time it fails with the permission issue. How should I recover my cluster
>>>>>> in this case? Since the JM is not up, the UI is also not working, so
>>>>>> how do I remove the bad job from the JM?
>>>>>>
>>>>>> Regards
>>>>>> Ram
>>>>>>
>>>>>
