Yes, the JobManager will fail over in HA mode and all jobs will be recovered.
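
If you go with the workaround Matthias describes in the quoted message below (removing the bad job's entry from the JobGraphStore ConfigMap), it could be scripted roughly as follows. This is only a sketch under assumptions, not something taken from the Flink docs: it uses the official Kubernetes Python client, the helper names (job_graph_key, removal_patch, remove_job_graph) are made up for illustration, and the namespace is whatever your cluster runs in. The key prefix "jobGraph-" and the job ID come from the thread. The patch relies on the usual merge-patch behaviour where a null value removes a key.

```python
import json


def job_graph_key(job_id: str) -> str:
    """Build the ConfigMap key for a job, using the 'jobGraph-' prefix from the thread."""
    return f"jobGraph-{job_id}"


def removal_patch(job_id: str) -> dict:
    """Patch body for the ConfigMap: a null value asks the API server to drop the key."""
    return {"data": {job_graph_key(job_id): None}}


def remove_job_graph(config_map: str, namespace: str, job_id: str) -> None:
    """Hypothetical helper: apply the patch to the HA ConfigMap of the session cluster."""
    # Imported lazily so the two helpers above work without the client installed.
    from kubernetes import client, config

    config.load_kube_config()  # assumes a local kubeconfig with access to the cluster
    client.CoreV1Api().patch_namespaced_config_map(
        name=config_map,
        namespace=namespace,
        body=removal_patch(job_id),
    )


if __name__ == "__main__":
    # Only prints the patch; call remove_job_graph(...) yourself once it looks right.
    print(json.dumps(removal_patch("04ae99777ee2ed34c13fe8120e68436e")))
```

Printing the patch first is deliberate: the ConfigMap is shared by every job in the session cluster, so it is worth checking the key before applying anything. As Matthias notes, any job-related data left behind still has to be cleaned up by hand.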

On Mon, Sep 26, 2022 at 2:06 PM ramkrishna vasudevan <ramvasu.fl...@gmail.com> wrote:

> Thanks @Matthias Pohl <matthias.p...@aiven.io>. This is informative. So
> generally, in a session cluster, if I have more than one job and only one of
> them has this issue, will we still face the same problem?
>
> Regards
> Ram
>
> On Mon, Sep 26, 2022 at 4:32 PM Matthias Pohl <matthias.p...@aiven.io>
> wrote:
>
>> I see. Thanks for sharing the logs. It's related to FLINK-9097 [1]. In
>> order for the job to not be cleaned up entirely after a failure while
>> submitting the job, the JobManager is failed fatally, resulting in a
>> failover. That's what you're experiencing.
>>
>> One solution is to fix the permission issue to make the job recover
>> without problems. If that's not what you want to do, you could delete the
>> entry with the key 'jobGraph-04ae99777ee2ed34c13fe8120e68436e' from the
>> JobGraphStore ConfigMap (based on your logs it should
>> be flink-972ac3d8028e45fcafa9b8b7b7f1dafb-custer-config-map). This will
>> prevent the JobManager from recovering this specific job. Keep in mind that
>> you have to clean up any job-related data by yourself in that case.
>>
>> I hope that helps.
>> Matthias
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-9097
>>
>> On Mon, Sep 26, 2022 at 12:26 PM ramkrishna vasudevan <
>> ramvasu.fl...@gmail.com> wrote:
>>
>>> I got some logs and stack traces from our backend storage. This is not
>>> the entire log though. Can this be useful? With this set of log messages
>>> the job manager kept restarting.
>>>
>>> Regards
>>> Ram
>>>
>>> On Mon, Sep 26, 2022 at 3:11 PM ramkrishna vasudevan <
>>> ramvasu.fl...@gmail.com> wrote:
>>>
>>>> Thank you very much for the reply. I have lost the k8s cluster in this
>>>> case before I could capture the logs. I will try to repro this and get
>>>> back to you.
>>>>
>>>> Regards
>>>> Ram
>>>>
>>>> On Mon, Sep 26, 2022 at 12:42 PM Matthias Pohl <matthias.p...@aiven.io>
>>>> wrote:
>>>>
>>>>> Hi Ramkrishna,
>>>>> thanks for reaching out to the Flink community. Could you share the
>>>>> JobManager logs to get a better understanding of what's going on? I'm
>>>>> wondering why the JobManager is failing when the actual problem is that
>>>>> the job is struggling to access a folder. It sounds like there are
>>>>> multiple problems here.
>>>>>
>>>>> Best,
>>>>> Matthias
>>>>>
>>>>> On Mon, Sep 26, 2022 at 6:25 AM ramkrishna vasudevan <
>>>>> ramvasu.fl...@gmail.com> wrote:
>>>>>
>>>>>> Hi all
>>>>>>
>>>>>> I have a simple job where we read from a given path in cloud storage
>>>>>> to watch for new files in a given folder. While I set up my job there
>>>>>> was some permission issue on the folder. The job is a STREAMING job.
>>>>>> The cluster is set up in session mode and is running on Kubernetes.
>>>>>> Since then the job manager has been failing to come back up, and every
>>>>>> time it fails with the permission issue. But the point is: how should I
>>>>>> recover my cluster in this case? Since the JM is not there, the UI is
>>>>>> also not working, so how do I remove the bad job from the JM?
>>>>>>
>>>>>> Regards
>>>>>> Ram