subject:"How to gracefully handle job recovery failures"

Re: How to gracefully handle job recovery failures

2021-06-15 Thread Li Peng

Understood, thanks all!

-Li

On Fri, Jun 11, 2021 at 12:40 AM Till Rohrmann wrote:
> Hi Li,
>
> Roman is right about Flink's behavior and what you can do about it. The
> idea behind its current behavior is the following: If Flink cannot recover
> a job, it is very hard for it to tell whether it

Re: How to gracefully handle job recovery failures

2021-06-11 Thread Till Rohrmann

Hi Li, Roman is right about Flink's behavior and what you can do about it. The idea behind its current behavior is the following: If Flink cannot recover a job, it is very hard for it to tell whether it is due to an intermittent problem or a permanent one. No matter how often you retry, you can al

Re: How to gracefully handle job recovery failures

2021-06-10 Thread Roman Khachatryan

Hi Li, If I understand correctly, you want the cluster to proceed with recovery, skipping some non-recoverable jobs (but still recovering others). The only way I can think of is to remove the corresponding nodes in ZooKeeper, which is not very safe. I'm pulling in Robert and Till who might know better. R

Re: How to gracefully handle job recovery failures

2021-06-10 Thread Li Peng

Hi Roman, Is there a way to abandon job recovery after a few tries? By that I mean that I fixed this problem by restarting the cluster and not trying to recover a job. Is there some setting that emulates what I did, so I don't need to intervene manually if this happens again? Thanks, Li O

Re: How to gracefully handle job recovery failures

2021-06-10 Thread Roman Khachatryan

Hi Li,

The missing file is a serialized job graph and the job recovery can't proceed without it. Unfortunately, the cluster can't proceed if one of the jobs can't recover.

Regards,
Roman

On Thu, Jun 10, 2021 at 6:02 AM Li Peng wrote:
>
> Hey folks, we have a cluster with HA mode enabled, and re

How to gracefully handle job recovery failures

2021-06-09 Thread Li Peng

Hey folks, we have a cluster with HA mode enabled, and recently after doing a ZooKeeper restart, our Kafka cluster (Flink v. 1.11.3, Scala v. 2.12) crashed and was stuck in a crash loop, with the following error:

2021-06-10 02:14:52.123 [cluster-io-thread-1] ERROR org.apache.flink.runtime.entrypoi

6 matches
