The running jobs try to fail over during a ZooKeeper outage because the JobManager lost its leadership. Whether or not you have a standby JobManager makes no difference.
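
To illustrate the point above, here is a toy Python model. It is not Flink code: the `JobManager` class and the `on_zookeeper_lost` function are hypothetical names made up for this sketch. It assumes, as a simplification, that a leader which can no longer reach ZooKeeper gives up its leadership, and that the jobs it was running then have to be recovered.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobManager:
    """Hypothetical stand-in for a JobManager process (not a Flink class)."""
    name: str
    is_leader: bool = False
    running_jobs: List[str] = field(default_factory=list)


def on_zookeeper_lost(job_managers: List[JobManager]) -> List[str]:
    """Simplified model of every JobManager losing ZooKeeper at the same time.

    The leader cannot confirm its leadership any more, so it revokes it and
    its jobs have to be recovered later. Standbys are not touched here: they
    cannot be elected while ZooKeeper is unreachable, so they change nothing.
    Returns the names of the jobs that need recovery.
    """
    jobs_to_recover: List[str] = []
    for jm in job_managers:
        if jm.is_leader:
            jm.is_leader = False
            jobs_to_recover.extend(jm.running_jobs)
            jm.running_jobs = []
    return jobs_to_recover


# One setup without a standby JobManager, one with a standby.
without_standby = [JobManager("jm-0", True, ["job-a", "job-b"])]
with_standby = [
    JobManager("jm-0", True, ["job-a", "job-b"]),
    JobManager("jm-1"),
]

# In this model the same jobs are affected in both setups.
affected_without = on_zookeeper_lost(without_standby)
affected_with = on_zookeeper_lost(with_standby)
assert affected_without == affected_with
```

In this sketch the set of affected jobs depends only on the leader losing its leadership, which is why adding or removing standbys does not change the outcome.
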

Best,
Yang

Matthias Pohl via user <user@flink.apache.org> wrote on Mon, Jan 2, 2023 at 20:51:

> And I screwed up the reply again. -.- Here's my previous response for the
> ML thread and not only spoon_lz:
>
> Hi spoon_lz,
> Thanks for reaching out to the community and sharing your use case. You're
> right about the fact that Flink's HA feature relies on the leader election.
> The HA backend not being responsive for too long might cause problems. I'm
> not sure I fully understand what you mean by saying that the standby
> JobManagers struggling with the ZK outage shouldn't affect the running
> jobs. If ZK is not responding for the standby JMs, the actual JM leader
> should be affected as well, which, as a consequence, would affect the job
> execution. But I might misunderstand your post. Logs would be helpful to
> get a better understanding of your post's context.
>
> Best,
> Matthias
>
> FYI: There is also (a kind of stalled) discussion in the dev ML [1] about
> recovery of too many jobs affecting Flink's performance.
>
> [1] https://lists.apache.org/thread/r3fnw13j5h04z87lb34l42nvob4pq2xj
>
> On Thu, Dec 29, 2022 at 8:55 AM spoon_lz <spoon...@126.com> wrote:
>
>> Hi All,
>> We use ZooKeeper to achieve high availability for our jobs. Recently, a
>> failure occurred in our Flink cluster: because of an abnormal downtime of
>> the ZooKeeper service, all the Flink jobs using this ZooKeeper failed
>> over. Restarting a large number of jobs in a short period of time put too
>> much pressure on the cluster, which in turn caused the cluster to crash.
>> Afterwards, I checked what the HA feature uses ZK for:
>> 1. Leader election
>> 2. Service discovery
>> 3. State persistence
>>
>> The unavailability of the ZooKeeper service leads to failover of the
>> Flink jobs. This seems to be because of the first point: the JM cannot
>> confirm whether it is active or standby. The other two points should not
>> affect it. But we didn't use a standby JobManager.
>> So in my opinion, if no standby JobManager is used, whether the ZK
>> service is available should not affect jobs that are running normally (of
>> course, it is understandable that a task cannot be recovered correctly if
>> an exception occurs). I don't know whether there is a way to achieve
>> something like this.