And I screwed up the reply again. -.- Here's my previous response for the
ML thread and not only spoon_lz:

Hi spoon_lz,
Thanks for reaching out to the community and sharing your use case. You're
right that Flink's HA feature relies on leader election. An HA backend that
is unresponsive for too long can cause problems. I'm not sure I fully
understand your point that standby JobManagers struggling with the ZK
outage shouldn't affect the running jobs. If ZK is not responding for the
standby JMs, the current JM leader should be affected as well, which in
turn would affect job execution. But I might be misunderstanding your post.
Logs would help to better understand the context.
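
For reference, here is a rough sketch of the ZooKeeper HA settings this
discussion revolves around. The hostnames, storage path, cluster ID, and
timeout value are made-up placeholders, not taken from your setup. Whether
tolerating suspended connections helps in your case is an assumption on my
side; it depends on your Flink version and on how long ZK stays unreachable.

```yaml
# Hypothetical flink-conf.yaml excerpt; all values are placeholders.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.storageDir: hdfs:///flink/ha/
high-availability.cluster-id: /my-flink-cluster

# Wait until the ZK connection is marked as lost (rather than merely
# suspended) before leadership is revoked. This only covers temporary
# connection instabilities, not a long ZK outage.
high-availability.zookeeper.client.tolerate-suspended-connections: true

# ZK client session timeout in milliseconds (placeholder value).
high-availability.zookeeper.client.session-timeout: 60000
```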

Best,
Matthias

FYI: There is also a (somewhat stalled) discussion on the dev ML [1] about
the recovery of too many jobs affecting Flink's performance.

[1] https://lists.apache.org/thread/r3fnw13j5h04z87lb34l42nvob4pq2xj

On Thu, Dec 29, 2022 at 8:55 AM spoon_lz <spoon...@126.com> wrote:

> Hi All,
> We use ZooKeeper to achieve high availability for our jobs. Recently, a
> failure occurred in our Flink cluster. Because the ZooKeeper service went
> down unexpectedly, all Flink jobs using this ZooKeeper failed over. The
> failover restart of a large number of jobs in a short period of time put
> too much pressure on the cluster, which in turn caused the cluster to
> crash.
> Afterwards, I checked the HA functions that ZK provides:
> 1. Leader election
> 2. Service discovery
> 3. State persistence
>
> The unavailability of the ZooKeeper service leads to failover of the Flink
> jobs. This seems to be because of the first point: the JM cannot confirm
> whether it is active or standby. The other two points should not affect
> it. But we didn't use a standby JobManager.
> So in my opinion, if no standby JobManager is used, the availability of
> the ZK service should not affect jobs that are running normally (of
> course, it is understandable that a job cannot be recovered correctly if
> an exception occurs). I don't know if there is a way to achieve something
> like this.
>
