The running jobs fail over during a ZooKeeper outage because the JobManager
loses leadership. Whether or not there is a standby JobManager makes no
difference.
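
To illustrate the point, here is a toy model in Python. It is not Flink or
ZooKeeper code: every class and function name is made up, and it assumes the
outage lasts long enough for the leader's session to expire. It only shows
that a leader that can no longer confirm its leadership stops its jobs,
however many standbys exist.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCoordinator:
    """Hypothetical stand-in for ZooKeeper: tracks contenders with a live session."""
    available: bool = True
    sessions: set = field(default_factory=set)

    def connect(self, name):
        if self.available:
            self.sessions.add(name)

    def outage(self):
        # Assumption: the outage outlasts the session timeout, so all sessions expire.
        self.available = False
        self.sessions.clear()

    def leader(self):
        # Toy rule: the alphabetically first contender with a live session leads.
        return min(self.sessions) if self.sessions else None


@dataclass
class ToyJobManager:
    name: str
    coordinator: ToyCoordinator
    jobs_running: bool = False

    def tick(self):
        # Jobs run only while this JobManager can confirm that it is the leader.
        self.jobs_running = self.coordinator.leader() == self.name
        return self.jobs_running


def simulate(num_standbys):
    """Return (jobs running before the outage, jobs running after it) for the leader."""
    zk = ToyCoordinator()
    jms = [ToyJobManager(f"jm-{i}", zk) for i in range(num_standbys + 1)]
    for jm in jms:
        zk.connect(jm.name)
    before = jms[0].tick()
    zk.outage()
    after = jms[0].tick()
    return before, after
```

Calling simulate(0) and simulate(2) gives the same result for the leader in
this model: its jobs run before the outage and are stopped after it.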

Best,
Yang

On Mon, Jan 2, 2023 at 20:51, Matthias Pohl via user <user@flink.apache.org> wrote:

> And I screwed up the reply again. -.- Here's my previous response for the
> ML thread and not only spoon_lz:
>
> Hi spoon_lz,
> Thanks for reaching out to the community and sharing your use case. You're
> right that Flink's HA feature relies on leader election. The HA backend
> being unresponsive for too long can cause problems. I'm not sure I fully
> understand what you mean when you say that standby JobManagers struggling
> with the ZK outage shouldn't affect the running jobs. If ZK is not
> responding to the standby JMs, the actual JM leader should be affected as
> well, which in turn would affect job execution. But I might be
> misunderstanding your post. Logs would help to better understand its
> context.
>
> Best,
> Matthias
>
> FYI: There is also a (somewhat stalled) discussion on the dev ML [1] about
> the recovery of too many jobs affecting Flink's performance.
>
> [1] https://lists.apache.org/thread/r3fnw13j5h04z87lb34l42nvob4pq2xj
>
> On Thu, Dec 29, 2022 at 8:55 AM spoon_lz <spoon...@126.com> wrote:
>
>> Hi All,
>> We use ZooKeeper for high availability of our jobs. Recently, a failure
>> occurred in our Flink cluster: the ZooKeeper service went down
>> unexpectedly, and all Flink jobs using that ZooKeeper failed over. So many
>> jobs restarting within a short period of time put too much pressure on the
>> cluster, which in turn caused the cluster to crash.
>> Afterwards, I checked what the ZK-based HA provides:
>> 1. Leader election
>> 2. Service discovery
>> 3. State persistence
>>
>> The unavailability of the ZooKeeper service led to the failover of the
>> Flink jobs. This seems to be because of the first point: the JM cannot
>> confirm whether it is active or standby. The other two points should not
>> have an effect. But we don't use a standby JobManager.
>> So in my opinion, if no standby JobManager is used, the availability of
>> the ZK service should not affect jobs that are running normally (of
>> course, it is understandable that a job cannot be recovered correctly if
>> an exception occurs). Is there a way to achieve something like this?
>>
>
