> My question is: can we get some insight into this decision, and could we
> add some tunable configuration so that users can decide how long they can
> endure such an uncertain suspended state in their jobs?

For the specific question: Curator provides a configuration option for the
session timeout, and a LOST event is generated if the disconnection lasts
longer than the configured timeout.

https://github.com/apache/flink/blob/58a7c80fa35424608ad44d1d6691d1407be0092a/flink-runtime/src/main/java/org/apache/flink/runtime/util/ZooKeeperUtils.java#L101-L102
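
For illustration only, here is a hand-written sketch (not Flink's actual
ZooKeeperUtils code; the connect string and values are made up) of how such a
session timeout is wired into a Curator client. The session timeout is what
bounds how long a connection may stay SUSPENDED before the session expires
and LOST is reported:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkSessionTimeoutSketch {
    public static void main(String[] args) {
        CuratorFramework client = CuratorFrameworkFactory.builder()
            // Illustrative ensemble address; Flink reads this from its
            // high-availability ZooKeeper configuration.
            .connectString("zk-1:2181,zk-2:2181,zk-3:2181")
            // Upper bound on how long a SUSPENDED connection can linger
            // before the session expires and LOST is reported.
            .sessionTimeoutMs(60_000)
            .connectionTimeoutMs(15_000)
            .retryPolicy(new ExponentialBackoffRetry(1_000, 3))
            .build();
        client.start();
    }
}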


Best,
tison.


On Fri, Apr 23, 2021 at 12:57 AM tison <wander4...@gmail.com> wrote:

> To be concrete, if the ZK connection is suspended and then reconnected,
> NodeCache already does the reset work for you, and if the leader epoch has
> been updated, the fencing token (a.k.a. the leader session id) will be
> updated as well, so you will notice it.
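>
> As a hand-written sketch of that flow (not the actual code in
> ZooKeeperLeaderRetrievalService; the znode path is made up), the NodeCache
> listener is where an updated leader session id becomes visible after a
> reconnect, and the connection state listener is where SUSPENDED, RECONNECTED
> and LOST can be told apart:
>
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.recipes.cache.NodeCache;
> import org.apache.curator.framework.state.ConnectionState;
>
> public class LeaderWatchSketch {
>     // The CuratorFramework is assumed to be built and started elsewhere.
>     static void watch(CuratorFramework client) throws Exception {
>         NodeCache cache = new NodeCache(client, "/leader/job-x");
>         cache.getListenable().addListener(() -> {
>             // Fires again once the cache has rebuilt itself after a
>             // reconnect, so a changed leader session id (fencing token)
>             // shows up here.
>             byte[] data = cache.getCurrentData() == null
>                     ? null : cache.getCurrentData().getData();
>             // ... decode leader address + leader session id from data ...
>         });
>         cache.start();
>
>         client.getConnectionStateListenable().addListener((c, newState) -> {
>             if (newState == ConnectionState.SUSPENDED) {
>                 // Connection in doubt: leadership uncertain, not necessarily gone.
>             } else if (newState == ConnectionState.LOST) {
>                 // Session expired: leadership must be treated as lost.
>             }
>         });
>     }
> }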
>
> If the ZK connection is permanently lost, I think it is a system-wide fault,
> and you'd better restart the job from a checkpoint/savepoint with a working
> ZK ensemble.
>
> I may be drawing this conclusion without a more detailed investigation,
> though.
>
> Best,
> tison.
>
>
> On Fri, Apr 23, 2021 at 12:35 AM tison <wander4...@gmail.com> wrote:
>
>> > Unfortunately, we do not have any progress on this ticket.
>>
>> Here is a PR[1].
>>
>> Here is the base PR[2] I made about one year ago, which never received a
>> follow-up review.
>>
>> qinnc...@gmail.com:
>>
>> It requires further investigation into the impact introduced by
>> FLINK-18677[3]. I do have some comments[4], but so far I regard it as a
>> stability problem rather than a correctness problem.
>>
>> FLINK-18677 tries to "fix" an unreasonable scenario where ZK is lost
>> FOREVER, and I don't want to spend more time on this before there is a
>> reaction on FLINK-10052; otherwise it is highly likely to be in vain again,
>> from my perspective.
>>
>> Best,
>> tison.
>>
>> [1] https://github.com/apache/flink/pull/15675
>> [2] https://github.com/apache/flink/pull/11338
>> [3] https://issues.apache.org/jira/browse/FLINK-18677
>> [4] https://github.com/apache/flink/pull/13055#discussion_r615871963
>>
>>
>>
>> On Fri, Apr 23, 2021 at 12:15 AM Chen Qin <qinnc...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> A quick update here: we have been running load tests and so far haven't
>>> seen the suspended state cause job restarts.
>>>
>>> Some findings: instead of the Curator framework capturing the suspended
>>> state and actively notifying that the leader is lost, we have seen task
>>> managers propagate unhandled errors from the ZK client, most likely because
>>> high-availability.zookeeper.client.max-retry-attempts was set to 3 with a
>>> 5-second retry interval. It would be great if this exception were handled
>>> gracefully with a meaningful error message. These errors show up when other
>>> task managers die due to user-code exceptions; we would like more insight
>>> into that as well.
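>>>
>>> For reference, here is a rough sketch of the programmatic equivalent of
>>> the client-retry settings we mean (the retry-wait and session-timeout keys
>>> are from memory, so please double-check them against the ZooKeeperUtils
>>> code linked earlier in this thread):
>>>
>>> import org.apache.flink.configuration.Configuration;
>>>
>>> public class ZkClientRetrySketch {
>>>     public static void main(String[] args) {
>>>         Configuration conf = new Configuration();
>>>         // Values matching the setup described above: 3 attempts with a
>>>         // 5-second wait between retries.
>>>         conf.setString("high-availability.zookeeper.client.max-retry-attempts", "3");
>>>         conf.setString("high-availability.zookeeper.client.retry-wait", "5000");
>>>         // The ZK session timeout (60s by default, if I recall correctly)
>>>         // is a separate knob.
>>>         conf.setString("high-availability.zookeeper.client.session-timeout", "60000");
>>>     }
>>> }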
>>>
>>> For more context, Lu from our team also filed [2] describing an issue with
>>> 1.9; so far we haven't seen a regression in the ongoing load-testing jobs.
>>>
>>> Thanks,
>>> Chen
>>>
>>> Caused by:
>>> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException:
>>> KeeperErrorCode = ConnectionLoss
>>>     at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>>>     at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-10052
>>> [2] https://issues.apache.org/jira/browse/FLINK-19985
>>>
>>>
>>> On Thu, Apr 15, 2021 at 7:27 PM Yang Wang <danrtsey...@gmail.com> wrote:
>>>
>>> > Thanks for trying the unfinished PR and sharing the testing results. Glad
>>> > to hear that it could work, and I am really looking forward to the results
>>> > of the more stringent load testing.
>>> >
>>> > After that, I think we could revive this ticket.
>>> >
>>> >
>>> > Best,
>>> > Yang
>>> >
>>> > On Fri, Apr 16, 2021 at 2:01 AM Chen Qin <qinnc...@gmail.com> wrote:
>>> >
>>> >> Hi there,
>>> >>
>>> >> Thanks for the pointers to the related changes and JIRA. Some updates
>>> >> from our side: we applied a patch by merging FLINK-10052
>>> >> <https://issues.apache.org/jira/browse/FLINK-10052> with master and by
>>> >> handling only the LOST state, leveraging the
>>> >> SessionConnectionStateErrorPolicy that FLINK-10052
>>> >> <https://issues.apache.org/jira/browse/FLINK-10052> introduced.
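>>> >>
>>> >> For anyone following along, here is a minimal sketch of what that policy
>>> >> looks like on the Curator side (connect string and timeouts are
>>> >> illustrative, and this is not the actual patch): with the session error
>>> >> policy, only a real session loss counts as an error, while a merely
>>> >> SUSPENDED connection does not revoke leadership.
>>> >>
>>> >> import org.apache.curator.framework.CuratorFramework;
>>> >> import org.apache.curator.framework.CuratorFrameworkFactory;
>>> >> import org.apache.curator.framework.state.SessionConnectionStateErrorPolicy;
>>> >> import org.apache.curator.retry.ExponentialBackoffRetry;
>>> >>
>>> >> public class ErrorPolicySketch {
>>> >>     public static void main(String[] args) {
>>> >>         CuratorFramework client = CuratorFrameworkFactory.builder()
>>> >>             .connectString("zk-1:2181")          // illustrative
>>> >>             .sessionTimeoutMs(60_000)
>>> >>             .retryPolicy(new ExponentialBackoffRetry(1_000, 3))
>>> >>             // Treat only SESSION-level failures (LOST) as errors.
>>> >>             .connectionStateErrorPolicy(new SessionConnectionStateErrorPolicy())
>>> >>             .build();
>>> >>         client.start();
>>> >>     }
>>> >> }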
>>> >>
>>> >> Preliminary results were good: the same workload (240 TMs) in the same
>>> >> environment runs stably, without the frequent restarts caused by the
>>> >> suspended state (which appears to be a false positive). We are working on
>>> >> more stringent load testing as well as chaos testing (blocking ZK). Will
>>> >> keep folks posted.
>>> >>
>>> >> Thanks,
>>> >> Chen
>>> >>
>>> >>
>>> >> On Tue, Apr 13, 2021 at 1:34 AM Till Rohrmann <trohrm...@apache.org>
>>> >> wrote:
>>> >>
>>> >>> Hi Chenqin,
>>> >>>
>>> >>> The current rationale behind assuming a leadership loss when seeing a
>>> >>> SUSPENDED connection is to assume the worst and to be on the safe side.
>>> >>>
>>> >>> Yang Wang is correct. FLINK-10052 [1] aims to make this behaviour
>>> >>> configurable. Unfortunately, the community has not yet had enough time
>>> >>> to complete this feature.
>>> >>>
>>> >>> [1] https://issues.apache.org/jira/browse/FLINK-10052
>>> >>>
>>> >>> Cheers,
>>> >>> Till
>>> >>>
>>> >>> On Tue, Apr 13, 2021 at 8:25 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>> >>>
>>> >>> > This might be related to FLINK-10052[1].
>>> >>> > Unfortunately, we do not have any progress on this ticket.
>>> >>> >
>>> >>> > cc @Till Rohrmann <trohrm...@apache.org>
>>> >>> >
>>> >>> > Best,
>>> >>> > Yang
>>> >>> >
>>> >>> > On Tue, Apr 13, 2021 at 7:31 AM chenqin <qinnc...@gmail.com> wrote:
>>> >>> >
>>> >>> >> Hi there,
>>> >>> >>
>>> >>> >> We observed several jobs running on 1.11 restart due to job leader
>>> >>> >> loss. Digging deeper, the issue seems related to the SUSPENDED state
>>> >>> >> handler in ZooKeeperLeaderRetrievalService.
>>> >>> >>
>>> >>> >> AFAIK, the SUSPENDED state is expected when ZK is not certain whether
>>> >>> >> the leader is still alive; it can be followed by either RECONNECTED
>>> >>> >> or LOST. In the current implementation [1], we treat the SUSPENDED
>>> >>> >> state the same as the LOST state and actively shut down the job. This
>>> >>> >> poses a stability issue in large HA settings.
>>> >>> >>
>>> >>> >> My question is: can we get some insight into this decision, and
>>> >>> >> could we add some tunable configuration so that users can decide how
>>> >>> >> long they can endure such an uncertain suspended state in their jobs?
>>> >>> >>
>>> >>> >> Thanks,
>>> >>> >> Chen
>>> >>> >>
>>> >>> >> [1]
>>> >>> >> https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >> --
>>> >>> >> Sent from:
>>> >>> >> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
>>> >>> >>
>>> >>> >
>>> >>>
>>> >>
>>>
>>
