Hi Jun,

That is a good point. I want to make sure I understand the scenario you
mentioned, so correct me if I am wrong. Here is the sequence of events:

  1.  Broker sends ControlledShutdown request 1 to controller
  2.  Broker sends ControlledShutdown request 2 to controller due to retries
  3.  Controller processes ControlledShutdown request 1
  4.  Controller sends control requests to the broker
  5.  Broker receives ControlledShutdown response 1 from controller
  6.  Broker shuts down and restarts quickly
  7.  Controller processes ControlledShutdown request 2
  8.  Controller sends control requests to the broker

It is possible that the controller processes the broker change event between
6) and 7). In this case, the controller has already updated the cached czxid
to the up-to-date one, so the bounced broker will not reject the control
requests in 8), which causes a correctness problem.
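
For concreteness, here is a minimal sketch of the controller-side check your
suggestion in 10. would enable (controllerContext, brokerGeneration,
startControlledShutdown and info are made-up names for illustration, not the
actual code):

    // Scala sketch: drop a ControlledShutdown request whose czxid (broker
    // generation) is older than the generation currently cached from zk.
    def handleControlledShutdown(brokerId: Int, requestCzxid: Long): Unit = {
      val cachedCzxid = controllerContext.brokerGeneration(brokerId)
      if (requestCzxid < cachedCzxid) {
        // Request 2 in the sequence above: the cache was refreshed by the
        // broker change event between 6) and 7), so the stale retry is
        // rejected here instead of producing the control requests in 8).
        info(s"Ignoring stale ControlledShutdown from broker $brokerId")
      } else {
        startControlledShutdown(brokerId)
      }
    }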


Best,
Zhanxiang (Patrick) Huang

________________________________
From: Jun Rao <j...@confluent.io>
Sent: Monday, October 22, 2018 14:45
To: dev
Subject: Re: [DISCUSS] KIP-380: Detect outdated control requests and bounced 
brokers using broker generation

Hi, Patrick,

There is another thing that may be worth considering.

10. It will be useful to include the czxid also in the ControlledShutdown
request. This way, if the broker has been restarted, the controller can
ignore an old ControlledShutdown request (e.g., due to retries). This will
prevent the restarted broker from incorrectly stopping replicas.

Thanks,

Jun


On Thu, Oct 11, 2018 at 5:56 PM, Patrick Huang <hzxa21.hu...@gmail.com>
wrote:

> Hi Jun,
>
> Thanks a lot for the comments.
>
> 1. czxid is globally unique and monotonically increasing according to the
> ZooKeeper doc.
> References (from
> https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html):
> "Every change to the ZooKeeper state receives a stamp in the form of a
> *zxid* (ZooKeeper Transaction Id). This exposes the total ordering of all
> changes to ZooKeeper. Each change will have a unique zxid and if zxid1 is
> smaller than zxid2 then zxid1 happened before zxid2."
> "czxid: The zxid of the change that caused this znode to be created."
>
> 2. You are right. There will be only one broker change event fired in the
> case I mentioned because we will not register the watcher before the read.
>
> 3. Let's say we want to initialize the states of broker set A and we want
> the cluster to be aware of the absence of broker set B. The currently live
> broker set in the cluster is C.
>
>     From the design point of view, here are the rules for broker state
> transition:
>     - Pass in broker ids of A for onBrokerStartup() and pass in broker ids
> of B for onBrokerFailure().
>     - When processing onBrokerStartup(), we use the broker generation the
> controller read from zk to send requests to broker set A and use the
> previously cached broker generation to send requests to (C-A).
>     - When processing onBrokerFailure(), we use the previously cached
> broker generation to send requests to C.
>
>     From the implementation point of view, here are the steps we need to
> follow when processing BrokerChangeEvent:
>     -  Read all child nodes in /brokers/ids/ to get current brokers with
> broker generation
>     -  Detect new brokers, dead brokers and bounced brokers
>     -  Update the live broker ids in controller context
>     -  Update broker generations for new brokers in controller context
>     -  Invoke onBrokerStartup(new brokers)
>     -  Invoke onBrokerFailure(bounced brokers)
>     -  Update broker generations for bounced brokers in controller context
>     -  Invoke onBrokerStartup(bounced brokers)
>     -  Invoke onBrokerFailure(dead brokers)
>     We can further optimize the flow by avoiding sending requests to a
> broker if its broker generation is larger than the one in the controller
> context.
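>
> For concreteness, here is a rough Scala sketch of that flow (zkClient,
> controllerContext and the onBroker* callbacks are placeholders for
> illustration, not the actual controller code):
>
>     // Sketch of the BrokerChangeEvent processing steps listed above.
>     def processBrokerChange(): Unit = {
>       // czxid-based broker generations read from /brokers/ids
>       val curGens: Map[Int, Long] = zkClient.getAllBrokersWithGeneration()
>       val curIds  = curGens.keySet
>       val liveIds = controllerContext.liveBrokerIds
>       val newIds  = curIds -- liveIds
>       val deadIds = liveIds -- curIds
>       // present on both sides but with a newer generation => bounced
>       val bouncedIds = (curIds & liveIds).filter { id =>
>         curGens(id) > controllerContext.brokerGeneration(id)
>       }
>       controllerContext.setLiveBrokers(curIds)
>       newIds.foreach(id => controllerContext.updateBrokerGeneration(id, curGens(id)))
>       onBrokerStartup(newIds.toSeq)
>       onBrokerFailure(bouncedIds.toSeq)  // uses previously cached generations
>       bouncedIds.foreach(id => controllerContext.updateBrokerGeneration(id, curGens(id)))
>       onBrokerStartup(bouncedIds.toSeq)  // uses the new generations
>       onBrokerFailure(deadIds.toSeq)
>     }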
>
> I will also update the KIP to clarify how it works for BrokerChangeEvent
> processing in more detail.
>
> Thanks,
> Patrick
>
>
>
> On Thu, Oct 11, 2018 at 12:12 PM Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Patrick,
> >
> > Thanks for the KIP. Looks good to me overall and very useful. A few
> > comments below.
> >
> > 1. "will reject the requests with smaller broker generation than its
> > current generation." Is czxid monotonically increasing?
> >
> > 2. To clarify on the issue of the controller missing a ZK watcher. ZK
> > watchers are one-time watchers. Once a watcher is fired, one needs to
> > register it again before the watcher can be triggered. So, when the
> > controller is busy and a broker goes down and comes up, the first event
> > will trigger the ZK watcher. Since the controller is busy and hasn't
> > registered the watcher again, the second event actually won't fire. By
> the
> > time the controller reads from ZK, it sees that the broker is still
> > registered and thus thinks that nothing has happened to the broker, which
> > is causing the problem.
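> >
> > To make the one-time-watcher behavior concrete, here is a minimal
> > sketch with the raw ZooKeeper client (error handling and the actual
> > event handling are elided):
> >
> >     // Scala sketch: a watcher fires once and must be re-registered;
> >     // changes happening before re-registration do not fire again.
> >     import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}
> >
> >     class BrokerChangeWatcher(zk: ZooKeeper) extends Watcher {
> >       def register(): Unit = {
> >         zk.getChildren("/brokers/ids", this) // re-arms the watch
> >         ()
> >       }
> >       override def process(event: WatchedEvent): Unit = {
> >         // ... enqueue/handle the BrokerChange event (may be delayed) ...
> >         register() // events between the fire and this call coalesce
> >       }
> >     }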
> >
> > 3. "Handle broker state change: invoke onBrokerFailure(...) first, then
> > invoke onBrokerStartUp(...)". We probably want to be a bit careful here.
> > Could you clarify the broker list and the broker epoch used when making
> > these calls? We want to prevent the restarted broker from receiving a
> > partial replica list on the first LeaderAndIsr request because of this.
> >
> > Thanks,
> >
> > Jun
> >
> > On Wed, Oct 10, 2018 at 12:51 PM, Patrick Huang <hzx...@hotmail.com>
> > wrote:
> >
> > > Hey Stanislav,
> > >
> > > Sure. Thanks for your interest in this KIP. I am glad to provide more
> > > detail.
> > >
> > > broker A is initiating a controlled shutdown (restart). The Controller
> > > sends a StopReplicaRequest but it reaches broker A after it has started
> > up
> > > again. He therefore stops replicating those partitions even though he
> > > should just be starting to
> > > This is right.
> > >
> > > Controller sends a LeaderAndIsrRequest before broker A initiates a
> > restart.
> > > Broker A restarts and receives the LeaderAndIsrRequest then. It
> therefore
> > > starts leading for the partitions sent by that request and might stop
> > > leading partitions that it was leading previously.
> > > This was well explained in the linked JIRA, but I cannot understand why
> > > that would happen due to my limited experience. If Broker A leads p1
> and
> > > p2, when would a Controller send a LeaderAndIsrRequest with p1 only and
> > not
> > > want Broker A to drop leadership for p2?
> > > The root cause of the issue is that after a broker just restarts, it
> > > relies on the first LeaderAndIsrRequest to populate the partition state
> > and
> > > initializes the highwater mark checkpoint thread. The highwater mark
> > > checkpoint thread will overwrite the highwater mark checkpoint file
> based
> > > on the broker's in-memory partition states. In other words, if a
> > partition
> > > that is physically hosted by the broker is missing in the in-memory
> > > partition states map, its highwater mark will be lost after the
> highwater
> > > mark checkpoint thread overwrites the file. (Related codes:
> > > https://github.com/apache/kafka/blob/ed3bd79633ae227ad995dafc3d9f384a5534d4e9/core/src/main/scala/kafka/server/ReplicaManager.scala#L1091)
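> > >
> > > To illustrate, a much-simplified sketch of that overwrite (a
> > > paraphrase of the logic at the link above, not the actual
> > > ReplicaManager code):
> > >
> > >     // Scala sketch: the checkpoint file is rewritten from the
> > >     // in-memory partition map only, so a hosted partition missing
> > >     // from the map loses its previously checkpointed high watermark.
> > >     import org.apache.kafka.common.TopicPartition
> > >
> > >     def checkpointHighWatermarks(
> > >         hwms: Map[TopicPartition, Long],
> > >         writeCheckpoint: Map[TopicPartition, Long] => Unit): Unit =
> > >       writeCheckpoint(hwms) // entries absent from hwms are dropped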
> > >
> > >
> > > In your example, assume the first LeaderAndIsrRequest broker A receives
> > is
> > > the one initiated in the controlled shutdown logic in Controller to
> move
> > > leadership away from broker A. This LeaderAndIsrRequest only contains
> > > partitions that broker A leads, not all the partitions that broker A
> > hosts
> > > (i.e. no follower partitions), so the highwater mark for the follower
> > > partitions will be lost. Also, the first LeaderAndIsrRequest broker A
> > > receives may not necessarily be the one initiated in controlled
> shutdown
> > > logic (e.g. there can be an ongoing preferred leader election),
> although
> > I
> > > think this may not be very common.
> > >
> > > Here the controller will start processing the BrokerChange event (that
> > says
> > > that broker A shutdown) after the broker has come back up and
> > re-registered
> > > himself in ZK?
> > > How will the Controller miss the restart, won't he subsequently receive
> > > another ZK event saying that broker A has come back up?
> > > Controller will not miss the BrokerChange event and actually there will
> > be
> > > two BrokerChange events fired in this case (one for broker
> deregistration
> > > in zk and one for registration). However, when processing the
> > > BrokerChangeEvent, controller needs to do a read from zookeeper to get
> > back
> > > the current brokers in the cluster and if the bounced broker already
> > joined
> > > the cluster by this time, controller will not know this broker has been
> > > bounced because it sees no diff between zk and its in-memory cache. So
> > > basically the processing of both BrokerChange events becomes a no-op.
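> > >
> > > To make the no-op concrete, a tiny sketch of the set diff the
> > > controller effectively computes (names are placeholders):
> > >
> > >     // Scala sketch: with only broker-id set diffs, a quick bounce is
> > >     // invisible because the broker has already re-registered in zk.
> > >     val curIds  = zkClient.getAllBrokerIds()
> > >     val newIds  = curIds -- controllerContext.liveBrokerIds  // empty
> > >     val deadIds = controllerContext.liveBrokerIds -- curIds  // empty
> > >     // no diff on either event => both are processed as no-ops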
> > >
> > >
> > > Hope that answers your questions. Feel free to follow up if I am
> missing
> > > something.
> > >
> > >
> > > Thanks,
> > > Zhanxiang (Patrick) Huang
> > >
> > > ________________________________
> > > From: Stanislav Kozlovski <stanis...@confluent.io>
> > > Sent: Wednesday, October 10, 2018 7:22
> > > To: dev@kafka.apache.org
> > > Subject: Re: [DISCUSS] KIP-380: Detect outdated control requests and
> > > bounced brokers using broker generation
> > >
> > > Hi Patrick,
> > >
> > > Thanks for the KIP! Fixing such correctness issues is always very
> > welcome -
> > > they're commonly hard to diagnose and debug when they happen in
> > production.
> > >
> > > I was wondering if I understood the potential correctness issues
> > correctly.
> > > Here is what I got:
> > >
> > >
> > >    - If a broker bounces during controlled shutdown, the bounced broker
> > may
> > >    accidentally process its earlier generation’s StopReplicaRequest
> sent
> > > from
> > >    the active controller for one of its follower replicas, leaving the
> > > replica
> > >    offline while its remaining replicas may stay online
> > >
> > > broker A is initiating a controlled shutdown (restart). The Controller
> > > sends a StopReplicaRequest but it reaches broker A after it has started
> > up
> > > again. He therefore stops replicating those partitions even though he
> > > should just be starting to
> > >
> > >
> > >    - If the first LeaderAndIsrRequest that a broker processes is sent
> by
> > >    the active controller before its startup, the broker will overwrite
> > the
> > >    high watermark checkpoint file and may cause incorrect truncation (
> > >    KAFKA-7235 <https://issues.apache.org/jira/browse/KAFKA-7235>)
> > >
> > > Controller sends a LeaderAndIsrRequest before broker A initiates a
> > restart.
> > > Broker A restarts and receives the LeaderAndIsrRequest then. It
> therefore
> > > starts leading for the partitions sent by that request and might stop
> > > leading partitions that it was leading previously.
> > > This was well explained in the linked JIRA, but I cannot understand why
> > > that would happen due to my limited experience. If Broker A leads p1
> and
> > > p2, when would a Controller send a LeaderAndIsrRequest with p1 only and
> > not
> > > want Broker A to drop leadership for p2?
> > >
> > >
> > >    - If a broker bounces very quickly, the controller may start
> > processing
> > >    the BrokerChange event after the broker already re-registers itself
> in
> > > zk.
> > >    In this case, controller will miss the broker restart and will not
> > send
> > > any
> > >    requests to the broker for initialization. The broker will not be
> able
> > > to
> > >    accept traffic.
> > >
> > > Here the controller will start processing the BrokerChange event (that
> > says
> > > that broker A shutdown) after the broker has come back up and
> > re-registered
> > > himself in ZK?
> > > How will the Controller miss the restart, won't he subsequently receive
> > > another ZK event saying that broker A has come back up?
> > >
> > >
> > > Could we explain these potential problems in a bit more detail just so
> > they
> > > could be more easily digestable by novices?
> > >
> > > Thanks,
> > > Stanislav
> > >
> > > On Wed, Oct 10, 2018 at 9:21 AM Dong Lin <lindon...@gmail.com> wrote:
> > >
> > > > Hey Patrick,
> > > >
> > > > Thanks much for the KIP. The KIP is very well written.
> > > >
> > > > LGTM.  +1 (binding)
> > > >
> > > > Thanks,
> > > > Dong
> > > >
> > > >
> > > > On Tue, Oct 9, 2018 at 11:46 PM Patrick Huang <hzx...@hotmail.com>
> > > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > Please find the below KIP which proposes the concept of broker
> > > generation
> > > > > to resolve issues caused by controller missing broker state changes
> > and
> > > > > broker processing outdated control requests.
> > > > >
> > > > >
> > > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-380%3A+Detect+outdated+control+requests+and+bounced+brokers+using+broker+generation
> > > > >
> > > > > All comments are appreciated.
> > > > >
> > > > > Best,
> > > > > Zhanxiang (Patrick) Huang
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Stanislav
> > >
> >
>
