Hello!

I changed the KIP a bit, specifying that the benefit is certain for consumers not participating in a group, but that other clients can benefit as well in certain situations.
You can see the changes in the history [1]

Thank you!

Ivan

[1] https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=240881396&originalVersion=10&revisedVersion=11

On 2023/07/15 16:37:52 Ivan Yurchenko wrote:
> Hello!
>
> I've made several changes to the KIP based on the comments:
>
> 1. Reduced the scope to producer and consumer clients only.
> 2. Added more details to the description of the rebootstrap process.
> 3. Documented the role of low values of reconnect.backoff.max.ms in preventing rebootstrapping.
> 4. Some wording changes.
>
> You can see the changes in the history [1]
>
> I'm planning to put the KIP to a vote in a few days if there are no new comments.
>
> Thank you!
>
> Ivan
>
> [1] https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=240881396&selectedPageVersions=9&selectedPageVersions=5
>
> On Tue, 30 May 2023 at 08:23, Ivan Yurchenko <iv...@gmail.com> wrote:
>
> > Hi Chris and all,
> >
> > > I believe the logic you've linked is only applicable for the producer and consumer clients; the admin client does something different (see [1]).
> >
> > I see, thank you for the pointer. It seems the admin client is fairly different from the producer and consumer. It probably makes sense to reduce the scope of the KIP to the producer and consumer clients only.
> >
> > > it'd be nice to have a definition of when re-bootstrapping would occur that doesn't rely on internal implementation details. What user-visible phenomena can we identify that would lead to a re-bootstrapping?
> >
> > Let's put it this way: "Re-bootstrapping means that the client forgets the nodes it knows about and falls back on the bootstrap nodes as if it had just been initialized. Re-bootstrapping happens when, during a metadata update (which may be scheduled by `metadata.max.age.ms` or caused by certain error responses like NOT_LEADER_OR_FOLLOWER, REPLICA_NOT_AVAILABLE, etc.), the client doesn't have a node with an established or establishable connection."
> > Does this sound good?
> >
> > > I also believe that if someone has "reconnect.backoff.max.ms" set to a low-enough value, NetworkClient::leastLoadedNode may never return null. In that case, shouldn't we still attempt a re-bootstrap at some point (if the user has enabled this feature)?
> >
> > Yes, you're right. In particular, `canConnect` here [1] can always return `true` if `reconnect.backoff.max.ms` is low enough. It seems pretty difficult to find a good criterion for when re-bootstrapping should be forced in this case, so it would be difficult to configure and reason about. I think it's worth mentioning in the KIP and later in the documentation, but we should not try to do anything special here.
> >
> > > Would it make sense to re-bootstrap only after "metadata.max.age.ms" has elapsed since the last metadata update, and when at least one request has been made to contact each known server and been met with failure?
> >
> > The first condition is satisfied by the check at the beginning of `maybeUpdate` [2]. It seems to me that the second one is also satisfied by `leastLoadedNode`. Admittedly, it's more relaxed than what you propose: it tracks the unavailability of nodes as detected by all types of requests, not only by metadata requests.
> > What do you think, would this be enough?
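> >
> > To make this concrete, here is a rough sketch of the shape I have in mind, injected into the metadata update path. The names (`rebootstrapEnabled`, `metadata.rebootstrap()`, `sendMetadataRequest`) are illustrative only, not the actual NetworkClient/Metadata code:
> >
> > long maybeUpdate(long now) {
> >     // The existing check: no update is attempted before metadata.max.age.ms
> >     // elapses or an update is explicitly requested.
> >     long timeToNextUpdate = metadata.timeToNextUpdate(now);
> >     if (timeToNextUpdate > 0)
> >         return timeToNextUpdate;
> >
> >     // leastLoadedNode returns null only when no known node has an established
> >     // or establishable connection. Note: with a very low reconnect.backoff.max.ms
> >     // it may effectively never return null, so the rebootstrap would not trigger.
> >     Node node = leastLoadedNode(now);
> >     if (node == null && rebootstrapEnabled) {
> >         // The proposed hook: forget the discovered nodes and fall back on the
> >         // original bootstrap addresses, as if the client had just been initialized.
> >         metadata.rebootstrap();
> >         node = leastLoadedNode(now);
> >     }
> >     if (node == null)
> >         return reconnectBackoffMs; // still nothing to connect to; back off as before
> >
> >     return sendMetadataRequest(now, node); // placeholder for the existing update logic
> > }
> >
> > This is only to show where the hook would sit relative to the existing backoff; the actual wiring would follow whatever NetworkClient already does.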
> >
> > [1] https://github.com/apache/kafka/blob/c9a42c85e2c903329b3550181d230527e90e3646/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L698
> > [2] https://github.com/apache/kafka/blob/c9a42c85e2c903329b3550181d230527e90e3646/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L1034-L1041
> >
> > Best,
> > Ivan
> >
> >
> > On Tue, 21 Feb 2023 at 20:07, Chris Egerton <ch...@aiven.io.invalid> wrote:
> >
> >> Hi Ivan,
> >>
> >> I believe the logic you've linked is only applicable for the producer and consumer clients; the admin client does something different (see [1]).
> >>
> >> Either way, it'd be nice to have a definition of when re-bootstrapping would occur that doesn't rely on internal implementation details. What user-visible phenomena can we identify that would lead to a re-bootstrapping? I also believe that if someone has "reconnect.backoff.max.ms" set to a low-enough value, NetworkClient::leastLoadedNode may never return null. In that case, shouldn't we still attempt a re-bootstrap at some point (if the user has enabled this feature)? Would it make sense to re-bootstrap only after "metadata.max.age.ms" has elapsed since the last metadata update, and when at least one request has been made to contact each known server and been met with failure?
> >>
> >> [1] - https://github.com/apache/kafka/blob/c9a42c85e2c903329b3550181d230527e90e3646/clients/src/main/java/org/apache/kafka/clients/admin/internals/AdminMetadataManager.java#L100
> >>
> >> Cheers,
> >>
> >> Chris
> >>
> >> On Sun, Feb 19, 2023 at 3:39 PM Ivan Yurchenko <iv...@gmail.com> wrote:
> >>
> >> > Hi Chris,
> >> >
> >> > Thank you for your question. As part of various lifecycle phases (including node disconnect), NetworkClient can request a metadata update eagerly (the `Metadata.requestUpdate` method), which results in `MetadataUpdater.maybeUpdate` being called during the next poll. Inside, it has a way to find a known node it can connect to for the fresh metadata. If no such node is found, it backs off. (Code [1].) I'm thinking of piggybacking on this logic and injecting the rebootstrap attempt before the backoff.
> >> >
> >> > As for the second part of your question: re-bootstrapping means replacing the node addresses in the client with the original bootstrap addresses, so if the first bootstrap attempt fails, the client will continue using the bootstrap addresses until success -- pretty much as if it had been recreated from scratch.
> >> >
> >> > Best,
> >> > Ivan
> >> >
> >> > [1] https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L1045-L1049
> >> >
> >> > On Thu, 16 Feb 2023 at 17:18, Chris Egerton <ch...@aiven.io.invalid> wrote:
> >> >
> >> > > Hi Ivan,
> >> > >
> >> > > I'm not very familiar with the clients side of things but the proposal seems reasonable.
> >> > >
> >> > > I like the flexibility of the "metadata.recovery.strategy" property as a string instead of, e.g., a "rebootstrap.enabled" boolean. We may want to adopt a different approach in the future, like the background thread mentioned in the rejected alternatives section.
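> >> > >
> >> > > To illustrate what I mean, something along these lines on the client side (the values "rebootstrap"/"none" are only placeholders for whatever the KIP ends up defining):
> >> > >
> >> > > import java.util.Properties;
> >> > >
> >> > > Properties props = new Properties();
> >> > > props.put("bootstrap.servers", "broker1:9092,broker2:9092");
> >> > > // A string-valued strategy can grow new values later (e.g. a background
> >> > > // refresh thread) without introducing a second config.
> >> > > props.put("metadata.recovery.strategy", "rebootstrap");
> >> > > // versus the less extensible boolean alternative:
> >> > > // props.put("rebootstrap.enabled", "true");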
> >> > >
> >> > > I also like handling this via a configuration property instead of adding a Java-level API or suggesting that users re-instantiate their clients, since we may want to enable this new behavior by default in the future, and it also reduces the level of effort required for users to benefit from this improvement.
> >> > >
> >> > > One question I have--that may have an obvious answer to anyone more familiar with client internals--is under which conditions we will determine a rebootstrap is appropriate. Taking the admin client as an example, the "default.api.timeout.ms" property gives us a limit on the time an operation will be allowed to take before it completes or fails (with optional per-request overrides in the various *Options classes), and the "request.timeout.ms" property gives us a limit on the time each request issued for that operation will be allowed to take before it completes, is retried, or causes the operation to fail (if no more retries can be performed). If all of the known servers (i.e., bootstrap servers for the first operation, or discovered brokers if bootstrapping has already been completed) are unavailable, the admin client will keep (re)trying to fetch metadata until the API timeout is exhausted, issuing multiple requests to the same server if necessary. When would a re-bootstrapping occur here? Ideally we could find some approach that minimizes false positives (where a re-bootstrapping is performed even though the current set of known brokers is only temporarily unavailable, as opposed to permanently moved). Of course, given the opt-in nature of the re-bootstrapping feature, we can always shoot for "good enough" on that front, but it'd be nice to understand some of the potential pitfalls of enabling it.
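> >> > >
> >> > > For concreteness, the kind of setup I'm describing (the timeout values here are made up purely for illustration):
> >> > >
> >> > > import java.util.Properties;
> >> > > import org.apache.kafka.clients.admin.AdminClient;
> >> > >
> >> > > Properties props = new Properties();
> >> > > props.put("bootstrap.servers", "broker1:9092,broker2:9092");
> >> > > props.put("default.api.timeout.ms", "60000"); // overall budget per operation
> >> > > props.put("request.timeout.ms", "15000");     // budget per individual request
> >> > > try (AdminClient admin = AdminClient.create(props)) {
> >> > >     // If every known broker is unreachable, metadata fetches are retried
> >> > >     // (possibly against the same broker) for up to ~60 s -- roughly four
> >> > >     // ~15 s requests -- before the operation fails. The open question is
> >> > >     // where in that window a re-bootstrap should be attempted.
> >> > >     admin.describeCluster().nodes().get();
> >> > > }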
> >> > >
> >> > > Following up on the above, would we cache the need to perform a re-bootstrap across separate operations? For example, if I try to describe a cluster, then a re-bootstrapping takes place and fails, and then I try to describe the cluster a second time. With that second attempt, would we immediately resort to the bootstrap servers for any initial metadata updates, or would we still try to go through the last-known set of brokers first?
> >> > >
> >> > > Cheers,
> >> > >
> >> > > Chris
> >> > >
> >> > > On Mon, Feb 6, 2023 at 4:32 AM Ivan Yurchenko <ivan0yurche...@gmail.com> wrote:
> >> > >
> >> > > > Hi!
> >> > > >
> >> > > > There doesn't seem to be much more discussion going on, so I'm planning to start the vote in a couple of days.
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Ivan
> >> > > >
> >> > > > On Wed, 18 Jan 2023 at 12:06, Ivan Yurchenko <ivan0yurche...@gmail.com> wrote:
> >> > > >
> >> > > > > Hello!
> >> > > > > I would like to start the discussion thread on KIP-899: Allow clients to rebootstrap. This KIP proposes to allow Kafka clients to repeat the bootstrap process when fetching metadata if none of the known nodes are available.
> >> > > > >
> >> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-899%3A+Allow+clients+to+rebootstrap
> >> > > > >
> >> > > > > A question right away: should we eventually change the default behavior, or can it remain configurable "forever"? The latter is proposed in the KIP.
> >> > > > >
> >> > > > > Thank you!
> >> > > > >
> >> > > > > Ivan