Hello!

I changed the KIP a bit, specifying that the benefit is certain for consumers not participating in a group, but that other clients can benefit as well in certain situations.
You can see the changes in the history [1]

Thank you!

Ivan

[1] https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=240881396&originalVersion=10&revisedVersion=11

On 2023/07/15 16:37:52 Ivan Yurchenko wrote:
> Hello!
>
> I've made several changes to the KIP based on the comments:
>
> 1. Reduced the scope to producer and consumer clients only.
> 2. Added more details to the description of the rebootstrap process.
> 3. Documented the role of low values of reconnect.backoff.max.ms in preventing rebootstrapping.
> 4. Some wording changes.
>
> You can see the changes in the history [1]
>
> I'm planning to put the KIP to a vote in a few days if there are no new comments.
>
> Thank you!
>
> Ivan
>
> [1] https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=240881396&selectedPageVersions=9&selectedPageVersions=5
>
> On Tue, 30 May 2023 at 08:23, Ivan Yurchenko <iv...@gmail.com> wrote:
>
> > Hi Chris and all,
> >
> > > I believe the logic you've linked is only applicable for the producer and consumer clients; the admin client does something different (see [1]).
> >
> > I see, thank you for the pointer. It seems the admin client is fairly different from the producer and consumer. It probably makes sense to reduce the scope of the KIP to the producer and consumer clients only.
> >
> > > it'd be nice to have a definition of when re-bootstrapping would occur that doesn't rely on internal implementation details. What user-visible phenomena can we identify that would lead to a re-bootstrapping?
> >
> > Let's put it this way: "Re-bootstrapping means that the client forgets the nodes it knows about and falls back on the bootstrap nodes as if it had just been initialized. Re-bootstrapping happens when, during a metadata update (which may be scheduled by `metadata.max.age.ms` or caused by certain error responses like NOT_LEADER_OR_FOLLOWER, REPLICA_NOT_AVAILABLE, etc.), the client doesn't have a node with an established or establishable connection."
> > Does this sound good?
> >
> > > I also believe that if someone has "reconnect.backoff.max.ms" set to a low-enough value, NetworkClient::leastLoadedNode may never return null. In that case, shouldn't we still attempt a re-bootstrap at some point (if the user has enabled this feature)?
> >
> > Yes, you're right. In particular, `canConnect` here [1] can always return `true` if `reconnect.backoff.max.ms` is low enough. It seems pretty difficult to find a good criterion for when re-bootstrapping should be forced in this case, so it would be difficult to configure and reason about. I think it's worth mentioning in the KIP and later in the documentation, but we should not try to do anything special here.
> >
> > > Would it make sense to re-bootstrap only after "metadata.max.age.ms" has elapsed since the last metadata update, and when at least one request has been made to contact each known server and been met with failure?
> >
> > The first condition is satisfied by the check at the beginning of `maybeUpdate` [2]. It seems to me that the second one is also satisfied by `leastLoadedNode`. Admittedly, it's more relaxed than what you propose: it tracks the unavailability of nodes as detected by all types of requests, not only by metadata requests.
> > What do you think, would this be enough?
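> >
> > To make this concrete, here is a rough sketch of the shape I have in mind, injected into the metadata update path. The names (`rebootstrapEnabled`, `metadata.rebootstrap()`, `sendMetadataRequest`) are illustrative only, not the actual NetworkClient/Metadata code:
> >
> > long maybeUpdate(long now) {
> >     // The existing check: no update is attempted before metadata.max.age.ms
> >     // elapses or an update is explicitly requested.
> >     long timeToNextUpdate = metadata.timeToNextUpdate(now);
> >     if (timeToNextUpdate > 0)
> >         return timeToNextUpdate;
> >
> >     // leastLoadedNode returns null only when no known node has an established
> >     // or establishable connection. Note: with a very low reconnect.backoff.max.ms
> >     // it may effectively never return null, so the rebootstrap would not trigger.
> >     Node node = leastLoadedNode(now);
> >     if (node == null && rebootstrapEnabled) {
> >         // The proposed hook: forget the discovered nodes and fall back on the
> >         // original bootstrap addresses, as if the client had just been initialized.
> >         metadata.rebootstrap();
> >         node = leastLoadedNode(now);
> >     }
> >     if (node == null)
> >         return reconnectBackoffMs; // still nothing to connect to; back off as before
> >
> >     return sendMetadataRequest(now, node); // placeholder for the existing update logic
> > }
> >
> > This is only to show where the hook would sit relative to the existing backoff; the actual wiring would follow whatever NetworkClient already does.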
> >
> > [1] https://github.com/apache/kafka/blob/c9a42c85e2c903329b3550181d230527e90e3646/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L698
> > [2] https://github.com/apache/kafka/blob/c9a42c85e2c903329b3550181d230527e90e3646/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L1034-L1041
> >
> > Best,
> > Ivan
> >
> >
> > On Tue, 21 Feb 2023 at 20:07, Chris Egerton <ch...@aiven.io.invalid> wrote:
> >
> >> Hi Ivan,
> >>
> >> I believe the logic you've linked is only applicable for the producer and consumer clients; the admin client does something different (see [1]).
> >>
> >> Either way, it'd be nice to have a definition of when re-bootstrapping would occur that doesn't rely on internal implementation details. What user-visible phenomena can we identify that would lead to a re-bootstrapping? I also believe that if someone has "reconnect.backoff.max.ms" set to a low-enough value, NetworkClient::leastLoadedNode may never return null. In that case, shouldn't we still attempt a re-bootstrap at some point (if the user has enabled this feature)? Would it make sense to re-bootstrap only after "metadata.max.age.ms" has elapsed since the last metadata update, and when at least one request has been made to contact each known server and been met with failure?
> >>
> >> [1] - https://github.com/apache/kafka/blob/c9a42c85e2c903329b3550181d230527e90e3646/clients/src/main/java/org/apache/kafka/clients/admin/internals/AdminMetadataManager.java#L100
> >>
> >> Cheers,
> >>
> >> Chris
> >>
> >> On Sun, Feb 19, 2023 at 3:39 PM Ivan Yurchenko <iv...@gmail.com> wrote:
> >>
> >> > Hi Chris,
> >> >
> >> > Thank you for your question. As part of various lifecycle phases (including node disconnect), NetworkClient can request a metadata update eagerly (the `Metadata.requestUpdate` method), which results in `MetadataUpdater.maybeUpdate` being called during the next poll. Inside, it has a way to find a known node it can connect to for the fresh metadata. If no such node is found, it backs off. (Code [1].) I'm thinking of piggybacking on this logic and injecting the rebootstrap attempt before the backoff.
> >> >
> >> > As for the second part of your question: re-bootstrapping means replacing the node addresses in the client with the original bootstrap addresses, so if the first bootstrap attempt fails, the client will continue using the bootstrap addresses until success -- pretty much as if it had been recreated from scratch.
> >> >
> >> > Best,
> >> > Ivan
> >> >
> >> > [1] https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L1045-L1049
> >> >
> >> > On Thu, 16 Feb 2023 at 17:18, Chris Egerton <ch...@aiven.io.invalid> wrote:
> >> >
> >> > > Hi Ivan,
> >> > >
> >> > > I'm not very familiar with the clients side of things but the proposal seems reasonable.
> >> > >
> >> > > I like the flexibility of the "metadata.recovery.strategy" property as a string instead of, e.g., a "rebootstrap.enabled" boolean. We may want to adopt a different approach in the future, like the background thread mentioned in the rejected alternatives section.
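> >> > >
> >> > > To illustrate what I mean, something along these lines on the client side (the values "rebootstrap"/"none" are only placeholders for whatever the KIP ends up defining):
> >> > >
> >> > > import java.util.Properties;
> >> > >
> >> > > Properties props = new Properties();
> >> > > props.put("bootstrap.servers", "broker1:9092,broker2:9092");
> >> > > // A string-valued strategy can grow new values later (e.g. a background
> >> > > // refresh thread) without introducing a second config.
> >> > > props.put("metadata.recovery.strategy", "rebootstrap");
> >> > > // versus the less extensible boolean alternative:
> >> > > // props.put("rebootstrap.enabled", "true");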
> >> > >
> >> > > I also like handling this via a configuration property instead of adding a Java-level API or suggesting that users re-instantiate their clients, since we may want to enable this new behavior by default in the future, and it also reduces the level of effort required for users to benefit from this improvement.
> >> > >
> >> > > One question I have--that may have an obvious answer to anyone more familiar with client internals--is under which conditions we will determine a rebootstrap is appropriate. Taking the admin client as an example, the "default.api.timeout.ms" property gives us a limit on the time an operation will be allowed to take before it completes or fails (with optional per-request overrides in the various *Options classes), and the "request.timeout.ms" property gives us a limit on the time each request issued for that operation will be allowed to take before it completes, is retried, or causes the operation to fail (if no more retries can be performed). If all of the known servers (i.e., bootstrap servers for the first operation, or discovered brokers if bootstrapping has already been completed) are unavailable, the admin client will keep (re)trying to fetch metadata until the API timeout is exhausted, issuing multiple requests to the same server if necessary. When would a re-bootstrapping occur here? Ideally we could find some approach that minimizes false positives (where a re-bootstrapping is performed even though the current set of known brokers is only temporarily unavailable, as opposed to permanently moved). Of course, given the opt-in nature of the re-bootstrapping feature, we can always shoot for "good enough" on that front, but it'd be nice to understand some of the potential pitfalls of enabling it.
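> >> > >
> >> > > For concreteness, the kind of setup I'm describing (the timeout values here are made up purely for illustration):
> >> > >
> >> > > import java.util.Properties;
> >> > > import org.apache.kafka.clients.admin.AdminClient;
> >> > >
> >> > > Properties props = new Properties();
> >> > > props.put("bootstrap.servers", "broker1:9092,broker2:9092");
> >> > > props.put("default.api.timeout.ms", "60000"); // overall budget per operation
> >> > > props.put("request.timeout.ms", "15000");     // budget per individual request
> >> > > try (AdminClient admin = AdminClient.create(props)) {
> >> > >     // If every known broker is unreachable, metadata fetches are retried
> >> > >     // (possibly against the same broker) for up to ~60 s -- roughly four
> >> > >     // ~15 s requests -- before the operation fails. The open question is
> >> > >     // where in that window a re-bootstrap should be attempted.
> >> > >     admin.describeCluster().nodes().get();
> >> > > }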
> >> > >
> >> > > Following up on the above, would we cache the need to perform a re-bootstrap across separate operations? For example, if I try to describe a cluster, then a re-bootstrapping takes place and fails, and then I try to describe the cluster a second time. With that second attempt, would we immediately resort to the bootstrap servers for any initial metadata updates, or would we still try to go through the last-known set of brokers first?
> >> > >
> >> > > Cheers,
> >> > >
> >> > > Chris
> >> > >
> >> > > On Mon, Feb 6, 2023 at 4:32 AM Ivan Yurchenko <ivan0yurche...@gmail.com> wrote:
> >> > >
> >> > > > Hi!
> >> > > >
> >> > > > There doesn't seem to be much more discussion going on, so I'm planning to start the vote in a couple of days.
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Ivan
> >> > > >
> >> > > > On Wed, 18 Jan 2023 at 12:06, Ivan Yurchenko <ivan0yurche...@gmail.com> wrote:
> >> > > >
> >> > > > > Hello!
> >> > > > > I would like to start the discussion thread on KIP-899: Allow clients to rebootstrap. This KIP proposes to allow Kafka clients to repeat the bootstrap process when fetching metadata if none of the known nodes are available.
> >> > > > >
> >> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-899%3A+Allow+clients+to+rebootstrap
> >> > > > >
> >> > > > > A question right away: should we eventually change the default behavior, or can it remain configurable "forever"? The latter is proposed in the KIP.
> >> > > > >
> >> > > > > Thank you!
> >> > > > >
> >> > > > > Ivan