Hi Ivan,

I believe the logic you've linked applies only to the producer and consumer
clients; the admin client does something different (see [1]).

Either way, it'd be nice to have a definition of when re-bootstrapping would
occur that doesn't rely on internal implementation details. What
user-visible phenomena can we identify that would lead to a
re-bootstrapping?

I also believe that if someone has "reconnect.backoff.max.ms" set to a
low-enough value, NetworkClient::leastLoadedNode may never return null. In
that case, shouldn't we still attempt a re-bootstrap at some point (if the
user has enabled this feature)? Would it make sense to re-bootstrap only
after "metadata.max.age.ms" has elapsed since the last metadata update, and
when at least one request has been made to contact each known server and
been met with failure?

[1] - https://github.com/apache/kafka/blob/c9a42c85e2c903329b3550181d230527e90e3646/clients/src/main/java/org/apache/kafka/clients/admin/internals/AdminMetadataManager.java#L100

Cheers,

Chris

On Sun, Feb 19, 2023 at 3:39 PM Ivan Yurchenko <ivan0yurche...@gmail.com>
wrote:

> Hi Chris,
>
> Thank you for your question. As a part of various lifecycle phases
> (including node disconnect), NetworkClient can request a metadata update
> eagerly (the `Metadata.requestUpdate` method), which results in
> `MetadataUpdater.maybeUpdate` being called during the next poll. Inside,
> it has a way to find a known node it can connect to for the fresh
> metadata. If no such node is found, it backs off (code [1]). I'm thinking
> of piggybacking on this logic and injecting the rebootstrap attempt
> before the backoff.
>
> As for the second part of your question: re-bootstrapping means replacing
> the node addresses in the client with the original bootstrap addresses,
> so if the first bootstrap attempt fails, the client will continue using
> the bootstrap addresses until success -- pretty much as if it were
> recreated from scratch.
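[Editor's note: a minimal sketch of the fallback Ivan describes above. All
names here are illustrative, not the actual NetworkClient/Metadata code:
when no known node is connectable and the feature is enabled, the client
replaces its known nodes with the original bootstrap addresses.]

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only -- not the real Kafka client internals.
class RebootstrapSketch {
    private final List<String> bootstrapAddresses; // from bootstrap.servers
    private List<String> knownNodes;               // last discovered brokers

    RebootstrapSketch(List<String> bootstrapAddresses) {
        this.bootstrapAddresses = List.copyOf(bootstrapAddresses);
        this.knownNodes = new ArrayList<>(bootstrapAddresses);
    }

    // Normal path: a successful metadata response replaces the known nodes.
    void updateMetadata(List<String> discoveredBrokers) {
        this.knownNodes = new ArrayList<>(discoveredBrokers);
    }

    // Proposed injection point: called from the metadata-update path when
    // no known node is connectable. If the feature is enabled, fall back
    // to the original bootstrap addresses, as if the client had been
    // recreated from scratch.
    void maybeRebootstrap(boolean noKnownNodeAvailable, boolean rebootstrapEnabled) {
        if (rebootstrapEnabled && noKnownNodeAvailable) {
            this.knownNodes = new ArrayList<>(bootstrapAddresses);
        }
    }

    List<String> knownNodes() {
        return knownNodes;
    }
}
```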
>
> Best,
> Ivan
>
> [1]
> https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L1045-L1049
>
> On Thu, 16 Feb 2023 at 17:18, Chris Egerton <chr...@aiven.io.invalid>
> wrote:
>
> > Hi Ivan,
> >
> > I'm not very familiar with the clients side of things, but the proposal
> > seems reasonable.
> >
> > I like the flexibility of the "metadata.recovery.strategy" property as
> > a string instead of, e.g., a "rebootstrap.enabled" boolean. We may want
> > to adopt a different approach in the future, like the background thread
> > mentioned in the rejected alternatives section.
> >
> > I also like handling this via a configuration property instead of
> > adding a Java-level API or suggesting that users re-instantiate their
> > clients, since we may want to enable this new behavior by default in
> > the future, and it also reduces the level of effort required for users
> > to benefit from this improvement.
> >
> > One question I have--that may have an obvious answer to anyone more
> > familiar with client internals--is under which conditions we will
> > determine that a rebootstrap is appropriate. Taking the admin client as
> > an example, the "default.api.timeout.ms" property gives us a limit on
> > the time an operation will be allowed to take before it completes or
> > fails (with optional per-request overrides in the various *Options
> > classes), and the "request.timeout.ms" property gives us a limit on the
> > time each request issued for that operation will be allowed to take
> > before it completes, is retried, or causes the operation to fail (if no
> > more retries can be performed). If all of the known servers (i.e., the
> > bootstrap servers for the first operation, or the discovered brokers if
> > bootstrapping has already been completed) are unavailable, the admin
> > client will keep (re)trying to fetch metadata until the API timeout is
> > exhausted, issuing multiple requests to the same server if necessary.
> > When would a re-bootstrapping occur here? Ideally we could find some
> > approach that minimizes false positives (where a re-bootstrapping is
> > performed even though the current set of known brokers is only
> > temporarily unavailable, as opposed to permanently moved). Of course,
> > given the opt-in nature of the re-bootstrapping feature, we can always
> > shoot for "good enough" on that front, but it'd be nice to understand
> > some of the potential pitfalls of enabling it.
> >
> > Following up on the above, would we cache the need to perform a
> > re-bootstrap across separate operations? For example, suppose I try to
> > describe a cluster, then a re-bootstrapping takes place and fails, and
> > then I try to describe the cluster a second time. With that second
> > attempt, would we immediately resort to the bootstrap servers for any
> > initial metadata updates, or would we still try to go through the
> > last-known set of brokers first?
> >
> > Cheers,
> >
> > Chris
> >
> > On Mon, Feb 6, 2023 at 4:32 AM Ivan Yurchenko <ivan0yurche...@gmail.com>
> > wrote:
> >
> > > Hi!
> > >
> > > There seems to be not much more discussion going on, so I'm planning
> > > to start the vote in a couple of days.
> > >
> > > Thanks,
> > >
> > > Ivan
> > >
> > > On Wed, 18 Jan 2023 at 12:06, Ivan Yurchenko
> > > <ivan0yurche...@gmail.com> wrote:
> > >
> > > > Hello!
> > > >
> > > > I would like to start the discussion thread on KIP-899: Allow
> > > > clients to rebootstrap. This KIP proposes to allow Kafka clients to
> > > > repeat the bootstrap process when fetching metadata if none of the
> > > > known nodes are available.
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-899%3A+Allow+clients+to+rebootstrap
> > > >
> > > > A question right away: should we eventually change the default
> > > > behavior, or can it remain configurable "forever"? The latter is
> > > > proposed in the KIP.
> > > >
> > > > Thank you!
> > > >
> > > > Ivan
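[Editor's note: for readers following along, opting in to the proposed
behavior would look roughly like the sketch below. This is based on the
"metadata.recovery.strategy" property named in the thread; the exact value
names are an assumption and could change before the KIP is voted on.]

```properties
# Hypothetical client configuration sketch for KIP-899 (subject to change):
bootstrap.servers=broker1:9092,broker2:9092
# "none" keeps today's behavior; "rebootstrap" re-runs the bootstrap
# process when none of the known nodes are available.
metadata.recovery.strategy=rebootstrap
```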