I put up PR 15441 <https://github.com/apache/kafka/pull/15441> with the immediate fix for the KafkaAdminClient to help get this conversation started. I would appreciate feedback on it.
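
As a rough illustration of the idea behind the immediate fix (this is not the code in the PR), the sketch below treats a metadata response with an empty broker set as a retriable condition instead of handing it to the caller. It uses no Kafka classes: MetadataSnapshot, UninitializedMetadataError, requireInitialized and fetchWithRetry are hypothetical names made up for this example, and the real client would work with MetadataResponse and a RetriableException such as StaleMetadataException.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

public class EmptyBrokersRetrySketch {

    /** Hypothetical stand-in for a metadata response; holds only what this sketch needs. */
    static final class MetadataSnapshot {
        final List<String> brokers;
        final List<String> topics;

        MetadataSnapshot(List<String> brokers, List<String> topics) {
            this.brokers = brokers;
            this.topics = topics;
        }
    }

    /** Hypothetical retriable error, playing the role StaleMetadataException plays in the client. */
    static class UninitializedMetadataError extends RuntimeException {
        UninitializedMetadataError(String message) {
            super(message);
        }
    }

    /** Rejects a response whose broker set is empty, i.e. one from a broker with no metadata yet. */
    static MetadataSnapshot requireInitialized(MetadataSnapshot response) {
        if (response.brokers.isEmpty()) {
            throw new UninitializedMetadataError("metadata response contained no brokers");
        }
        return response;
    }

    /** Calls fetch up to maxAttempts times, retrying only on the empty-brokers condition. */
    static MetadataSnapshot fetchWithRetry(Supplier<MetadataSnapshot> fetch, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        UninitializedMetadataError last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return requireInitialized(fetch.get());
            } catch (UninitializedMetadataError e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt saw an empty broker set
    }

    public static void main(String[] args) {
        // Simulated cluster: the first reply comes from a freshly restarted broker.
        Iterator<MetadataSnapshot> replies = List.of(
                new MetadataSnapshot(List.of(), List.of()),
                new MetadataSnapshot(List.of("broker-1:9092"), List.of("orders"))
        ).iterator();

        MetadataSnapshot result = fetchWithRetry(replies::next, 3);
        System.out.println(result.topics); // prints [orders]
    }
}
```

The sketch retries immediately and against the same supplier; an actual client would presumably back off and might pick a different node, which is left out here to keep the check itself visible.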

Thanks,
Chris

On Fri, Feb 23, 2024 at 3:41 PM Chris Wildman <cwild...@newrelic.com> wrote:

> Hi All,
>
> I recently discovered a race condition where Kafka clients may request
> metadata from brokers that have not yet received a snapshot of the
> cluster metadata. At scale with ZK managed clusters we see this happen
> frequently when brokers are restarted, possibly due to the admin client's
> preference for the least loaded node. This results in unexpected behavior
> for some metadata requests on the Admin API. For example, describeTopics
> will fail with an UnknownTopicOrPartition exception for topics which do
> exist in the cluster.
>
> I also learned that consumers and producers ignore uninitialized metadata
> by detecting if the set of brokers in a metadata response is empty:
> https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L1187-L1191
>
> The only Admin API that has this protection seems to be listConsumerGroups,
> which throws a StaleMetadataException for the empty brokers condition:
> https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/admin/KafkaAdminClient.java#L3369-L3371
>
> I'm interested in making it clearer to users of the Admin API when they
> should retry a metadata request because it was handled by a broker with
> uninitialized metadata. I think the proper fix here would be to have
> brokers respond with an UninitializedMetadataException when handling
> metadata requests if they haven't yet received a metadata snapshot. That is
> a big change that would need to be handled appropriately in all clients. A
> more immediate fix would be to change the KafkaAdminClient to *always*
> detect the empty brokers condition when getting a MetadataResponse and
> throw a StaleMetadataException or some other RetriableException.
>
> Some questions I have:
>
> 1. Does the likelihood of a broker responding with stale metadata decrease
> significantly or entirely when using KRaft? I can understand not fixing
> this if that is the case. I tried but could not reproduce this behavior
> using Kafka integration tests for both ZK and KRaft.
>
> 2. Do we want to go for the proper fix or the more immediate one?
>
> 3. Would the immediate fix mentioned above, patching the KafkaAdminClient,
> require a KIP or should I just PR that?
>
> 4. Is StaleMetadataException the exception we want to use for the
> uninitialized metadata case? From the docs for both StaleMetadataException
> and InvalidMetadataException it seems more geared toward old data, not
> uninitialized data.
>
> Thanks for your time and I hope this is an appropriate discussion!
>
> Chris Wildman