I put up PR 15441 <https://github.com/apache/kafka/pull/15441> with the
immediate fix for the KafkaAdminClient to help get this conversation
started. I would appreciate some feedback on that.

Thanks,
Chris

On Fri, Feb 23, 2024 at 3:41 PM Chris Wildman <cwild...@newrelic.com> wrote:

> Hi All,
>
> I recently discovered a race condition in which Kafka clients may request
> metadata from brokers that have not yet received a snapshot of the
> cluster metadata. At scale with ZK-managed clusters we see this happen
> frequently when brokers are restarted, possibly due to the admin client's
> preference for the least loaded node. This results in unexpected behavior
> for some metadata requests on the Admin API. For example, describeTopics
> will fail with an UnknownTopicOrPartition exception for topics which do
> exist in the cluster.
>
> I also learned that consumers and producers ignore uninitialized metadata
> by detecting if the set of brokers in a metadata response is empty:
> https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L1187-L1191
>
> The only Admin API that has this protection seems to be listConsumerGroups
> which throws a StaleMetadataException for the empty brokers condition:
> https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/admin/KafkaAdminClient.java#L3369-L3371
>
> I'm interested in making it clearer to users of the Admin API when they
> should retry a metadata request because it was handled by a broker with
> uninitialized metadata. I think the proper fix here would be to have
> brokers respond with an UninitializedMetadataException when handling
> metadata requests if they haven't yet received a metadata snapshot. That is
> a big change that would need to be handled appropriately in all clients. A
> more immediate fix would be to change the KafkaAdminClient to *always*
> detect the empty brokers condition when getting a MetadataResponse and
> throw a StaleMetadataException or some other RetriableException.
>
> Some questions I have:
>
> 1. Does the likelihood of a broker responding with stale metadata decrease
> significantly or entirely when using KRaft? I can understand not fixing
> this if that is the case. I tried but could not reproduce this behavior
> using Kafka integration tests for both ZK and KRaft.
>
> 2. Do we want to go for the proper fix or the more immediate one?
>
> 3. Would the immediate fix mentioned above, patching the KafkaAdminClient,
> require a KIP or should I just PR that?
>
> 4. Is StaleMetadataException the exception we want to use for the
> uninitialized metadata case? From the docs for both StaleMetadataException
> and InvalidMetadataException it seems more geared toward old data, not
> uninitialized data.
>
> Thanks for your time and I hope this is an appropriate discussion!
>
> Chris Wildman
>
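
For anyone following along, the immediate fix described in the quoted message (treat a metadata response with an empty broker list as uninitialized and surface a retriable error) could be sketched roughly as below. This is only an illustration under stated assumptions, not the code in the PR: MetadataSnapshot, UninitializedMetadataError, requireInitialized and fetchWithRetry are hypothetical names that stand in for Kafka's MetadataResponse, StaleMetadataException and the admin client's retry handling, and the sketch uses only the Java standard library.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch; none of these types are part of the Kafka clients library.
class EmptyBrokersRetrySketch {

    /** Stand-in for a metadata response: the brokers and topics it reports. */
    static final class MetadataSnapshot {
        final List<String> brokers;
        final List<String> topics;

        MetadataSnapshot(List<String> brokers, List<String> topics) {
            this.brokers = brokers;
            this.topics = topics;
        }
    }

    /** Stand-in for a retriable exception such as StaleMetadataException. */
    static final class UninitializedMetadataError extends RuntimeException {
        UninitializedMetadataError(String message) {
            super(message);
        }
    }

    /** Rejects a response with no brokers, on the assumption that the broker had no metadata yet. */
    static MetadataSnapshot requireInitialized(MetadataSnapshot response) {
        if (response.brokers.isEmpty()) {
            throw new UninitializedMetadataError("Metadata response contained no brokers");
        }
        return response;
    }

    /** Re-issues the fetch until a response passes the check or the attempts run out. */
    static MetadataSnapshot fetchWithRetry(Supplier<MetadataSnapshot> fetch, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        UninitializedMetadataError last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return requireInitialized(fetch.get());
            } catch (UninitializedMetadataError e) {
                last = e; // retriable: a later attempt may reach an initialized broker
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        // Simulate a freshly restarted broker answering first, then an initialized one.
        var responses = List.of(
                new MetadataSnapshot(List.of(), List.of()),
                new MetadataSnapshot(List.of("broker-1"), List.of("orders"))
        ).iterator();
        MetadataSnapshot result = fetchWithRetry(responses::next, 3);
        System.out.println(result.topics);
    }
}
```

The point of the sketch is that the empty-brokers check sits in one place, so every metadata-backed call would get the same retriable signal instead of a misleading "unknown topic" style error. A real client would also back off between attempts and prefer a different node on retry, which is omitted here.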
