Jason Gustafson created KAFKA-9261:
--------------------------------------

             Summary: NPE when updating client metadata
                 Key: KAFKA-9261
                 URL: https://issues.apache.org/jira/browse/KAFKA-9261
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson
            Assignee: Jason Gustafson


We have seen the following exception recently:

{code}
java.lang.NullPointerException
        at java.base/java.util.Objects.requireNonNull(Objects.java:221)
        at org.apache.kafka.common.Cluster.<init>(Cluster.java:134)
        at org.apache.kafka.common.Cluster.<init>(Cluster.java:89)
        at org.apache.kafka.clients.MetadataCache.computeClusterView(MetadataCache.java:120)
        at org.apache.kafka.clients.MetadataCache.<init>(MetadataCache.java:82)
        at org.apache.kafka.clients.MetadataCache.<init>(MetadataCache.java:58)
        at org.apache.kafka.clients.Metadata.handleMetadataResponse(Metadata.java:325)
        at org.apache.kafka.clients.Metadata.update(Metadata.java:252)
        at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1059)
        at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:845)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:548)
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262)
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
        at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1281)
        at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1225)
        at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1201)
{code}

The client assumes that if a leader is included in the metadata response, then 
node information for that leader must also be available. There are at least a 
couple of possible reasons this assumption can fail:

1. The client is able to detect stale partition metadata using the available 
leader epoch information. If stale partition metadata is detected, the client 
ignores it and uses the last known metadata. However, it cannot detect stale 
broker information and will always accept the latest update. This means that 
the latest metadata may be a mix of multiple metadata responses, and therefore 
the invariant is not guaranteed to hold.
2. On the broker, there is no lock which protects both the partition metadata 
and the set of live brokers while a Metadata request is being handled. This 
means an UpdateMetadata request can arrive concurrently and break the intended 
invariant.
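
Case 1 can be illustrated with a small, hypothetical model of the client-side merge. The class and method names below (MetadataMergeSketch, applyBrokers, applyPartition, leaderIsLive) are assumptions made for illustration and are not Kafka's actual classes. The sketch only shows how keeping partition metadata by leader epoch while always replacing the broker set can leave a retained leader that is missing from the live brokers:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the merge described in case 1; not Kafka code.
final class MetadataMergeSketch {
    // Leader node id and leader epoch last accepted for a partition.
    private static final class Leader {
        final int nodeId;
        final int epoch;

        Leader(int nodeId, int epoch) {
            this.nodeId = nodeId;
            this.epoch = epoch;
        }
    }

    private final Map<String, Leader> leaders = new HashMap<>();
    private Set<Integer> liveBrokers = new HashSet<>();

    // Broker information carries no epoch here, so the latest set always wins.
    void applyBrokers(int... brokerIds) {
        Set<Integer> latest = new HashSet<>();
        for (int id : brokerIds) {
            latest.add(id);
        }
        liveBrokers = latest;
    }

    // Partition metadata with an older leader epoch is treated as stale and
    // ignored, so the previously accepted leader is retained.
    void applyPartition(String partition, int leaderId, int leaderEpoch) {
        Leader known = leaders.get(partition);
        if (known == null || leaderEpoch >= known.epoch) {
            leaders.put(partition, new Leader(leaderId, leaderEpoch));
        }
    }

    // The invariant the client relied on: a known leader is among the live brokers.
    boolean leaderIsLive(String partition) {
        Leader leader = leaders.get(partition);
        return leader != null && liveBrokers.contains(leader.nodeId);
    }

    public static void main(String[] args) {
        MetadataMergeSketch metadata = new MetadataMergeSketch();

        // First response: brokers {1, 2}; t-0 is led by node 2 at epoch 5.
        metadata.applyBrokers(1, 2);
        metadata.applyPartition("t-0", 2, 5);
        System.out.println(metadata.leaderIsLive("t-0")); // prints "true"

        // Second response: brokers {1, 3}; its view of t-0 is stale (epoch 4).
        metadata.applyBrokers(1, 3);
        metadata.applyPartition("t-0", 1, 4);
        System.out.println(metadata.leaderIsLive("t-0")); // prints "false"
    }
}
```

After the second response, the retained leader (node 2) comes from the first response while the broker set comes from the second, which is the kind of mix described above.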

It seems case 2 has been possible for a long time, but it should be extremely 
rare. Case 1 was only made possible with KIP-320, which added the leader epoch 
tracking. It should also be rare, but the window for inconsistent metadata is 
probably a bit bigger than the window for a concurrent update.

To fix this, we should make the client more defensive about metadata updates 
and not assume that the leader is among the live endpoints.
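
A minimal sketch of what a more defensive lookup could look like follows. It is not the actual patch: DefensiveLeaderLookup and resolveLeaders are hypothetical names, and nodes are represented as plain "host:port" strings for illustration. The idea is to treat a leader id with no matching live node as "leader unknown" rather than requiring the node to be non-null:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; these names are illustrative and are not Kafka's client classes.
final class DefensiveLeaderLookup {
    private DefensiveLeaderLookup() {}

    // Resolves each partition's leader id against the live nodes. A missing
    // leader id, or one with no matching live node, yields Optional.empty()
    // instead of failing with a NullPointerException.
    static Map<String, Optional<String>> resolveLeaders(
            Map<String, Integer> leaderIdByPartition,
            Map<Integer, String> liveNodeById) {
        Map<String, Optional<String>> resolved = new HashMap<>();
        for (Map.Entry<String, Integer> entry : leaderIdByPartition.entrySet()) {
            Integer leaderId = entry.getValue();
            String node = leaderId == null ? null : liveNodeById.get(leaderId);
            resolved.put(entry.getKey(), Optional.ofNullable(node));
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, Integer> leaders = new HashMap<>();
        leaders.put("t-0", 1);
        leaders.put("t-1", 2); // node 2 is absent from the live nodes below

        Map<Integer, String> liveNodes = new HashMap<>();
        liveNodes.put(1, "broker1:9092");

        Map<String, Optional<String>> resolved = resolveLeaders(leaders, liveNodes);
        System.out.println(resolved.get("t-0").isPresent()); // prints "true"
        System.out.println(resolved.get("t-1").isPresent()); // prints "false"
    }
}
```

With this shape, callers have to handle the empty case explicitly, for example by treating the partition as temporarily leaderless until the next metadata update.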





--
This message was sent by Atlassian Jira
(v8.3.4#803005)
