KRISHNA SARVEPALLI created KAFKA-10791:
------------------------------------------
Summary: Kafka Metadata older epoch problem
Key: KAFKA-10791
URL: https://issues.apache.org/jira/browse/KAFKA-10791
Project: Kafka
Issue Type: Bug
Components: clients
Affects Versions: 2.2.0
Environment: Kubernetes cluster
Reporter: KRISHNA SARVEPALLI
Attachments: Kafka-Client-Issue.png, zookeeper-leader-epoch.png,
zookeeper-state.png
We are running Kafka in production with 5 brokers and 3 ZooKeeper nodes. Both
Kafka and ZooKeeper run in Kubernetes, with storage managed by PVCs backed by
NFS. We are using a topic with 60 partitions.
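For reference, a minimal sketch (hypothetical topic name, against the
plaintext listener) of how a topic of this shape could be created with the
AdminClient:

{code:java}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 60 partitions, replication factor 3 (the broker default below)
            NewTopic topic = new NewTopic("my-topic", 60, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
{code}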
The cluster had been running successfully for almost 50 days since the last
restart. Last week (11/28) two brokers went down; the team is still
investigating the root cause of the broker failures.
Since we are running on Kubernetes, the brokers came back up quickly (in less
than 5 minutes). However, the producer and consumer applications then had
problems fetching metadata. Please check the attached images.
We enabled debug logs for one of the applications, and it appears the Kafka
brokers were returning metadata with a leader_epoch of 0, whereas the client
metadata cache held a value of 6 for most of the partitions.
Eventually we were forced to restart all the producer apps (around 35-40
microservices). After the restart each app fetched the metadata without issue,
since it was a first-time fetch against an empty cache, and was able to
produce messages.
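This symptom is consistent with the client-side leader-epoch validation added
by KIP-320: the client remembers the highest leader_epoch it has seen per
partition and discards metadata responses that carry an older epoch. A minimal
sketch of that check (not Kafka's actual source, just the idea):

{code:java}
import java.util.HashMap;
import java.util.Map;

class LeaderEpochCache {
    private final Map<String, Integer> lastSeenEpoch = new HashMap<>();

    /** Accept new metadata only if its epoch is at least the cached one. */
    boolean maybeUpdate(String topicPartition, int leaderEpoch) {
        Integer cached = lastSeenEpoch.get(topicPartition);
        if (cached != null && leaderEpoch < cached) {
            return false; // response looks stale; keep the cached leader info
        }
        lastSeenEpoch.put(topicPartition, leaderEpoch);
        return true;
    }
}
{code}

With leader_epoch reset to 0 on the brokers, every response fails this check
for a client that has already cached epoch 6, so the stale view persists until
the process restarts with an empty cache.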
As part of troubleshooting, we inspected the znodes that Kafka registers in
ZooKeeper and found that leader_epoch had been reset to 0 for almost all
partitions of this topic. For another topic used by other apps, the
leader_epoch values were intact and the ctime and mtime were updated
correctly. Please check the attached screenshots.
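For anyone reproducing the check, a minimal sketch (hypothetical topic and
partition, using the plain ZooKeeper Java client) that reads the partition
state znode; its JSON payload contains the leader and leader_epoch fields
shown in the screenshots:

{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class PartitionStateCheck {
    public static void main(String[] args) throws Exception {
        // Same connect string and session timeout as the broker overrides below
        ZooKeeper zk = new ZooKeeper("zookeeper:2181", 6000, event -> { });
        // Kafka keeps per-partition state (leader, leader_epoch, isr, ...) here
        byte[] data = zk.getData("/brokers/topics/my-topic/partitions/0/state",
                                 false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}
{code}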
Please refer to the Stack Overflow question we posted:
https://stackoverflow.com/questions/65055299/kafka-producer-not-able-to-download-refresh-metadata-after-brokers-were-restar
+*Broker Configs:*+
--override zookeeper.connect=zookeeper:2181
--override advertised.listeners=PLAINTEXT://kafka,SASL_SSL://kafka
--override log.dirs=/opt/kafka/data/logs
--override broker.id=kafka
--override num.network.threads=3
--override num.io.threads=8
--override default.replication.factor=3
--override auto.create.topics.enable=true
--override delete.topic.enable=true
--override socket.send.buffer.bytes=102400
--override socket.receive.buffer.bytes=102400
--override socket.request.max.bytes=104857600
--override num.partitions=30
--override num.recovery.threads.per.data.dir=1
--override offsets.topic.replication.factor=3
--override transaction.state.log.replication.factor=3
--override transaction.state.log.min.isr=1
--override log.retention.hours=48
--override log.segment.bytes=1073741824
--override log.retention.check.interval.ms=300000
--override zookeeper.connection.timeout.ms=6000
--override confluent.support.metrics.enable=true
--override group.initial.rebalance.delay.ms=0
--override confluent.support.customer.id=anonymous
--override ssl.truststore.location=kafka.broker.truststore.jks
--override ssl.truststore.password=changeit
--override ssl.keystore.location=kafka.broker.keystore.jks
--override ssl.keystore.password=changeit
--override ssl.keystore.type=PKCS12
--override ssl.key.password=changeit
--override listeners=SASL_SSL://0.0.0.0:9093,PLAINTEXT://0.0.0.0:9092
--override authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer
--override ssl.endpoint.identification.algorithm=
--override ssl.client.auth=requested
--override sasl.enabled.mechanisms=SCRAM-SHA-512
--override sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
--override security.inter.broker.protocol=SASL_SSL
--override super.users=test:test
--override zookeeper.set.acl=false
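For completeness, a minimal sketch (hypothetical truststore file name, reusing
the redacted credentials from the overrides above) of producer settings that
match the SASL_SSL listener on port 9093:

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9093");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "SCRAM-SHA-512");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.scram.ScramLoginModule required "
            + "username=\"test\" password=\"changeit\";");
        props.put("ssl.truststore.location", "kafka.client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(
                props, new StringSerializer(), new StringSerializer())) {
            // producer.send(...) as usual
        }
    }
}
{code}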
--
This message was sent by Atlassian Jira
(v8.3.4#803005)