[ https://issues.apache.org/jira/browse/KAFKA-13407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443781#comment-17443781 ]
Edoardo Comar commented on KAFKA-13407: --------------------------------------- [~vinsonZhang] are you suggesting loss of connection with ZK should not cause a controller to resign? > Kafka controller out of service after ZK leader restart > ------------------------------------------------------- > > Key: KAFKA-13407 > URL: https://issues.apache.org/jira/browse/KAFKA-13407 > Project: Kafka > Issue Type: Bug > Affects Versions: 2.8.0, 2.8.1 > Environment: Ubuntu 20.04 > Reporter: Daniel > Priority: Critical > > When the Zookeeper leader disappears, a new instance becomes the leader, the > instances need to reconnect to Zookeeper, but the Kafka "Controller" gets > lost in limbo state after re-establishing connection. > See below for how I manage to reproduce this over and over. > *Prerequisites* > Have a Kafka cluster with 3 instances running version 2.8.1. Figure out which > one is the Controller. I'm using Kafkacat 1.5.0 and get this info using the > `-L` flag. > Zookeeper runs with 3 instances on version 3.5.9. Figure out which one is > leader by checking > > {code:java} > echo stat | nc -v localhost 2181 > {code} > > > *Reproduce* > 1. Stop the leader Zookeeper service. > 2. Watch the logs of the Kafka Controller and ensure that it reconnects and > registers again. > > {code:java} > Oct 27 09:13:08 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:08,882] INFO > Unable to read additional data from server sessionid 0x1f2a12870003, likely > server has closed socket, closing socket connection and attempting reconnect > (org.apache.zookeeper.ClientCnxn) > Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] WARN > SASL configuration failed: javax.security.auth.login.LoginException: No JAAS > configuration section named 'Client' was found in specified JAAS > configuration file: '/opt/kafka/config/kafka_server_jaas.conf'. Will continue > connection to Zookeeper server without SASL authentication, if Zookeeper > server allows it. (org.apache.zookeeper.ClientCnxn) > Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] INFO > Opening socket connection to server > zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181 > (org.apache.zookeeper.ClientCnxn) > Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] ERROR > [ZooKeeperClient Kafka server] Auth failed. (kafka.zookeeper.ZooKeeperClient) > Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,549] INFO > Socket connection established, initiating session, client: > /10.10.85.215:39338, server: > zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181 > (org.apache.zookeeper.ClientCnxn) > Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,569] INFO > Session establishment complete on server > zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181, > sessionid = 0x1f2a12870003, negotiated timeout = 18000 > (org.apache.zookeeper.ClientCnxn) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,548] INFO > [ZooKeeperClient Kafka server] Reinitializing due to auth failure. > (kafka.zookeeper.ZooKeeperClient) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,550] INFO > [PartitionStateMachine controllerId=1003] Stopped partition state machine > (kafka.controller.ZkPartitionStateMachine) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,550] INFO > [ReplicaStateMachine controllerId=1003] Stopped replica state machine > (kafka.controller.ZkReplicaStateMachine) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO > [RequestSendThread controllerId=1003] Shutting down > (kafka.controller.RequestSendThread) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO > [RequestSendThread controllerId=1003] Stopped > (kafka.controller.RequestSendThread) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO > [RequestSendThread controllerId=1003] Shutdown completed > (kafka.controller.RequestSendThread) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO > [RequestSendThread controllerId=1003] Shutting down > (kafka.controller.RequestSendThread) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO > [RequestSendThread controllerId=1003] Stopped > (kafka.controller.RequestSendThread) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO > [RequestSendThread controllerId=1003] Shutdown completed > (kafka.controller.RequestSendThread) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO > [RequestSendThread controllerId=1003] Shutting down > (kafka.controller.RequestSendThread) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO > [RequestSendThread controllerId=1003] Stopped > (kafka.controller.RequestSendThread) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO > [RequestSendThread controllerId=1003] Shutdown completed > (kafka.controller.RequestSendThread) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,556] INFO > Processing notification(s) to /config/changes > (kafka.common.ZkNodeChangeNotificationListener) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,557] INFO > [Controller id=1003] Resigned (kafka.controller.KafkaController) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,557] INFO > Creating /brokers/ids/1003 (is it secure? false) (kafka.zk.KafkaZkClient) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,560] INFO > Updated cache from existing FinalizedFeaturesAndEpoch(features=Features{}, > epoch=1) to latest FinalizedFeaturesAndEpoch(features=Features{}, epoch=1). > (kafka.server.FinalizedFeatureCache) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,572] INFO > Stat of the created znode at /brokers/ids/1003 is: > 128849019044,128849019044,1635323000996,1635323000996,1,0,0,34265559924739,389,0,128849019044 > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: (kafka.zk.KafkaZkClient) > Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,572] INFO > Registered broker 1003 at path /brokers/ids/1003 with addresses: > PLAINTEXT://kafka-tr-2.node.consul.lab.aws.blue.example.net:9093,SASL_SSL://kafka-tr-2.node.consul.lab.aws.blue.example.net:9092, > czxid (broker epoch): 128849019044 (kafka.zk.KafkaZkClient) > {code} > > > Now it is in this limbo state but to notice the problem create a topic for > example. > > {code:java} > ./kafka-topics.sh --zookeeper > zookeeper-kafka.service.consul.lab.aws.blue.example.net:2181/kafka_tr > --create --topic danols-ts-4 --partitions 10 --replication-factor 3 > Created topic danols-ts-4. > {code} > > > Then when describing the topic it looks like this: > > {code:java} > ./kafka-topics.sh --zookeeper > zookeeper-kafka.service.consul.lab.aws.blue.example.net:2181/kafka_tr > --describe --topic danols-ts-4 > Topic: danols-ts-4 PartitionCount: 10 ReplicationFactor: 3 > Configs: > Topic: danols-ts-4 Partition: 0 Leader: none Replicas: > 1003,1001,1002 Isr: > Topic: danols-ts-4 Partition: 1 Leader: none Replicas: > 1001,1002,1003 Isr: > Topic: danols-ts-4 Partition: 2 Leader: none Replicas: > 1002,1003,1001 Isr: > Topic: danols-ts-4 Partition: 3 Leader: none Replicas: > 1003,1002,1001 Isr: > Topic: danols-ts-4 Partition: 4 Leader: none Replicas: > 1001,1003,1002 Isr: > Topic: danols-ts-4 Partition: 5 Leader: none Replicas: > 1002,1001,1003 Isr: > Topic: danols-ts-4 Partition: 6 Leader: none Replicas: > 1003,1001,1002 Isr: > Topic: danols-ts-4 Partition: 7 Leader: none Replicas: > 1001,1002,1003 Isr: > Topic: danols-ts-4 Partition: 8 Leader: none Replicas: > 1002,1003,1001 Isr: > Topic: danols-ts-4 Partition: 9 Leader: none Replicas: > 1003,1002,1001 Isr: > {code} > > > If I describe using the --bootstrap-server flag instead it doesn't exist > {code:java} > ./kafka-topics.sh --bootstrap-server localhost:9093 --describe --topic > danols-ts-4 > Error while executing topic command : Topic 'danols-ts-4' does not exist as > expected > [2021-10-27 09:24:15,691] ERROR java.lang.IllegalArgumentException: Topic > 'danols-ts-4' does not exist as expected > at > kafka.admin.TopicCommand$.kafka$admin$TopicCommand$$ensureTopicExists(TopicCommand.scala:542) > at > kafka.admin.TopicCommand$AdminClientTopicService.describeTopic(TopicCommand.scala:317) > at kafka.admin.TopicCommand$.main(TopicCommand.scala:69) > at kafka.admin.TopicCommand.main(TopicCommand.scala) > (kafka.admin.TopicCommand$) > {code} > If I restart the Kafka service on a broker that is *not* "Controller" it > doesn't come up again. It just spams the following in the log for all > different partitions and it's not possible to connect to it. > {code:java} > Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: > org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication > factor: 3 larger than available brokers: 0. > Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: [2021-10-27 09:27:48,984] INFO > [Admin Manager on Broker 1001]: Error processing create topic request > CreatableTopic(name='example-message-consumer', numPartitions=10, > replicationFactor=3, assignments=[], configs=[]) (kafka.server.ZkAdminManager) > Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: > org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication > factor: 3 larger than available brokers: 0. > Oct 27 09:27:49 ip-10-226-69-122 kafka[67342]: [2021-10-27 09:27:49,021] INFO > [Admin Manager on Broker 1001]: Error processing create topic request > CreatableTopic(name='__consumer_offsets', numPartitions=50, > replicationFactor=3, assignments=[], > configs=[CreateableTopicConfig(name='compression.type', value='producer'), > CreateableTopicConfig(name='cleanup.policy', value='compact'), > CreateableTopicConfig(name='segment.bytes', value='104857600')]) > (kafka.server.ZkAdminManager) > {code} > To get out of this state I simply restart the Kafka service on the > "Controller". It will make another become "Controller" and all brokers > becomes OK again and the newly created topic gets all expected ISRs. -- This message was sent by Atlassian Jira (v8.20.1#820001)