[ 
https://issues.apache.org/jira/browse/KAFKA-13407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443353#comment-17443353
 ] 

Vinson Zhang commented on KAFKA-13407:
--------------------------------------

[~Olsson] 
You can try to remove the environment variable 
'java.security.auth.login.config' and declare the JAAS file content in Kafka 
server.properties.

It looks like ZK auth failure triggered the Controller resign.

It works on my local.
Thanks.

> Kafka controller out of service after ZK leader restart
> -------------------------------------------------------
>
>                 Key: KAFKA-13407
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13407
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.8.0, 2.8.1
>         Environment: Ubuntu 20.04
>            Reporter: Daniel
>            Priority: Critical
>
> When the Zookeeper leader disappears, a new instance becomes the leader, the 
> instances need to reconnect to Zookeeper, but the Kafka "Controller" gets 
> lost in limbo state after re-establishing connection.
> See below for how I manage to reproduce this over and over.
> *Prerequisites*
> Have a Kafka cluster with 3 instances running version 2.8.1. Figure out which 
> one is the Controller. I'm using Kafkacat 1.5.0 and get this info using the 
> `-L` flag.
> Zookeeper runs with 3 instances on version 3.5.9. Figure out which one is 
> leader by checking
>  
> {code:java}
> echo stat | nc -v localhost 2181
> {code}
>  
>  
> *Reproduce*
> 1. Stop the leader Zookeeper service.
> 2. Watch the logs of the Kafka Controller and ensure that it reconnects and 
> registers again.
>  
> {code:java}
> Oct 27 09:13:08 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:08,882] INFO 
> Unable to read additional data from server sessionid 0x1f2a12870003, likely 
> server has closed socket, closing socket connection and attempting reconnect 
> (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] WARN 
> SASL configuration failed: javax.security.auth.login.LoginException: No JAAS 
> configuration section named 'Client' was found in specified JAAS 
> configuration file: '/opt/kafka/config/kafka_server_jaas.conf'. Will continue 
> connection to Zookeeper server without SASL authentication, if Zookeeper 
> server allows it. (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] INFO 
> Opening socket connection to server 
> zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181 
> (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] ERROR 
> [ZooKeeperClient Kafka server] Auth failed. (kafka.zookeeper.ZooKeeperClient)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,549] INFO 
> Socket connection established, initiating session, client: 
> /10.10.85.215:39338, server: 
> zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181 
> (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,569] INFO 
> Session establishment complete on server 
> zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181, 
> sessionid = 0x1f2a12870003, negotiated timeout = 18000 
> (org.apache.zookeeper.ClientCnxn)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,548] INFO 
> [ZooKeeperClient Kafka server] Reinitializing due to auth failure. 
> (kafka.zookeeper.ZooKeeperClient)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,550] INFO 
> [PartitionStateMachine controllerId=1003] Stopped partition state machine 
> (kafka.controller.ZkPartitionStateMachine)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,550] INFO 
> [ReplicaStateMachine controllerId=1003] Stopped replica state machine 
> (kafka.controller.ZkReplicaStateMachine)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO 
> [RequestSendThread controllerId=1003] Shutting down 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO 
> [RequestSendThread controllerId=1003] Stopped 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO 
> [RequestSendThread controllerId=1003] Shutdown completed 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO 
> [RequestSendThread controllerId=1003] Shutting down 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO 
> [RequestSendThread controllerId=1003] Stopped 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO 
> [RequestSendThread controllerId=1003] Shutdown completed 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO 
> [RequestSendThread controllerId=1003] Shutting down 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO 
> [RequestSendThread controllerId=1003] Stopped 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO 
> [RequestSendThread controllerId=1003] Shutdown completed 
> (kafka.controller.RequestSendThread)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,556] INFO 
> Processing notification(s) to /config/changes 
> (kafka.common.ZkNodeChangeNotificationListener)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,557] INFO 
> [Controller id=1003] Resigned (kafka.controller.KafkaController)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,557] INFO 
> Creating /brokers/ids/1003 (is it secure? false) (kafka.zk.KafkaZkClient)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,560] INFO 
> Updated cache from existing FinalizedFeaturesAndEpoch(features=Features{}, 
> epoch=1) to latest FinalizedFeaturesAndEpoch(features=Features{}, epoch=1). 
> (kafka.server.FinalizedFeatureCache)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,572] INFO 
> Stat of the created znode at /brokers/ids/1003 is: 
> 128849019044,128849019044,1635323000996,1635323000996,1,0,0,34265559924739,389,0,128849019044
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]:  (kafka.zk.KafkaZkClient)
> Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,572] INFO 
> Registered broker 1003 at path /brokers/ids/1003 with addresses: 
> PLAINTEXT://kafka-tr-2.node.consul.lab.aws.blue.example.net:9093,SASL_SSL://kafka-tr-2.node.consul.lab.aws.blue.example.net:9092,
>  czxid (broker epoch): 128849019044 (kafka.zk.KafkaZkClient)
> {code}
>  
>  
> Now it is in this limbo state but to notice the problem create a topic for 
> example.
>  
> {code:java}
> ./kafka-topics.sh --zookeeper 
> zookeeper-kafka.service.consul.lab.aws.blue.example.net:2181/kafka_tr 
> --create --topic danols-ts-4 --partitions 10 --replication-factor 3
> Created topic danols-ts-4.
> {code}
>  
>  
> Then when describing the topic it looks like this:
>  
> {code:java}
> ./kafka-topics.sh --zookeeper 
> zookeeper-kafka.service.consul.lab.aws.blue.example.net:2181/kafka_tr 
> --describe --topic danols-ts-4
> Topic: danols-ts-4    PartitionCount: 10      ReplicationFactor: 3    
> Configs: 
>       Topic: danols-ts-4      Partition: 0    Leader: none    Replicas: 
> 1003,1001,1002        Isr: 
>       Topic: danols-ts-4      Partition: 1    Leader: none    Replicas: 
> 1001,1002,1003        Isr: 
>       Topic: danols-ts-4      Partition: 2    Leader: none    Replicas: 
> 1002,1003,1001        Isr: 
>       Topic: danols-ts-4      Partition: 3    Leader: none    Replicas: 
> 1003,1002,1001        Isr: 
>       Topic: danols-ts-4      Partition: 4    Leader: none    Replicas: 
> 1001,1003,1002        Isr: 
>       Topic: danols-ts-4      Partition: 5    Leader: none    Replicas: 
> 1002,1001,1003        Isr: 
>       Topic: danols-ts-4      Partition: 6    Leader: none    Replicas: 
> 1003,1001,1002        Isr: 
>       Topic: danols-ts-4      Partition: 7    Leader: none    Replicas: 
> 1001,1002,1003        Isr: 
>       Topic: danols-ts-4      Partition: 8    Leader: none    Replicas: 
> 1002,1003,1001        Isr: 
>       Topic: danols-ts-4      Partition: 9    Leader: none    Replicas: 
> 1003,1002,1001        Isr: 
> {code}
>  
>  
> If I describe using the --bootstrap-server flag instead it doesn't exist
> {code:java}
> ./kafka-topics.sh --bootstrap-server localhost:9093 --describe --topic 
> danols-ts-4
> Error while executing topic command : Topic 'danols-ts-4' does not exist as 
> expected
> [2021-10-27 09:24:15,691] ERROR java.lang.IllegalArgumentException: Topic 
> 'danols-ts-4' does not exist as expected
>       at 
> kafka.admin.TopicCommand$.kafka$admin$TopicCommand$$ensureTopicExists(TopicCommand.scala:542)
>       at 
> kafka.admin.TopicCommand$AdminClientTopicService.describeTopic(TopicCommand.scala:317)
>       at kafka.admin.TopicCommand$.main(TopicCommand.scala:69)
>       at kafka.admin.TopicCommand.main(TopicCommand.scala)
>  (kafka.admin.TopicCommand$)
> {code}
> If I restart the Kafka service on a broker that is *not* "Controller" it 
> doesn't come up again. It just spams the following in the log for all 
> different partitions and it's not possible to connect to it.
> {code:java}
> Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: 
> org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication 
> factor: 3 larger than available brokers: 0.
> Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: [2021-10-27 09:27:48,984] INFO 
> [Admin Manager on Broker 1001]: Error processing create topic request 
> CreatableTopic(name='example-message-consumer', numPartitions=10, 
> replicationFactor=3, assignments=[], configs=[]) (kafka.server.ZkAdminManager)
> Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: 
> org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication 
> factor: 3 larger than available brokers: 0.
> Oct 27 09:27:49 ip-10-226-69-122 kafka[67342]: [2021-10-27 09:27:49,021] INFO 
> [Admin Manager on Broker 1001]: Error processing create topic request 
> CreatableTopic(name='__consumer_offsets', numPartitions=50, 
> replicationFactor=3, assignments=[], 
> configs=[CreateableTopicConfig(name='compression.type', value='producer'), 
> CreateableTopicConfig(name='cleanup.policy', value='compact'), 
> CreateableTopicConfig(name='segment.bytes', value='104857600')]) 
> (kafka.server.ZkAdminManager)
> {code}
> To get out of this state I simply restart the Kafka service on the 
> "Controller". It will make another become "Controller" and all brokers 
> becomes OK again and the newly created topic gets all expected ISRs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to