[jira] [Commented] (KAFKA-13720) Few topic partitions remain under replicated after broker lose connectivity to zookeeper

Luke Chen (Jira) Thu, 16 Jun 2022 20:04:12 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-13720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555363#comment-17555363
 ]


Luke Chen commented on KAFKA-13720:
-----------------------------------

[~dhirendr...@gmail.com], thanks for reporting the issue. I can confirm this 
issue is fixed (indirectly) in Kafka v3.1 and later. The root cause is that 
while the controller is changing, the ISR update request should keep being 
retried until the "real" controller receives it. That is why the log shows 
only one error about "Failed to update ISR to PendingExpandIsr" and no more: 
the request should have kept retrying, but bugs in the code prevented it. 
v3.1 and later do not have this issue. Thanks.
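
The retry behaviour described above can be sketched roughly as follows. This is 
only an illustration, not Kafka's actual code: the class IsrRetrySketch, the 
Controller interface, the simplified Response enum, and sendUntilAccepted are 
all hypothetical names. It assumes the sender looks up the current controller 
again before every attempt, so that a request rejected during a controller 
change eventually reaches the newly elected controller.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch; none of these types are Kafka's real classes.
final class IsrRetrySketch {
    // Simplified response codes for illustration only.
    enum Response { NONE, NOT_CONTROLLER, UNKNOWN_SERVER_ERROR }

    /** Stand-in for whichever broker currently claims to be the controller. */
    interface Controller {
        Response alterIsr(String partition, List<Integer> newIsr);
    }

    /**
     * Looks up the controller again before each attempt and keeps sending
     * until one accepts the update or maxAttempts is exhausted.
     * Returns the number of attempts used, or -1 if it was never accepted.
     */
    static int sendUntilAccepted(Supplier<Controller> currentController,
                                 String partition,
                                 List<Integer> newIsr,
                                 int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Response response = currentController.get().alterIsr(partition, newIsr);
            if (response == Response.NONE) {
                return attempt; // accepted by the active controller
            }
            // Any error falls through to the next attempt. A real
            // implementation would back off and treat error codes separately.
        }
        return -1;
    }

    public static void main(String[] args) {
        // Simulated controller change: two stale controllers, then the real one.
        Deque<Controller> election = new ArrayDeque<>();
        election.add((p, isr) -> Response.UNKNOWN_SERVER_ERROR);
        election.add((p, isr) -> Response.NOT_CONTROLLER);
        election.add((p, isr) -> Response.NONE);
        Supplier<Controller> resolver =
            () -> election.size() > 1 ? election.poll() : election.peek();

        int attempts = sendUntilAccepted(
            resolver, "__consumer_offsets-4", List.of(1, 2), 10);
        System.out.println("accepted after " + attempts + " attempt(s)");
    }
}
```

In this sketch, giving up after the first failed attempt would leave the ISR 
unexpanded, which matches the symptom in the report: a single error in the log 
and a partition that stays under-replicated.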

> Few topic partitions remain under replicated after broker lose connectivity 
> to zookeeper
> ----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-13720
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13720
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 2.7.1
>            Reporter: Dhirendra Singh
>            Priority: Major
>
> A few topic partitions remain under-replicated after brokers lose 
> connectivity to ZooKeeper.
> It only happens when the loss of connectivity to ZooKeeper results in a 
> change of active controller. The issue does not always occur; it appears 
> randomly.
> It never occurs when the active controller does not change while brokers 
> lose connectivity to ZooKeeper.
> I found the following error message in the log file.
> [2022-02-28 04:01:20,217] WARN [Partition __consumer_offsets-4 broker=1] Controller failed to update ISR to PendingExpandIsr(isr=Set(1), newInSyncReplicaId=2) due to unexpected UNKNOWN_SERVER_ERROR. Retrying. (kafka.cluster.Partition)
> [2022-02-28 04:01:20,217] ERROR [broker-1-to-controller] Uncaught error in request completion: (org.apache.kafka.clients.NetworkClient)
> java.lang.IllegalStateException: Failed to enqueue `AlterIsr` request with state LeaderAndIsr(leader=1, leaderEpoch=2728, isr=List(1, 2), zkVersion=4719) for partition __consumer_offsets-4
> at kafka.cluster.Partition.sendAlterIsrRequest(Partition.scala:1403)
> at kafka.cluster.Partition.$anonfun$handleAlterIsrResponse$1(Partition.scala:1438)
> at kafka.cluster.Partition.handleAlterIsrResponse(Partition.scala:1417)
> at kafka.cluster.Partition.$anonfun$sendAlterIsrRequest$1(Partition.scala:1398)
> at kafka.cluster.Partition.$anonfun$sendAlterIsrRequest$1$adapted(Partition.scala:1398)
> at kafka.server.AlterIsrManagerImpl.$anonfun$handleAlterIsrResponse$8(AlterIsrManager.scala:166)
> at kafka.server.AlterIsrManagerImpl.$anonfun$handleAlterIsrResponse$8$adapted(AlterIsrManager.scala:163)
> at scala.collection.immutable.List.foreach(List.scala:333)
> at kafka.server.AlterIsrManagerImpl.handleAlterIsrResponse(AlterIsrManager.scala:163)
> at kafka.server.AlterIsrManagerImpl.responseHandler$1(AlterIsrManager.scala:94)
> at kafka.server.AlterIsrManagerImpl.$anonfun$sendRequest$2(AlterIsrManager.scala:104)
> at kafka.server.BrokerToControllerRequestThread.handleResponse(BrokerToControllerChannelManagerImpl.scala:175)
> at kafka.server.BrokerToControllerRequestThread.$anonfun$generateRequests$1(BrokerToControllerChannelManagerImpl.scala:158)
> at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
> at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:586)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:578)
> at kafka.common.InterBrokerSendThread.doWork(InterBrokerSendThread.scala:71)
> at kafka.server.BrokerToControllerRequestThread.doWork(BrokerToControllerChannelManagerImpl.scala:183)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
>  
> The under-replicated count goes to zero after the controller broker is 
> restarted, but this requires manual intervention.
> The expectation is that when brokers reconnect to ZooKeeper, the cluster 
> should return to a stable state, with an under-replicated count of zero, by 
> itself without any manual intervention.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
