[ https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241314#comment-15241314 ]
Flavio Junqueira commented on KAFKA-3042:
-----------------------------------------

[~junrao] In this comment: https://issues.apache.org/jira/browse/KAFKA-3042?focusedCommentId=15236055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15236055 I showed that broker 5 is the one that sent the LeaderAndIsr request to broker 1, and here: https://issues.apache.org/jira/browse/KAFKA-3042?focusedCommentId=15237383&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15237383 that broker 5 also didn't have broker 4 as a live broker when it sent the request to broker 1. It does sound right that the controller, on failover, should update the list of live brokers on other brokers before sending requests that make them followers; or at least the problem should be transient, in the sense that it could be corrected by a later message. However, for the partition we are analyzing there is the additional problem that controller 5 also didn't have broker 4 in its own list of live brokers.

Interestingly, I also caught an instance of this:

{noformat}
[2016-04-09 00:37:54,111] DEBUG Sending MetadataRequest to Brokers:ArrayBuffer(2, 5)...
[2016-04-09 00:37:54,111] ERROR Haven't been able to send metadata update requests...
[2016-04-09 00:37:54,112] ERROR [Controller 5]: Forcing the controller to resign (kafka.controller.KafkaController)
{noformat}

I don't think this is related, but in another issue we have been wondering about the possible causes of batches in {{ControllerBrokerRequestBatch}} not being empty, and there are a few occurrences of it in these logs. It happens, however, right after the controller resigns, so I'm guessing it is related to the controller shutting down:

{noformat}
[2016-04-09 00:37:54,064] INFO [Controller 5]: Broker 5 resigned as the controller (kafka.controller.KafkaController)
{noformat}

In any case, for this last issue I'll create a JIRA to make sure that we log enough information to identify the problem when it happens. Currently, the exception is propagated, but we never log the cause.

> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-3042
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3042
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>         Environment: jdk 1.7
>                      centos 6.4
>            Reporter: Jiahongchao
>             Fix For: 0.10.0.0
>
>         Attachments: controller.log, server.log.2016-03-23-01, state-change.log
>
>
> Sometimes a broker may repeatedly log
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker considers itself the leader when in fact it is a follower.
> So after several failed tries, it needs to find out who the leader is.
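To make the suggestion in the description concrete, below is a minimal sketch, not Kafka's actual code, of what "stop after several failed tries and find out who the leader is" could look like. The helpers {{readLeaderAndIsrFromZk}} and {{conditionalUpdateIsrInZk}} are hypothetical stand-ins for the real ZooKeeper read and zkVersion-checked write; only the retry/give-up structure is the point.

{noformat}
object UpdateIsrSketch {

  // Partition state as stored in ZooKeeper, including the version used for conditional updates.
  case class LeaderAndIsr(leader: Int, isr: List[Int], zkVersion: Int)

  // Hypothetical stand-in: read the current partition state (and its zkVersion) from ZooKeeper.
  def readLeaderAndIsrFromZk(): LeaderAndIsr =
    LeaderAndIsr(leader = 4, isr = List(4, 2), zkVersion = 55)

  // Hypothetical stand-in: conditional write keyed on the expected zkVersion, mimicking a
  // ZooKeeper setData with a version check; returns (succeeded, newZkVersion).
  def conditionalUpdateIsrInZk(expectedZkVersion: Int, newIsr: List[Int]): (Boolean, Int) =
    (false, expectedZkVersion) // always fails here, to exercise the give-up path

  val maxRetries = 3

  def updateIsr(localBrokerId: Int, cachedZkVersion: Int, newIsr: List[Int]): Unit = {
    var zkVersion = cachedZkVersion
    var attempts  = 0
    var done      = false

    while (!done && attempts < maxRetries) {
      val (succeeded, newVersion) = conditionalUpdateIsrInZk(zkVersion, newIsr)
      if (succeeded) {
        zkVersion = newVersion
        done = true
      } else {
        attempts += 1
        println(s"Cached zkVersion $zkVersion not equal to that in zookeeper, skip updating ISR")
      }
    }

    if (!done) {
      // After repeated conditional-update failures, re-read the partition state instead of
      // retrying forever: the most likely cause is that this broker is no longer the leader.
      val current = readLeaderAndIsrFromZk()
      if (current.leader != localBrokerId)
        println(s"Broker $localBrokerId is no longer the leader (current leader: ${current.leader}); giving up on the ISR update")
      else
        println(s"Still the leader; refreshing cached zkVersion $zkVersion -> ${current.zkVersion} for the next attempt")
    }
  }

  def main(args: Array[String]): Unit =
    updateIsr(localBrokerId = 1, cachedZkVersion = 54, newIsr = List(1, 2))
}
{noformat}

The intuition is that the conditional write can only keep failing if someone else, typically the controller, has bumped the zkVersion, which usually means leadership has moved; re-reading the partition state bounds the retries and lets a stale leader step aside instead of logging the same message indefinitely.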