[
https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246508#comment-15246508
]
Jun Rao edited comment on KAFKA-3042 at 4/18/16 8:53 PM:
---------------------------------------------------------
[~fpj], yes, I was confused by that initially too. The following is my finding
after I dug a bit deeper. When a controller receives a broker change event, it
will also log the live brokers after processing the event. So, in the
controller log of broker 5, you will see the following matching entries.
{code}
[2016-04-09 00:41:35,442] INFO [BrokerChangeListener on Controller 5]: Broker
change listener fired for path /brokers/ids with children 1,4,5
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-04-09 00:41:36,235] INFO [BrokerChangeListener on Controller 5]: Newly
added brokers: , deleted brokers: 3, all live brokers: 1,4,5
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
{code}
However, after
{code}
[2016-04-09 00:37:54,079] INFO [BrokerChangeListener on Controller 5]: Broker
change listener fired for path /brokers/ids with children 1,2,3,5
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
{code}
you don't see the corresponding entry on "Newly added broker" afterward. The
reason is that at 00:37:54, broker 5 has resigned as the controller and
therefore ignored the broker change listener event.
{code}
[2016-04-09 00:37:54,064] INFO [Controller 5]: Broker 5 resigned as the
controller (kafka.controller.KafkaController)
{code}
Broker 5 became the controller again a bit later.
{code}
[2016-04-09 00:40:54,990] INFO [Controller 5]: Controller 5 incremented epoch
to 416 (kafka.controller.KafkaController)
[2016-04-09 00:40:57,877] INFO [Controller 5]: Currently active brokers in the
cluster: Set(1, 3, 4, 5) (kafka.controller.KafkaController)
{code}
As you can see, it initialized with 4 live brokers, including broker 4. It then
sent the following LeaderAndIsrRequest to broker 1 after initialization.
{code}
[2016-04-09 00:40:58,089] TRACE Controller 5 epoch 416 sending become-follower
LeaderAndIsr request (Leader:4,ISR:4,LeaderEpoch:364,ControllerEpoch:415) to
broker 1 for partition [tec1.en2.frontend.syncPing,7] (state.change.logger)
{code}
The above LeaderAndIsrRequest likely includes broker 4 in the live broker list
since there is no other indication that the live broker list has changed since
the initialization. However, if broker 4 is not in the metadata cache of broker
1 (which seems to be the case), it can still lead to the following error.
{code}
[2016-04-09 00:40:58,144] ERROR Broker 1 received LeaderAndIsrRequest with
correlation id 0 from controller 5 epoch 415 for partition
[tec1.en2.frontend.syncPing,7] but cannot become follower since the new leader
4 is unavailable. (state.change.logger)
{code}
was (Author: junrao):
[~fpj], yes, I was confused by that initially too. The following is my finding
after I dug a bit deeper. When a controller receives a broker change event, it
will also log the live brokers after processing the event. So, in the
controller log of broker 5, you will see the following matching entries.
{code}
[2016-04-09 00:41:35,442] INFO [BrokerChangeListener on Controller 5]: Broker
change listener fired for path /brokers/ids with children 1,4,5
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-04-09 00:41:36,235] INFO [BrokerChangeListener on Controller 5]: Newly
added brokers: , deleted brokers: 3, all live brokers: 1,4,5
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
{code}
However, after
{code}
[2016-04-09 00:37:54,079] INFO [BrokerChangeListener on Controller 5]: Broker
change listener fired for path /brokers/ids with children 1,2,3,5
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
{code}
you don't see the corresponding entry on "Newly added broker" afterward. The
reason is that at 00:37:54, broker 5 has resigned as the controller and
therefore ignored the broker change listener event.
{code}
[2016-04-09 00:37:54,064] INFO [Controller 5]: Broker 5 resigned as the
controller (kafka.controller.KafkaController)
{code}
Broker 5 became the controller again a bit later.
{code}
[2016-04-09 00:40:54,990] INFO [Controller 5]: Controller 5 incremented epoch
to 416 (kafka.controller.KafkaController)
[2016-04-09 00:40:57,877] INFO [Controller 5]: Currently active brokers in the
cluster: Set(1, 3, 4, 5) (kafka.controller.KafkaController)
{code}
As you can see, it initialized with 4 live brokers, including broker 4. It then
sent the following LeaderAndIsrRequest to broker 1 after initialization.
{code}
[2016-04-09 00:40:58,089] TRACE Controller 5 epoch 416 sending become-follower
LeaderAndIsr request (Leader:4,ISR:4,LeaderEpoch:364,ControllerEpoch:415) to
broker 1 for partition [tec1.en2.frontend.syncPing,7] (state.change.logger)
{code}
The above LeaderAndIsrRequest likely includes broker 4 in the live broker list
since there is no other indication that the live broker list has changed since
the initialization. However, if broker 4 is not the metadata cache of broker 1
(which seems to be the case), it can still lead to the following error.
{code}
[2016-04-09 00:40:58,144] ERROR Broker 1 received LeaderAndIsrRequest with
correlation id 0 from controller 5 epoch 415 for partition
[tec1.en2.frontend.syncPing,7] but cannot become follower since the new leader
4 is unavailable. (state.change.logger)
{code}
> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
> Key: KAFKA-3042
> URL: https://issues.apache.org/jira/browse/KAFKA-3042
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.2.1
> Environment: jdk 1.7
> centos 6.4
> Reporter: Jiahongchao
> Fix For: 0.10.0.0
>
> Attachments: controller.log, server.log.2016-03-23-01,
> state-change.log
>
>
> sometimes one broker may repeatly log
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker consider itself as the leader in fact it's
> a follower.
> So after several failed tries, it need to find out who is the leader
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)