[
https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236055#comment-15236055
]
Flavio Junqueira edited comment on KAFKA-3042 at 4/12/16 2:58 PM:
------------------------------------------------------------------
Thanks for the logs. I still don’t have an answer, but I wanted to report some
progress.
I tracked one arbitrarily chosen partition for which broker 1 was complaining
about the zk version: [tec1.en2.frontend.syncPing,7]. Here are a few things
I’ve observed:
Broker 1 is the leader of the partition:
{noformat}
[2016-04-09 00:40:52,883] TRACE Broker 1 cached leader info
(LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:363,ControllerEpoch:414),ReplicationFactor:3),AllReplicas:1,2,4)
for partition [tec1.en2.frontend.syncPing,7] in response to UpdateMetadata
request sent by controller 1 epoch 414 with correlation id 876
(state.change.logger)
{noformat}
but soon after it fails to release leadership to broker 4:
{noformat}
[2016-04-09 00:40:58,106] TRACE Broker 1 received LeaderAndIsr request
(LeaderAndIsrInfo:(Leader:4,ISR:4,LeaderEpoch:364,ControllerEpoch:415),ReplicationFactor:3),AllReplicas:1,2,4)
correlation id 0 from controller 5 epoch 416 for partition
[tec1.en2.frontend.syncPing,7] (state.change.logger)
[2016-04-09 00:40:58,139] TRACE Broker 1 handling LeaderAndIsr request
correlationId 0 from controller 5 epoch 416 starting the become-follower
transition for partition [tec1.en2.frontend.syncPing,7] (state.change.logger)
[2016-04-09 00:40:58,144] ERROR Broker 1 received LeaderAndIsrRequest with
correlation id 0 from controller 5 epoch 415 for partition
[tec1.en2.frontend.syncPing,7] but cannot become follower since the new leader
4 is unavailable. (state.change.logger)
{noformat}
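As a side note on what that ERROR path implies: the broker only completes the become-follower transition if the new leader is in its live-broker set, and otherwise it aborts and keeps whatever state it had. Below is a minimal, self-contained sketch of that check, not the actual ReplicaManager code; {{aliveBrokers}}, {{PartitionState}}, and {{becomeFollower}} are illustrative names.
{noformat}
object BecomeFollowerCheckSketch {
  final case class PartitionState(topicPartition: String, newLeader: Int)

  // Returns true if the transition can proceed; mirrors the shape of the check,
  // not the real implementation.
  def becomeFollower(state: PartitionState, aliveBrokers: Set[Int]): Boolean =
    if (aliveBrokers.contains(state.newLeader)) {
      // The real broker would stop fetchers, truncate the log, and start
      // fetching from the new leader here.
      println(s"becoming follower of broker ${state.newLeader} for ${state.topicPartition}")
      true
    } else {
      // Aborting here leaves whatever leader state the replica had cached.
      println(s"ERROR cannot become follower since the new leader ${state.newLeader} is unavailable")
      false
    }

  def main(args: Array[String]): Unit = {
    val p = PartitionState("[tec1.en2.frontend.syncPing,7]", newLeader = 4)
    // Broker 4 missing from the live-broker set reproduces the aborted transition.
    becomeFollower(p, aliveBrokers = Set(1, 2, 5))
  }
}
{noformat}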
Now, a bit later in the log, the broker says that it is caching the leader info
for the partition:
{noformat}
[2016-04-09 00:42:02,456] TRACE Broker 1 cached leader info
(LeaderAndIsrInfo:(Leader:4,ISR:4,LeaderEpoch:364,ControllerEpoch:415),ReplicationFactor:3),AllReplicas:1,2,4)
for partition [tec1.en2.frontend.syncPing,7] in response to UpdateMetadata
request sent by controller 5 epoch 416 with correlation id 1473
(state.change.logger)
{noformat}
but it keeps printing the “Cached zkVersion…” errors, which indicate that the
broker still believes it is the leader of the partition, or at least that the
variable {{leaderReplicaIdOpt}} is still set that way.
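To make the looping concrete: updateIsr writes the new ISR to ZooKeeper conditionally on the zkVersion it has cached, and when the versions disagree it just logs and skips. Here is a minimal in-memory sketch of that pattern, not Kafka's actual Partition.updateIsr; {{zkStore}}, {{conditionalUpdate}}, and the version numbers are made up for illustration, while the log message and {{leaderReplicaIdOpt}} come from the code and logs discussed here.
{noformat}
object CachedZkVersionSketch {
  // (data, version) pair standing in for the partition state znode in ZooKeeper.
  var zkStore: (String, Int) = ("leader:4,isr:4", 60)

  // Conditional write: succeeds only if the caller's expected version matches.
  def conditionalUpdate(newData: String, expectedVersion: Int): (Boolean, Int) =
    if (expectedVersion == zkStore._2) {
      zkStore = (newData, expectedVersion + 1)
      (true, zkStore._2)
    } else (false, -1)

  var cachedZkVersion = 54                       // stale version cached by the broker
  var leaderReplicaIdOpt: Option[Int] = Some(1)  // broker 1 still thinks it is the leader

  def updateIsr(newIsr: Set[Int]): Unit = {
    val (updated, newVersion) = conditionalUpdate(s"isr:${newIsr.mkString(",")}", cachedZkVersion)
    if (updated) cachedZkVersion = newVersion
    else println(s"Cached zkVersion $cachedZkVersion not equal to that in zookeeper, skip updating ISR")
    // On failure nothing refreshes cachedZkVersion or leaderReplicaIdOpt, so as long as
    // the broker keeps acting as leader, every retry prints the same message.
  }

  def main(args: Array[String]): Unit = {
    updateIsr(Set(1))
    updateIsr(Set(1))
  }
}
{noformat}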
I also inspected other partitions, and the behavior doesn’t seem to be
consistent. I’ve seen at least one partition on broker 2 for which the broker
made the appropriate transition:
{noformat}
[2016-04-09 00:39:23,840] TRACE Broker 2 received LeaderAndIsr request
(LeaderAndIsrInfo:(Leader:3,ISR:2,3,LeaderEpoch:305,ControllerEpoch:414),ReplicationFactor:3),AllReplicas:3,2,4)
correlation id 535 from controller 1 epoch 414 for partition
[tec1.ono_qe1.bodydata.recordings,20] (state.change.logger)
[2016-04-09 00:39:23,841] TRACE Broker 2 handling LeaderAndIsr request
correlationId 535 from controller 1 epoch 414 starting the become-follower
transition for partition [tec1.ono_qe1.bodydata.recordings,20]
(state.change.logger)
[2016-04-09 00:39:23,841] TRACE Broker 2 stopped fetchers as part of
become-follower request from controller 1 epoch 414 with correlation id 535 for
partition [tec1.ono_qe1.bodydata.recordings,20] (state.change.logger)
[2016-04-09 00:39:23,856] TRACE Broker 2 truncated logs and checkpointed
recovery boundaries for partition [tec1.ono_qe1.bodydata.recordings,20] as part
of become-follower request with correlation id 535 from controller 1 epoch 414
(state.change.logger)
[2016-04-09 00:39:23,856] TRACE Broker 2 started fetcher to new leader as part
of become-follower request from controller 1 epoch 414 with correlation id 535
for partition [tec1.ono_qe1.bodydata.recordings,20] (state.change.logger)
[2016-04-09 00:39:23,856] TRACE Broker 2 completed LeaderAndIsr request
correlationId 535 from controller 1 epoch 414 for the become-follower
transition for partition [tec1.ono_qe1.bodydata.recordings,20]
(state.change.logger)
{noformat}
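For contrast with the broker 1 case, here is a rough sketch of that become-follower step sequence over a much-simplified replica state; the step names mirror the trace lines above rather than Kafka's real method signatures.
{noformat}
object BecomeFollowerStepsSketch {
  final case class Replica(partition: String, leaderId: Option[Int], fetching: Boolean)

  def stopFetchers(r: Replica): Replica = {
    println(s"${r.partition}: stopped fetchers"); r.copy(fetching = false)
  }
  def truncateLog(r: Replica): Replica = {
    println(s"${r.partition}: truncated logs and checkpointed recovery boundaries"); r
  }
  def startFetcher(r: Replica, leader: Int): Replica = {
    println(s"${r.partition}: started fetcher to new leader $leader")
    r.copy(leaderId = Some(leader), fetching = true)
  }

  // The ordered steps broker 2's trace walks through for the partition.
  def makeFollower(r: Replica, newLeader: Int): Replica =
    startFetcher(truncateLog(stopFetchers(r)), newLeader)

  def main(args: Array[String]): Unit = {
    val before = Replica("[tec1.ono_qe1.bodydata.recordings,20]", leaderId = Some(2), fetching = false)
    val after  = makeFollower(before, newLeader = 3)
    println(s"completed become-follower transition, now following broker ${after.leaderId.get}")
  }
}
{noformat}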
Actually, the state-change log of broker 2 seems to have a gap starting at
{{[2016-04-09 00:39:46,246]}}. Is that when you restarted the broker?
> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
> Key: KAFKA-3042
> URL: https://issues.apache.org/jira/browse/KAFKA-3042
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.2.1
> Environment: jdk 1.7
> centos 6.4
> Reporter: Jiahongchao
> Attachments: controller.log, server.log.2016-03-23-01,
> state-change.log
>
>
> Sometimes one broker may repeatedly log
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker considers itself the leader when in fact
> it's a follower.
> So after several failed tries, it needs to find out who the leader is.