[jira] [Comment Edited] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

Andrey Elenskiy (JIRA) Thu, 22 Jun 2017 14:38:50 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060049#comment-16060049
 ]


Andrey Elenskiy edited comment on KAFKA-2729 at 6/22/17 9:37 PM:
-----------------------------------------------------------------

Seeing the same issue on 0.10.2. 

A node running ZooKeeper lost networking for a split second and initiated an 
election, which caused some sessions to expire with:


{noformat}
2017-06-22 02:07:36,092 [myid:3] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@373] - Exception 
causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
running
{noformat}


which caused controller resignation:


{noformat}
[2017-06-22 02:07:36,363] INFO [SessionExpirationListener on 158980], ZK 
expired; shut down all controller components and try to re-elect 
(kafka.controller.KafkaController$SessionExpirationListener)
[2017-06-22 02:07:37,028] DEBUG [Controller 158980]: Controller resigning, 
broker id 158980 (kafka.controller.KafkaController)
[2017-06-22 02:07:37,028] DEBUG [Controller 158980]: De-registering 
IsrChangeNotificationListener (kafka.controller.KafkaController)
[2017-06-22 02:07:37,028] INFO [Partition state machine on Controller 158980]: 
Stopped partition state machine (kafka.controller.PartitionStateMachine)
[2017-06-22 02:07:37,028] INFO [Replica state machine on controller 158980]: 
Stopped replica state machine (kafka.controller.ReplicaStateMachine)
[2017-06-22 02:07:37,028] INFO [Controller 158980]: Broker 158980 resigned as 
the controller (kafka.controller.KafkaController)
{noformat}


and after that the broker kept logging the following in its server log for the 
next 8 hours, until it was restarted manually:

{noformat}
[2017-06-22 17:41:06,928] INFO Partition [A,5] on broker 158980: Shrinking ISR 
for partition [A,5] from 158980,133641,155394 to 158980 
(kafka.cluster.Partition)
[2017-06-22 17:41:06,935] INFO Partition [A,5] on broker 158980: Cached 
zkVersion [73] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
{noformat}
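
To gauge how many partitions are stuck in this state, the broker log can be 
scanned for the "skip updating ISR" message. The following is only a rough 
sketch in Python: the regular expression is derived from the two log lines 
quoted above and nothing else, the function name stuck_partitions is made up 
for illustration, and the sample text is the excerpt from this comment. The 
pattern tolerates the line wrapping seen in this email; other Kafka versions 
may word the message differently.

```python
import re
from collections import Counter

# Pattern built from the log excerpt in this comment (an assumption, not a
# format guaranteed by Kafka). \s+ tolerates wrapping added by mail clients.
MISMATCH = re.compile(
    r"Partition \[(?P<topic>[^\]]+),(?P<partition>\d+)\] on broker (?P<broker>\d+):\s+"
    r"Cached\s+zkVersion\s+\[(?P<version>\d+)\]\s+not\s+equal\s+to\s+that\s+in\s+zookeeper"
)


def stuck_partitions(log_text):
    """Count mismatch messages per (broker, topic, partition, cached zkVersion)."""
    counts = Counter()
    for m in MISMATCH.finditer(log_text):
        key = (int(m["broker"]), m["topic"], int(m["partition"]), int(m["version"]))
        counts[key] += 1
    return counts


# Sample input: the excerpt quoted above, including the email line wrapping.
sample = """\
[2017-06-22 17:41:06,928] INFO Partition [A,5] on broker 158980: Shrinking ISR
for partition [A,5] from 158980,133641,155394 to 158980
(kafka.cluster.Partition)
[2017-06-22 17:41:06,935] INFO Partition [A,5] on broker 158980: Cached
zkVersion [73] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)
"""

for (broker, topic, partition, version), n in stuck_partitions(sample).items():
    print(f"broker {broker}: {topic}-{partition} cached zkVersion {version}, seen {n} time(s)")
```

A partition whose count keeps growing while the cached zkVersion stays the 
same matches the behaviour described here: the broker keeps retrying the ISR 
shrink with the same stale version and never succeeds.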


was (Author: timoha):
Seeing the same issue on 0.10.2. 

A node running ZooKeeper lost networking for a split second and initiated an 
election, which caused some sessions to expire with:

{{2017-06-22 02:07:36,092 [myid:3] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@373] - Exception 
causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
running}}

which caused controller resignation:

{{[2017-06-22 02:07:36,363] INFO [SessionExpirationListener on 158980], ZK 
expired; shut down all controller components and try to re-elect 
(kafka.controller.KafkaController$SessionExpirationListener)
[2017-06-22 02:07:37,028] DEBUG [Controller 158980]: Controller resigning, 
broker id 158980 (kafka.controller.KafkaController)
[2017-06-22 02:07:37,028] DEBUG [Controller 158980]: De-registering 
IsrChangeNotificationListener (kafka.controller.KafkaController)
[2017-06-22 02:07:37,028] INFO [Partition state machine on Controller 158980]: 
Stopped partition state machine (kafka.controller.PartitionStateMachine)
[2017-06-22 02:07:37,028] INFO [Replica state machine on controller 158980]: 
Stopped replica state machine (kafka.controller.ReplicaStateMachine)
[2017-06-22 02:07:37,028] INFO [Controller 158980]: Broker 158980 resigned as 
the controller (kafka.controller.KafkaController)}}

and after that the broker kept logging the following in its server log for the 
next 8 hours, until it was restarted manually:

{{[2017-06-22 17:41:06,928] INFO Partition [A,5] on broker 158980: Shrinking 
ISR for partition [A,5] from 158980,133641,155394 to 158980 
(kafka.cluster.Partition)
[2017-06-22 17:41:06,935] INFO Partition [A,5] on broker 158980: Cached 
zkVersion [73] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)}}


> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where ZooKeeper nodes couldn't reach each other, 
> we started seeing a large number of under-replicated partitions. The ZooKeeper 
> cluster recovered, but the under-replicated partitions persisted. Two brokers 
> in the Kafka cluster were showing this in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers. Both brokers 
> only recovered after a restart. Our own investigation yielded nothing, so I 
> was hoping you could shed some light on this issue. It is possibly related to 
> https://issues.apache.org/jira/browse/KAFKA-1382 , although we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
