[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2021-06-29 Thread l0co (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371202#comment-17371202
 ] 

l0co commented on KAFKA-2729:
-

[~junrao] thanks for the reply. Unfortunately, of the logs preserved from this 
breakdown, the only useful entries I have are these:
{code:java}
[2021-06-22 14:06:50,637] INFO 1/kafka0/server.log.2021-06-22-14: [Partition 
__consumer_offsets-30 broker=0] __consumer_offsets-30 starts at Leader Epoch 
117 from offset 2612283. Previous Leader Epoch was: 116 
(kafka.cluster.Partition)
[2021-06-22 14:07:04,184] INFO 1/kafka1/server.log.2021-06-22-14: [Partition 
__consumer_offsets-30 broker=1] Shrinking ISR from 1,2,0 to 1,2 
(kafka.cluster.Partition)
[2021-06-22 14:07:04,186] INFO 1/kafka1/server.log.2021-06-22-14: [Partition 
__consumer_offsets-30 broker=1] Cached zkVersion [212] not equal to that in 
zookeeper, skip updating ISR (kafka.cluster.Partition)
[2021-06-22 14:07:09,146] INFO 1/kafka1/server.log.2021-06-22-14: [Partition 
__consumer_offsets-30 broker=1] Shrinking ISR from 1,2,0 to 1,2 
(kafka.cluster.Partition)
[2021-06-22 14:07:09,147] INFO 1/kafka1/server.log.2021-06-22-14: [Partition 
__consumer_offsets-30 broker=1] Cached zkVersion [212] not equal to that in 
zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
After the zookeeper reconnection on kafka0, kafka0 becomes the leader with 
epoch 117, and then kafka1 starts to complain that its cached zkVersion [212] 
does not match the version in zookeeper, which is a greater number. What does 
this suggest to you? We suspect that the zookeeper node on kafka0 was 
disconnected from the zookeeper nodes on kafka1 and kafka2 and formed its own 
separate cluster, and that after all zookeeper nodes rejoined one cluster, the 
state became inconsistent. Does that make sense to you?
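For anyone reading along, the "cached zkVersion" message comes from ZooKeeper's conditional-update semantics: a write carries an expected znode version and is rejected if the znode has moved on. The following is a minimal self-contained model of that behavior (hypothetical class, not Kafka's or ZooKeeper's actual code) just to illustrate why a stale cached version makes the ISR write fail:

```java
// Minimal model of ZooKeeper's conditional setData: an update is applied
// only when the caller's expected version matches the znode's current
// version. (The real client throws KeeperException.BadVersionException
// instead of returning false.)
public class ZkVersionCas {
    private byte[] data = new byte[0];
    private int version = 0;

    // Mimics setData(path, data, expectedVersion): succeeds and bumps
    // the version only on an exact version match.
    public synchronized boolean setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != version) {
            return false; // version mismatch: the update is rejected
        }
        data = newData;
        version++;
        return true;
    }

    public synchronized int getVersion() { return version; }

    public static void main(String[] args) {
        ZkVersionCas znode = new ZkVersionCas();
        // The controller writes new leader/ISR state: version 0 -> 1.
        boolean controllerWrite = znode.setData("isr=1,2,0".getBytes(), 0);
        // A broker still holding the stale cached version 0 then tries
        // its ISR shrink and is rejected -- the "Cached zkVersion not
        // equal to that in zookeeper" situation from the logs above.
        boolean brokerWrite = znode.setData("isr=1,2".getBytes(), 0);
        System.out.println(controllerWrite + " " + brokerWrite); // true false
    }
}
```

So if the version in zookeeper is already greater than the cached 212, every conditional write from kafka1 with the stale version will keep being rejected.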

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0, 2.4.1
>Reporter: Danil Serdyuchenko
>Assignee: Onur Karaman
>Priority: Critical
> Fix For: 1.1.0
>
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the affected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2021-06-24 Thread l0co (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368661#comment-17368661
 ] 

l0co commented on KAFKA-2729:
-

This problem is certainly not fixed in `1.1.0`, as we still experience it with 
this Kafka version. This ticket should be reopened, unless the problem is being 
resolved elsewhere (KAFKA-3042, KAFKA-7888?).

Our scenario is the following: we have `kafka0`, `kafka1` and `kafka2` nodes.

1. `kafka0` loses its zookeeper connection:
{code:java}
WARN Unable to reconnect to ZooKeeper service, session 0x27a31276f6d has 
expired (org.apache.zookeeper.ClientCnxn)
INFO Unable to reconnect to ZooKeeper service, session 0x27a31276f6d has 
expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
INFO EventThread shut down for session: 0x27a31276f6d 
(org.apache.zookeeper.ClientCnxn)
{code}
2. However, a second later the connection is re-established properly:
{code:java}
[ZooKeeperClient] Initializing a new session to [...] 
(kafka.zookeeper.ZooKeeperClient)
[2021-06-22 14:06:47,838] INFO Opening socket connection to server [...]. Will 
not attempt to authenticate using SASL (unknown error) 
(org.apache.zookeeper.ClientCnxn)
[2021-06-22 14:06:47,873] INFO Socket connection established to [...], 
initiating session (org.apache.zookeeper.ClientCnxn)
[2021-06-22 14:06:47,933] INFO Creating /brokers/ids/0 (is it secure? false) 
(kafka.zk.KafkaZkClient)
[2021-06-22 14:06:47,959] INFO Session establishment complete on server [...], 
sessionid = 0x27a31276f6d0003, negotiated timeout = 6000 
(org.apache.zookeeper.ClientCnxn)
{code}
3. But a few seconds later the `ReplicaFetcherThread`s are shut down on `kafka0`:
{code:java}
INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Shutting down 
(kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Stopped 
(kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Shutdown completed 
(kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Shutting down 
(kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Stopped 
(kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Shutdown completed 
(kafka.server.ReplicaFetcherThread)
{code}
We suppose this shutdown is the source of the problem.

4. Now, because `kafka0` no longer sends replication fetch requests to `kafka1` 
and `kafka2`, `kafka1` and `kafka2` shrink their ISR lists and start to 
complain about the zkVersion.
{code:java}
INFO [Partition __consumer_offsets-30 broker=1] Shrinking ISR from 1,2,0 to 1,2 
(kafka.cluster.Partition)
INFO [Partition __consumer_offsets-30 broker=1] Cached zkVersion [212] not 
equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
This goes on forever, until the whole cluster is restarted. Note that the 
cluster state is now inconsistent, because `kafka0` stops being a replica for 
`kafka1` and `kafka2`, while `kafka1` and `kafka2` still work as replicas for 
`kafka0`. This is because the `ReplicaFetcherThread`s have only been stopped on 
`kafka0`.
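The reason the loop in step 4 never exits can be sketched as follows (assumed simplification with hypothetical names, not the real `kafka.cluster.Partition` code): on a version mismatch the broker merely skips the ISR update, without refreshing its cached zkVersion from ZooKeeper, so every subsequent retry fails the same way:

```java
// Sketch of the non-recovering retry: the broker keeps presenting its
// stale cached zkVersion and, on mismatch, only skips -- it never
// re-reads the current version, so no attempt can ever succeed.
public class StaleIsrLoop {
    static int zookeeperVersion = 213; // version after the controller's write
    static int cachedZkVersion = 212;  // broker's stale cached value

    // Returns true if the conditional ISR update was applied.
    static boolean tryShrinkIsr() {
        if (cachedZkVersion != zookeeperVersion) {
            // "Cached zkVersion [212] not equal to that in zookeeper,
            //  skip updating ISR" -- and the cache is left untouched.
            return false;
        }
        cachedZkVersion = ++zookeeperVersion;
        return true;
    }

    public static void main(String[] args) {
        int applied = 0;
        for (int round = 0; round < 5; round++) { // five retry rounds...
            if (tryShrinkIsr()) applied++;
        }
        System.out.println(applied); // 0: every attempt is skipped
    }
}
```

A broker restart breaks the cycle only because it forces the cached state to be rebuilt from ZooKeeper.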

5. Finally, the whole kafka cluster stops processing events, at least for 
partitions led by `kafka0`, because of:
{code:java}
ERROR [ReplicaManager broker=0] Error processing append operation on partition 
__consumer_offsets-18 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.NotEnoughReplicasException: Number of insync 
replicas for partition __consumer_offsets-18 is [1], below required minimum [2]
{code}
We also suspect that in this scenario `kafka0` becomes the leader for all 
partitions, but this is not confirmed yet.
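The `NotEnoughReplicasException` in step 5 follows directly from the shrunken ISR: with `acks=all`, an append is rejected when the in-sync replica count drops below `min.insync.replicas`. A minimal sketch of that check (an assumed simplification, not the real `kafka.server.ReplicaManager` code):

```java
import java.util.List;

// Sketch of the min.insync.replicas gate behind NotEnoughReplicasException:
// with acks=all, an append to a partition is rejected when the ISR has
// fewer members than min.insync.replicas.
public class MinIsrCheck {
    static boolean appendAllowed(List<Integer> isr, int minInsyncReplicas) {
        return isr.size() >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        // Healthy partition: ISR {1,2,0}, min.insync.replicas=2 -> appends ok.
        System.out.println(appendAllowed(List.of(1, 2, 0), 2)); // true
        // In the scenario above the ISR is down to the leader alone,
        // so every acks=all append fails with NotEnoughReplicasException.
        System.out.println(appendAllowed(List.of(0), 2)); // false
    }
}
```

This is why the cluster appears dead for these partitions even though the leader itself is up: the leader refuses writes it cannot replicate to enough in-sync followers.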

 
