[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368661#comment-17368661 ]
l0co commented on KAFKA-2729:
-----------------------------

This problem is certainly not fixed in `1.1.0`, as we still experience it with this Kafka version. This ticket should be reopened, unless the problem is being resolved elsewhere (KAFKA-3042, KAFKA-7888?). Our scenario is the following: we have `kafka0`, `kafka1` and `kafka2` nodes.

1. `kafka0` loses its ZooKeeper connection:
{code:java}
WARN Unable to reconnect to ZooKeeper service, session 0x27a31276f6d0000 has expired (org.apache.zookeeper.ClientCnxn)
INFO Unable to reconnect to ZooKeeper service, session 0x27a31276f6d0000 has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
INFO EventThread shut down for session: 0x27a31276f6d0000 (org.apache.zookeeper.ClientCnxn)
{code}

2. However, a second later the connection is re-established properly:
{code:java}
[ZooKeeperClient] Initializing a new session to [...] (kafka.zookeeper.ZooKeeperClient)
[2021-06-22 14:06:47,838] INFO Opening socket connection to server [...]. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2021-06-22 14:06:47,873] INFO Socket connection established to [...], initiating session (org.apache.zookeeper.ClientCnxn)
[2021-06-22 14:06:47,933] INFO Creating /brokers/ids/0 (is it secure? false) (kafka.zk.KafkaZkClient)
[2021-06-22 14:06:47,959] INFO Session establishment complete on server [...], sessionid = 0x27a31276f6d0003, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
{code}

3. But a few seconds later the `ReplicaFetcherThread`s are shut down on `kafka0`:
{code:java}
INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Shutting down (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Shutting down (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherThread)
{code}
We suppose this shutdown is the source of the problem.

4. Now, because no replication fetch requests arrive from `kafka0` anymore, `kafka1` and `kafka2` shrink their ISR lists and start to complain about the zkVersion:
{code:java}
INFO [Partition __consumer_offsets-30 broker=1] Shrinking ISR from 1,2,0 to 1,2 (kafka.cluster.Partition)
INFO [Partition __consumer_offsets-30 broker=1] Cached zkVersion [212] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
This goes on forever, until the whole cluster is restarted. Note that the cluster state is now inconsistent: `kafka0` stops acting as a replica for `kafka1` and `kafka2`, but `kafka1` and `kafka2` still act as replicas for `kafka0`, because the `ReplicaFetcherThread` has only been stopped on `kafka0`.

5. Finally, the whole Kafka cluster stops processing events, at least for partitions led by `kafka0`, because of:
{code:java}
ERROR [ReplicaManager broker=0] Error processing append operation on partition __consumer_offsets-18 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.NotEnoughReplicasException: Number of insync replicas for partition __consumer_offsets-18 is [1], below required minimum [2]
{code}
We also suspect that in this scenario `kafka0` becomes the leader for all partitions, but this is not confirmed yet.
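For anyone monitoring for this stuck state: while the cluster is wedged like this, the shrunken ISRs stay visible from the outside, because the brokers skip the ZooKeeper ISR update on every retry. Below is a minimal detection sketch using the Java `AdminClient` (my own illustration, not part of Kafka or this ticket; the bootstrap address and class name are placeholders) that lists partitions whose ISR is smaller than their assigned replica set:
{code:java}
import java.util.Collection;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.clients.admin.TopicListing;
import org.apache.kafka.common.TopicPartitionInfo;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap server; point this at any reachable broker.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Collect all topic names, then describe them for per-partition state.
            Collection<String> names = admin.listTopics().listings().get().stream()
                    .map(TopicListing::name)
                    .collect(Collectors.toList());
            for (TopicDescription topic : admin.describeTopics(names).all().get().values()) {
                for (TopicPartitionInfo p : topic.partitions()) {
                    // A healthy partition has every assigned replica in the ISR.
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("%s-%d: replicas=%d, isr=%d (under-replicated)%n",
                                topic.name(), p.partition(), p.replicas().size(), p.isr().size());
                    }
                }
            }
        }
    }
}
{code}
In the scenario above, running this repeatedly should keep reporting the same partitions indefinitely (rather than recovering after the usual replica lag timeout), which distinguishes this bug from a transient ISR shrink.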
> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0, 2.4.1
>            Reporter: Danil Serdyuchenko
>            Assignee: Onur Karaman
>            Priority: Critical
>             Fix For: 1.1.0
>
>
> After a small network wobble where zookeeper nodes couldn't reach each other, we started seeing a large number of under-replicated partitions. The zookeeper cluster recovered, however we continued to see a large number of under-replicated partitions. Two brokers in the kafka cluster were showing this in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers. Both brokers only recovered after a restart. Our own investigation yielded nothing; I was hoping you could shed some light on this issue. Possibly it's related to https://issues.apache.org/jira/browse/KAFKA-1382, however we're using 0.8.2.1.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)