[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368661#comment-17368661 ]
l0co commented on KAFKA-2729:
-----------------------------

This problem is certainly not fixed in `1.1.0`, as we still experience it with this Kafka version. This ticket should be reopened, unless the problem is being resolved elsewhere (KAFKA-3042, KAFKA-7888?). Our scenario is the following: we have `kafka0`, `kafka1` and `kafka2` nodes.

1. `kafka0` loses its ZooKeeper connection:
{code:java}
WARN Unable to reconnect to ZooKeeper service, session 0x27a31276f6d0000 has expired (org.apache.zookeeper.ClientCnxn)
INFO Unable to reconnect to ZooKeeper service, session 0x27a31276f6d0000 has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
INFO EventThread shut down for session: 0x27a31276f6d0000 (org.apache.zookeeper.ClientCnxn)
{code}

2. However, a second later the connection is re-established properly:
{code:java}
[ZooKeeperClient] Initializing a new session to [...] (kafka.zookeeper.ZooKeeperClient)
[2021-06-22 14:06:47,838] INFO Opening socket connection to server [...]. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2021-06-22 14:06:47,873] INFO Socket connection established to [...], initiating session (org.apache.zookeeper.ClientCnxn)
[2021-06-22 14:06:47,933] INFO Creating /brokers/ids/0 (is it secure? false) (kafka.zk.KafkaZkClient)
[2021-06-22 14:06:47,959] INFO Session establishment complete on server [...], sessionid = 0x27a31276f6d0003, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
{code}

3. But a few seconds later the `ReplicaFetcherThread`s are shut down on `kafka0`:
{code:java}
INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Shutting down (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Shutting down (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherThread)
{code}
We suppose this shutdown is the source of the problem.

4. Now, because no replication fetch requests arrive from `kafka0` anymore, `kafka1` and `kafka2` shrink their ISR lists and start to complain about the zkVersion:
{code:java}
INFO [Partition __consumer_offsets-30 broker=1] Shrinking ISR from 1,2,0 to 1,2 (kafka.cluster.Partition)
INFO [Partition __consumer_offsets-30 broker=1] Cached zkVersion [212] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
This goes on forever, until the whole cluster is restarted. Note that the cluster state is now inconsistent: `kafka0` stops acting as a replica for `kafka1` and `kafka2`, but `kafka1` and `kafka2` still act as replicas for `kafka0`, because the `ReplicaFetcherThread` has only been stopped on `kafka0`.

5. Finally, the whole Kafka cluster stops processing events, at least for partitions led by `kafka0`, because of:
{code:java}
ERROR [ReplicaManager broker=0] Error processing append operation on partition __consumer_offsets-18 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.NotEnoughReplicasException: Number of insync replicas for partition __consumer_offsets-18 is [1], below required minimum [2]
{code}
We also suspect that in this scenario `kafka0` becomes the leader for all partitions, but this is not confirmed yet.
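For anyone monitoring for this stuck state: while the cluster is wedged like this, the shrunken ISRs stay visible from the outside, because the brokers skip the ZooKeeper ISR update on every retry. Below is a minimal detection sketch using the Java `AdminClient` (my own illustration, not part of Kafka or this ticket; the bootstrap address and class name are placeholders) that lists partitions whose ISR is smaller than their assigned replica set:
{code:java}
import java.util.Collection;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.clients.admin.TopicListing;
import org.apache.kafka.common.TopicPartitionInfo;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap server; point this at any reachable broker.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Collect all topic names, then describe them for per-partition state.
            Collection<String> names = admin.listTopics().listings().get().stream()
                    .map(TopicListing::name)
                    .collect(Collectors.toList());
            for (TopicDescription topic : admin.describeTopics(names).all().get().values()) {
                for (TopicPartitionInfo p : topic.partitions()) {
                    // A healthy partition has every assigned replica in the ISR.
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("%s-%d: replicas=%d, isr=%d (under-replicated)%n",
                                topic.name(), p.partition(), p.replicas().size(), p.isr().size());
                    }
                }
            }
        }
    }
}
{code}
In the scenario above, running this repeatedly should keep reporting the same partitions indefinitely (rather than recovering after the usual replica lag timeout), which distinguishes this bug from a transient ISR shrink.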
> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0, 2.4.1
>            Reporter: Danil Serdyuchenko
>            Assignee: Onur Karaman
>            Priority: Critical
>             Fix For: 1.1.0
>
>
> After a small network wobble where zookeeper nodes couldn't reach each other, we started seeing a large number of under-replicated partitions. The zookeeper cluster recovered, however we continued to see a large number of under-replicated partitions. Two brokers in the kafka cluster were showing this in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers. Both brokers only recovered after a restart. Our own investigation yielded nothing; I was hoping you could shed some light on this issue. Possibly it's related to https://issues.apache.org/jira/browse/KAFKA-1382, however we're using 0.8.2.1.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)