[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879124#comment-15879124 ]
Grant Henke commented on KAFKA-2729:
------------------------------------

I am curious whether everyone on this Jira is actually seeing the reported issue. I have had multiple cases where someone presented me with an environment they thought was experiencing this issue. After researching the environment and logs, to date it has always been something else. The main culprits so far have been:
* Long GC pauses causing zookeeper sessions to time out
* Slow or poorly configured zookeeper
* Bad network configuration

All of the above resulted in a soft, recurring failure of brokers. That churn often caused additional load, perpetuating the issue. If you are seeing this issue, do you see the following pattern repeating in the logs?
{noformat}
INFO org.I0Itec.zkclient.ZkClient: zookeeper state changed (Disconnected)
...
INFO org.I0Itec.zkclient.ZkClient: zookeeper state changed (Expired)
INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x153ab38abdbd360 has expired, closing socket connection
...
INFO org.I0Itec.zkclient.ZkClient: zookeeper state changed (SyncConnected)
INFO kafka.server.KafkaHealthcheck: re-registering broker info in ZK for broker 32
INFO kafka.utils.ZKCheckedEphemeral: Creating /brokers/ids/32 (is it secure? false)
INFO kafka.utils.ZKCheckedEphemeral: Result of znode creation is: OK
{noformat}
If so, something is causing communication with zookeeper to take too long and the broker is unregistering itself. This will cause ISRs to shrink and expand over and over again. I don't think this will solve everyone's issue here, but hopefully it will help solve some.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, we started seeing a large number of under-replicated partitions. The zookeeper cluster recovered, however we continued to see a large number of under-replicated partitions. Two brokers in the kafka cluster were showing this in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This appeared for all of the topics on the affected brokers. Both brokers only recovered after a restart. Our own investigation yielded nothing; I was hoping you could shed some light on this issue. It may be related to https://issues.apache.org/jira/browse/KAFKA-1382, however we're using 0.8.2.1.
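
A note on the quoted "Cached zkVersion not equal" log line: below is a minimal sketch of the optimistic, version-checked ZooKeeper write that this style of message reflects, using the plain Apache ZooKeeper Java client. The class name, znode path, and printouts are hypothetical and only illustrate the pattern; this is not Kafka's actual implementation.
{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkVersionCheckSketch {

    // Hypothetical partition-state path, for illustration only.
    private static final String STATE_PATH = "/brokers/topics/my-topic/partitions/3/state";

    /**
     * Attempts a version-checked write of the partition state znode.
     * Returns true if the write succeeded, false if the cached version was stale.
     */
    static boolean tryUpdateIsr(ZooKeeper zk, byte[] newState, int cachedZkVersion)
            throws KeeperException, InterruptedException {
        try {
            // setData only succeeds if the znode's current data version matches cachedZkVersion.
            Stat stat = zk.setData(STATE_PATH, newState, cachedZkVersion);
            System.out.println("ISR updated, new zkVersion=" + stat.getVersion());
            return true;
        } catch (KeeperException.BadVersionException e) {
            // Another writer (typically the controller) changed the znode since our last read,
            // so the cached version is stale and the update is skipped -- the situation behind
            // a "Cached zkVersion [...] not equal to that in zookeeper, skip updating ISR" line.
            System.out.println("Cached zkVersion " + cachedZkVersion + " is stale, skipping update");
            return false;
        }
    }
}
{code}
The broker caches the version returned by its last successful read or write of the partition state znode. If the controller has rewritten that znode in the meantime (for example after a session expiry and leadership change), the conditional write fails and the ISR update is skipped, which is why the message can repeat indefinitely when the broker never refreshes its cached state.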
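
And on the session-expiry culprits listed in the comment above, a rough starting point for checking GC pauses against the broker's ZooKeeper session timeout. The values are placeholders from the 0.8.x/0.9.x era and the GC flags assume a Java 7/8 JVM; tune both to your environment.
{noformat}
# server.properties (broker)
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
# The broker's ZooKeeper session expires if it is unresponsive for longer than this.
# GC pauses approaching this value produce the Expired / re-register pattern above.
zookeeper.session.timeout.ms=6000
zookeeper.connection.timeout.ms=6000

# JVM GC logging (e.g. via KAFKA_GC_LOG_OPTS) to confirm or rule out long pauses:
-Xloggc:/var/log/kafka/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
{noformat}
If the GC log shows pauses on the order of the session timeout, tune the heap/GC or raise zookeeper.session.timeout.ms; otherwise look at ZooKeeper performance and the network path between the brokers and the ensemble.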