Hi everyone, Fortunately this didn't happen with a production cluster, but here is the sequence of events with this 10 broker cluster: 1) Stop all kafka brokers servers (this was a mistake on my part; I intended to do a rolling restart) 2) Start 5 brokers a few minutes apart each 3) Start the rest of the brokers at roughly the same time 4) While watching my metrics, leader election started happening, zookeepers started getting busier; then there was a drop off zk utilization. 5) At that point, kafkas remained busy. Describe topics showed that all 62k 3x replicated partitions had leaders, all had their 3 replicas, but 45k had their full ISR, 14k had 2 ISR, 3k 1 ISR. 6) After waiting for 30 minutes, the ISR numbers didn't change.
Zookeeper logs spew this: [2018-01-18 03:18:56,645] INFO Accepted socket connection from /172.16.253.90:51082 (org.apache.zookeeper.server.NIOServerCnxnFactory) [2018-01-18 03:18:56,645] INFO Client attempting to renew session 0xab61063d85310002 at /172.16.253.90:51082 (org.apache.zookeeper.server.ZooKeeperServer) [2018-01-18 03:18:56,645] INFO Established session 0xab61063d85310002 with negotiated timeout 6000 for client /172.16.253.90:51082 (org.apache.zookeeper.server.ZooKeeperServer) [2018-01-18 03:18:56,647] WARN Exception causing close of session 0xab61063d85310002 due to java.io.IOException: Len error 1251065 (org.apache.zookeeper.server.NIOServerCnxn) [2018-01-18 03:18:56,647] INFO Closed socket connection for client /172.16.253.90:51082 which had sessionid 0xab61063d85310002 (org.apache.zookeeper.server.NIOServerCnxn) Kafka logs spew this: [2018-01-18 03:19:57,751] INFO Session establishment complete on server ip-172-16-253-167.us-west-1.compute.internal/172.16.253.167:2181, sessionid = 0xab61063d85310002, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn) [2018-01-18 03:19:57,751] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient) [2018-01-18 03:19:57,753] INFO Unable to read additional data from server sessionid 0xab61063d85310002, likely server has closed socket, closing socket connection and attempting rec onnect (org.apache.zookeeper.ClientCnxn) [2018-01-18 03:19:57,854] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient) [2018-01-18 03:19:57,854] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient) The only thing that I found is the following link, but I'm running kafka-1.0 bundled ZK, which is 3.4 and has that fix described in it: https://community.hortonworks.com/content/supportkb/150660/error-javaioioexception-len-error-in-zookeeper-cau.html Unfortunately, I am unable to consume or produce to any of the existing topics. I was able to create new topics and produce to them at reasonable rates with the environment in the bad state. Consume from these new topics didn't work. How should I proceed from here? For some extra context: I'm researching how many replicated partitions kafka can support before it falls over and/or starts failing to meet various SLAs we've made with our users. I'll try to "correct" the original mistake that I made, however, I hope that an operator error like mine doesn't take out a production cluster this like this :). Best regards, Andrey Falko Salesforce.com