Unrecoverable kafka cluster after failed rolling restart

Andrey Falko Wed, 17 Jan 2018 20:02:01 -0800

Hi everyone,

Fortunately this didn't happen with a production cluster, but here is
the sequence of events with this 10 broker cluster:
1) Stop all kafka broker servers (this was a mistake on my part; I
intended to do a rolling restart)
2) Start 5 brokers a few minutes apart each
3) Start the rest of the brokers at roughly the same time
4) While watching my metrics, leader election started happening and
zookeepers started getting busier; then there was a drop-off in zk
utilization.
5) At that point, kafkas remained busy. Describe topics showed that
all 62k 3x replicated partitions had leaders, all had their 3
replicas, but 45k had their full ISR, 14k had 2 ISR, 3k 1 ISR.
6) After waiting for 30 minutes, the ISR numbers didn't change.
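
For reference, the numbers in step 5 add up to 45k + 14k + 3k = 62k partitions, so about 17k (roughly 27%) were under-replicated. A minimal sketch of how such a breakdown could be tallied from saved `kafka-topics.sh --describe` output follows. The function name `isr_histogram` is hypothetical, and the regex assumes the usual `Replicas: ... Isr: ...` layout of the per-partition lines; this is not necessarily how the counts above were produced.

```python
import re
from collections import Counter

# Assumed layout of a per-partition line from `kafka-topics.sh --describe`:
#   Topic: t  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
PARTITION_LINE = re.compile(
    r"Replicas:\s*(?P<replicas>[\d,]+)\s+Isr:\s*(?P<isr>[\d,]*)"
)


def isr_histogram(describe_output: str) -> Counter:
    """Count partitions by the number of brokers currently in their ISR."""
    hist = Counter()
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue  # topic summary lines and blank lines are skipped
        isr = [b for b in match.group("isr").split(",") if b]
        hist[len(isr)] += 1
    return hist
```

Calling `isr_histogram(open("describe.txt").read())` (the file name is only an example) returns a `Counter` keyed by ISR size, so for a 3x replicated cluster the entries for keys 1 and 2 are the under-replicated partitions.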


Zookeeper logs spew this:
[2018-01-18 03:18:56,645] INFO Accepted socket connection from
/172.16.253.90:51082
(org.apache.zookeeper.server.NIOServerCnxnFactory)
[2018-01-18 03:18:56,645] INFO Client attempting to renew session
0xab61063d85310002 at /172.16.253.90:51082
(org.apache.zookeeper.server.ZooKeeperServer)
[2018-01-18 03:18:56,645] INFO Established session 0xab61063d85310002
with negotiated timeout 6000 for client /172.16.253.90:51082
(org.apache.zookeeper.server.ZooKeeperServer)
[2018-01-18 03:18:56,647] WARN Exception causing close of session
0xab61063d85310002 due to java.io.IOException: Len error 1251065
(org.apache.zookeeper.server.NIOServerCnxn)
[2018-01-18 03:18:56,647] INFO Closed socket connection for client
/172.16.253.90:51082 which had sessionid 0xab61063d85310002
(org.apache.zookeeper.server.NIOServerCnxn)
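
One possible reading of the "Len error" line, offered as an assumption and not something confirmed in this thread: the ZooKeeper server rejects any request whose length prefix exceeds `jute.maxbuffer`, which defaults to 0xfffff (1048575) bytes, and 1251065 is above that default. The sketch below only does that comparison on a saved log excerpt; the function name `oversized_requests` is hypothetical, and the limit should be replaced with whatever value the ensemble actually runs with.

```python
import re

# ZooKeeper's documented default for jute.maxbuffer (0xfffff bytes, just under 1 MiB).
# Assumption: the cluster in question has not overridden it.
DEFAULT_JUTE_MAXBUFFER = 0xFFFFF

LEN_ERROR = re.compile(r"Len error (\d+)")


def oversized_requests(log_text: str, limit: int = DEFAULT_JUTE_MAXBUFFER) -> list:
    """Return every 'Len error N' value in the log that exceeds the limit."""
    sizes = (int(n) for n in LEN_ERROR.findall(log_text))
    return [n for n in sizes if n > limit]
```

For the excerpt above this reports 1251065, which is 202490 bytes over the default. If that reading is right, the ZooKeeper administrator documentation notes that `jute.maxbuffer` is a Java system property and has to be set consistently on all servers and clients to take effect.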

Kafka logs spew this:
[2018-01-18 03:19:57,751] INFO Session establishment complete on
server ip-172-16-253-167.us-west-1.compute.internal/172.16.253.167:2181,
sessionid = 0xab61063d85310002, negotiated
timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2018-01-18 03:19:57,751] INFO zookeeper state changed (SyncConnected)
(org.I0Itec.zkclient.ZkClient)
[2018-01-18 03:19:57,753] INFO Unable to read additional data from
server sessionid 0xab61063d85310002, likely server has closed socket,
closing socket connection and attempting reconnect
(org.apache.zookeeper.ClientCnxn)
[2018-01-18 03:19:57,854] INFO zookeeper state changed (Disconnected)
(org.I0Itec.zkclient.ZkClient)
[2018-01-18 03:19:57,854] INFO Waiting for keeper state SyncConnected
(org.I0Itec.zkclient.ZkClient)

The only thing that I found is the following link, but I'm running
kafka-1.0 bundled ZK, which is 3.4 and has that fix described in it:
https://community.hortonworks.com/content/supportkb/150660/error-javaioioexception-len-error-in-zookeeper-cau.html

Unfortunately, I am unable to consume from or produce to any of the
existing topics. I was able to create new topics and produce to them
at reasonable rates with the environment in the bad state. Consuming
from these new topics didn't work.

How should I proceed from here? For some extra context: I'm
researching how many replicated partitions kafka can support before it
falls over and/or starts failing to meet various SLAs we've made with
our users. I'll try to "correct" the original mistake that I made;
however, I hope that an operator error like mine doesn't take out a
production cluster like this :).

Best regards,
Andrey Falko
Salesforce.com
