We have a Kafka cluster with 22 nodes, hosting ~3700 topics and ~15000 
partitions.
The cluster ran fine for a long time, but one day a bunch of brokers (around 
half of the cluster) started dropping out of ISRs, with the following messages 
on the partition leaders:

A)

[2016-01-12 19:01:19,363] INFO Partition [RADM_3600_7,0] on broker 18: 
Shrinking ISR for partition [RADM_3600_7,0] from 18,25 to 18 
(kafka.cluster.Partition)
[2016-01-12 19:01:19,367] INFO Partition [RADM_3600_7,0] on broker 18: Cached 
zkVersion [5] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)

B) Correspondingly, we had messages on the ZooKeeper leader which looked like:
Tue Jan 12 19:01:19 2016: 2016-01-12 19:01:19,364 - INFO  [ProcessThread(sid:2 
cport:-1)::PrepRequestProcessor@627] - Got user-level KeeperException when 
processing sessionid:0x251a8968b80f32d type:setData cxid:0x882ade 
zxid:0x501b0a9d1 txntype:-1 reqpath:n/a Error 
Path:/brokers/topics/RADM_3600_7/partitions/0/state Error:KeeperErrorCode = 
BadVersion for /brokers/topics/RADM_3600_7/partitions/0/state
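
For anyone tying A) and B) together: as far as we understand it, the broker 
shrinks the ISR by rewriting the partition's state znode with a conditional 
setData that passes the zkVersion it has cached; if the controller (or anything 
else) has bumped that znode in the meantime, ZooKeeper rejects the write with 
BadVersion and the broker skips the ISR update. A minimal sketch of that 
conditional write with the plain ZooKeeper Java client (the connection string, 
path and state JSON below are just illustrative, not taken from our setup):

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalIsrUpdate {
    public static void main(String[] args) throws Exception {
        // Wait until the session is actually connected before issuing requests.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        String path = "/brokers/topics/RADM_3600_7/partitions/0/state";

        // Read the state znode; stat.getVersion() is the zkVersion a broker caches.
        Stat stat = new Stat();
        zk.getData(path, false, stat);
        int cachedVersion = stat.getVersion();

        // New state with the shrunken ISR (JSON contents are illustrative).
        byte[] shrunkenState =
                "{\"leader\":18,\"isr\":[18]}".getBytes(StandardCharsets.UTF_8);

        try {
            // Conditional write: succeeds only if the znode is still at
            // cachedVersion. A stale cached version fails with BadVersion,
            // which is what the ZooKeeper leader logs in B), and the broker
            // then logs "Cached zkVersion ... skip updating ISR" as in A).
            zk.setData(path, shrunkenState, cachedVersion);
        } catch (KeeperException.BadVersionException e) {
            System.out.println("Cached zkVersion " + cachedVersion
                    + " no longer matches ZooKeeper; skipping ISR update");
        } finally {
            zk.close();
        }
    }
}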


C) In the controller, we were getting messages like: 

[2016-01-12 17:50:15,908] INFO [PreferredReplicaPartitionLeaderSelector]: 
Current leader 25 for partition [RADM_3600_7,0] is not the preferred replica. 
Trigerring preferred replica leader election 
(kafka.controller.PreferredReplicaPartitionLeaderSelector)
[2016-01-12 17:50:15,908] WARN [Controller 17]: Partition [RADM_3600_7,0] 
failed to complete preferred replica leader election. Leader is 25 
(kafka.controller.KafkaController)
These controller messages appeared before the ISR shrinking in A).
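
Our rough understanding of what the controller does during that election (a 
simplified sketch, not the actual Kafka code): the first replica in the 
assignment is treated as the preferred replica, and it can only take over 
leadership if it is alive and in the ISR; otherwise the election fails and the 
current leader is kept, which looks like the WARN above.

import java.util.List;
import java.util.Set;

public class PreferredReplicaSelectorSketch {
    // Simplified stand-in for the controller's preferred replica election.
    static int selectLeader(List<Integer> assignedReplicas,
                            Set<Integer> isr,
                            Set<Integer> liveBrokers,
                            int currentLeader) {
        // The preferred replica is the first one in the assignment.
        int preferred = assignedReplicas.get(0);
        if (liveBrokers.contains(preferred) && isr.contains(preferred)) {
            return preferred; // leadership moves to the preferred replica
        }
        // Preferred replica is down or out of the ISR: the election fails and
        // leadership stays with the current leader, as in the WARN above.
        throw new IllegalStateException("Preferred replica " + preferred
                + " is not alive or not in the ISR; leader stays " + currentLeader);
    }
}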
Around 70-80% of the partitions were operating with only one broker in the 
ISR. We had to clean the state - the data, the topics, everything - to finally 
fix this.
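
For what it's worth, a quick way to get a count like that is to walk the 
partition state znodes directly; a rough sketch with the ZooKeeper Java client 
(the connection string is illustrative, and the JSON check is deliberately 
naive):

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SingleIsrCounter {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        int total = 0, singleIsr = 0;
        for (String topic : zk.getChildren("/brokers/topics", false)) {
            String partitionsPath = "/brokers/topics/" + topic + "/partitions";
            List<String> partitions;
            try {
                partitions = zk.getChildren(partitionsPath, false);
            } catch (Exception e) {
                continue; // topic not fully initialised yet
            }
            for (String partition : partitions) {
                byte[] data = zk.getData(
                        partitionsPath + "/" + partition + "/state", false, null);
                String state = new String(data, StandardCharsets.UTF_8);
                total++;
                // The state znode is JSON like {...,"isr":[18,25]}; an "isr"
                // array without a comma has a single broker in it.
                String isr = state.replaceAll(".*\"isr\":\\[([^\\]]*)\\].*", "$1");
                if (!isr.contains(",")) {
                    singleIsr++;
                }
            }
        }
        System.out.println(singleIsr + " of " + total
                + " partitions have a single broker in the ISR");
        zk.close();
    }
}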
Also, we have another deployment which mirrors this one and which has been 
running fine.

G'day,
Chiru
