Hi,
We've had an issue on a live system (3 brokers, ~10 topics, some replicated,
some partitioned) where a partition wasn't properly reassigned, causing several
other partitions to go down.
First, this exception happened on broker 1 (we weren't doing anything
particular on the system at the time):
ERROR [AddPartitionsListener on 1]: Error while handling add partitions for data path /brokers/topics/topic1 (kafka.controller.PartitionStateMachine$AddPartitionsListener)
java.util.NoSuchElementException: key not found: [topic1,0]
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
    at kafka.controller.ControllerContext$$anonfun$replicasForPartition$1.apply(KafkaController.scala:112)
    at kafka.controller.ControllerContext$$anonfun$replicasForPartition$1.apply(KafkaController.scala:111)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:47)
    at scala.collection.SetLike$class.map(SetLike.scala:93)
    at scala.collection.AbstractSet.map(Set.scala:47)
    at kafka.controller.ControllerContext.replicasForPartition(KafkaController.scala:111)
    at kafka.controller.KafkaController.onNewPartitionCreation(KafkaController.scala:485)
    at kafka.controller.PartitionStateMachine$AddPartitionsListener$$anonfun$handleDataChange$1.apply$mcV$sp(PartitionStateMachine.scala:530)
    at kafka.controller.PartitionStateMachine$AddPartitionsListener$$anonfun$handleDataChange$1.apply(PartitionStateMachine.scala:519)
    at kafka.controller.PartitionStateMachine$AddPartitionsListener$$anonfun$handleDataChange$1.apply(PartitionStateMachine.scala:519)
    at kafka.utils.Utils$.inLock(Utils.scala:535)
    at kafka.controller.PartitionStateMachine$AddPartitionsListener.handleDataChange(PartitionStateMachine.scala:518)
    at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:547)
    at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
At this point, broker 2 started continually spamming these messages (mentioning
other topics, not just topic1):
ERROR [ReplicaFetcherThread-0-1], Error for partition [othertopic1,2] to broker 1:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-1], Error for partition [othertopic2,0] to broker 1:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-1], Error for partition [othertopic3,0] to broker 1:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-1], Error for partition [topic1,0] to broker 1:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)
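To see at a glance which partitions broker 2 was failing to fetch, the lines above can be tallied with a small script. This is only a rough sketch: the regex is written against the log format quoted above, and the names FETCH_ERROR and tally_fetch_errors are made up for illustration, not anything that ships with Kafka.

```python
import re
from collections import Counter

# Matches the ReplicaFetcherThread error lines quoted above.
# \s+ tolerates a line wrap (as added by mail clients) around "to broker".
FETCH_ERROR = re.compile(
    r"ERROR \[(?P<thread>ReplicaFetcherThread-\d+-\d+)\], "
    r"Error for partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\]\s+to broker\s+"
    r"(?P<broker>\d+):class (?P<error>[\w.]+)"
)


def tally_fetch_errors(log_text):
    """Count fetch errors per (topic, partition, leader broker, error class)."""
    counts = Counter()
    for m in FETCH_ERROR.finditer(log_text):
        key = (m["topic"], int(m["partition"]), int(m["broker"]), m["error"])
        counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python tally.py < broker2-server.log
    for key, n in tally_fetch_errors(sys.stdin.read()).most_common():
        print(n, key)
```

Grouping by leader broker makes it easy to confirm that every failing fetch was directed at broker 1.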
And broker 1 had these messages, but only for topic1:
ERROR [KafkaApi-1] error when handling request Name: FetchRequest; Version: 0; CorrelationId: 41182755; ClientId: ReplicaFetcherThread-0-1; ReplicaId: 2; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [topic1,0] -> PartitionFetchInfo(0,1048576) (kafka.server.KafkaApis)
kafka.common.NotAssignedReplicaException: Leader 1 failed to record follower 2's position 0 since the replica is not recognized to be one of the assigned replicas 1 for partition [topic1,0]
    at kafka.server.ReplicaManager.updateReplicaLEOAndPartitionHW(ReplicaManager.scala:574)
    at kafka.server.KafkaApis$$anonfun$recordFollowerLogEndOffsets$2.apply(KafkaApis.scala:388)
    at kafka.server.KafkaApis$$anonfun$recordFollowerLogEndOffsets$2.apply(KafkaApis.scala:386)
    at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
    at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
    at kafka.server.KafkaApis.recordFollowerLogEndOffsets(KafkaApis.scala:386)
    at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:351)
    at kafka.server.KafkaApis.handle(KafkaApis.scala:60)
    at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:59)
    at java.lang.Thread.run(Thread.java:745)
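As far as I can tell from the message, broker 1 believes the assigned replicas for [topic1,0] are just itself, while broker 2 still believes it is a follower and keeps fetching. Below is a toy model of that mismatch. It is my reading of the error text only, not Kafka's actual code; NotAssignedReplicaError and record_follower_position are hypothetical names.

```python
class NotAssignedReplicaError(Exception):
    """Stand-in for kafka.common.NotAssignedReplicaException in this toy model."""


def record_follower_position(leader_id, assigned_replicas, follower_id, offset,
                             topic_partition):
    """Record a follower's fetch position, as the leader sees it.

    Assumption: the leader rejects any follower that is not in its own view
    of the partition's assigned replica set.
    """
    if follower_id not in assigned_replicas:
        assigned = ",".join(str(r) for r in sorted(assigned_replicas))
        raise NotAssignedReplicaError(
            f"Leader {leader_id} failed to record follower {follower_id}'s "
            f"position {offset} since the replica is not recognized to be one "
            f"of the assigned replicas {assigned} for partition {topic_partition}"
        )
    return {follower_id: offset}


if __name__ == "__main__":
    # Leader's view: only broker 1 is assigned. Follower 2 fetches anyway.
    try:
        record_follower_position(1, {1}, 2, 0, "[topic1,0]")
    except NotAssignedReplicaError as e:
        print(e)
```

If that reading is right, the two brokers hold different views of the replica assignment, so every fetch from broker 2 for this partition is rejected.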
At this point, every topic that had broker 1 as leader was not working, even
though ZK reported that everything was OK and in sync.
Restarting broker 1 fixed the broken topics for a bit, until broker 1 was
reassigned as leader of some topics, at which point it broke again.
Restarting broker 2 fixed it (!!!!).
We're using kafka_2.10-0.8.2.0 (Kafka 0.8.2.0 built for Scala 2.10). Could
anyone explain what happened, and (most importantly) how we stop it happening
again in the future?
Many thanks,
SimonC