Hi,

We've had an issue on a live system (3 brokers, ~10 topics, some replicated, 
some partitioned) where a partition wasn't properly reassigned, causing several 
other partitions to go down.

First, this exception happened on broker 1 (we weren't doing anything in 
particular on the system at the time):

ERROR [AddPartitionsListener on 1]: Error while handling add partitions for data path /brokers/topics/topic1 (kafka.controller.PartitionStateMachine$AddPartitionsListener)
java.util.NoSuchElementException: key not found: [topic1,0]
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
        at kafka.controller.ControllerContext$$anonfun$replicasForPartition$1.apply(KafkaController.scala:112)
        at kafka.controller.ControllerContext$$anonfun$replicasForPartition$1.apply(KafkaController.scala:111)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:47)
        at scala.collection.SetLike$class.map(SetLike.scala:93)
        at scala.collection.AbstractSet.map(Set.scala:47)
        at kafka.controller.ControllerContext.replicasForPartition(KafkaController.scala:111)
        at kafka.controller.KafkaController.onNewPartitionCreation(KafkaController.scala:485)
        at kafka.controller.PartitionStateMachine$AddPartitionsListener$$anonfun$handleDataChange$1.apply$mcV$sp(PartitionStateMachine.scala:530)
        at kafka.controller.PartitionStateMachine$AddPartitionsListener$$anonfun$handleDataChange$1.apply(PartitionStateMachine.scala:519)
        at kafka.controller.PartitionStateMachine$AddPartitionsListener$$anonfun$handleDataChange$1.apply(PartitionStateMachine.scala:519)
        at kafka.utils.Utils$.inLock(Utils.scala:535)
        at kafka.controller.PartitionStateMachine$AddPartitionsListener.handleDataChange(PartitionStateMachine.scala:518)
        at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:547)
        at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
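
From what we can tell, that exception is just the controller doing a plain map 
lookup for a partition that isn't in its in-memory replica-assignment cache. 
Here's a minimal Scala sketch of the mechanics (our own stand-in types, not 
Kafka's code):

import scala.collection.mutable

// Stand-in for kafka.common.TopicAndPartition; toString matches the
// "[topic1,0]" format seen in the error message.
case class TopicAndPartition(topic: String, partition: Int) {
  override def toString = "[%s,%d]".format(topic, partition)
}

object MissingKeyDemo extends App {
  // Controller-style cache of partition -> assigned replicas, with no
  // entry for partition 0 of topic1.
  val partitionReplicaAssignment = mutable.HashMap(
    TopicAndPartition("topic1", 1) -> Seq(1, 2, 3)
  )

  // mutable.HashMap.apply on an absent key throws
  // java.util.NoSuchElementException: key not found: [topic1,0]
  // which is exactly the error the controller logged.
  partitionReplicaAssignment(TopicAndPartition("topic1", 0))
}

So it looks as if the controller's cached view of topic1 had got out of step 
with what's registered in ZK, though we don't know how that happened.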

At this point, broker 2 started continually spamming these messages (mentioning 
other topics, not just topic1):

ERROR [ReplicaFetcherThread-0-1], Error for partition [othertopic1,2] to broker 1:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-1], Error for partition [othertopic2,0] to broker 1:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-1], Error for partition [othertopic3,0] to broker 1:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-1], Error for partition [topic1,0] to broker 1:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)

And broker 1 had these messages, but only for topic1:

ERROR [KafkaApi-1] error when handling request Name: FetchRequest; Version: 0; CorrelationId: 41182755; ClientId: ReplicaFetcherThread-0-1; ReplicaId: 2; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [topic1,0] -> PartitionFetchInfo(0,1048576) (kafka.server.KafkaApis)
kafka.common.NotAssignedReplicaException: Leader 1 failed to record follower 2's position 0 since the replica is not recognized to be one of the assigned replicas 1 for partition [topic1,0]
        at kafka.server.ReplicaManager.updateReplicaLEOAndPartitionHW(ReplicaManager.scala:574)
        at kafka.server.KafkaApis$$anonfun$recordFollowerLogEndOffsets$2.apply(KafkaApis.scala:388)
        at kafka.server.KafkaApis$$anonfun$recordFollowerLogEndOffsets$2.apply(KafkaApis.scala:386)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
        at kafka.server.KafkaApis.recordFollowerLogEndOffsets(KafkaApis.scala:386)
        at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:351)
        at kafka.server.KafkaApis.handle(KafkaApis.scala:60)
        at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:59)
        at java.lang.Thread.run(Thread.java:745)

At this point, any topic that had broker 1 as its leader was not working. ZK 
thought that everything was ok and in sync.
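
To be concrete about what we mean by ZK looking ok: the partition state znodes 
showed a leader and a full ISR for the affected partitions. Something along 
these lines, using the same ZkClient library that appears in the stack trace 
above, is enough to dump that state (the host name below is made up):

import org.I0Itec.zkclient.ZkClient
import org.I0Itec.zkclient.serialize.BytesPushThroughSerializer

object DumpPartitionState extends App {
  // zkhost:2181 is a placeholder; substitute the real ZooKeeper connect string.
  val zk = new ZkClient("zkhost:2181", 10000, 10000, new BytesPushThroughSerializer)
  try {
    // In 0.8.x the state znode holds JSON with the partition's current
    // leader, leader epoch and ISR.
    val raw = zk.readData[Array[Byte]]("/brokers/topics/topic1/partitions/0/state")
    println(new String(raw, "UTF-8"))
  } finally {
    zk.close()
  }
}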

Restarting broker 1 fixed the broken topics for a while, until broker 1 was 
reassigned as leader of some topics, at which point those topics broke again.
Restarting broker 2 fixed everything (!!!!).

We're using Kafka 0.8.2.0 (the Scala 2.10 build). Could anyone explain what 
happened and, most importantly, how we can stop it from happening again?

Many thanks,
SimonC
