Topic partitions randomly failed on live system

Simon Cooper Mon, 17 Aug 2015 04:32:03 -0700

Hi,

We've had an issue on a live system (3 brokers, ~10 topics, some replicated, 
some partitioned) where a partition wasn't properly reassigned, causing several 
other partitions to go down.


First, this exception happened on broker 1 (we weren't doing anything in 
particular on the system at the time):

ERROR [AddPartitionsListener on 1]: Error while handling add partitions for data path /brokers/topics/topic1 (kafka.controller.PartitionStateMachine$AddPartitionsListener)
java.util.NoSuchElementException: key not found: [topic1,0]
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
        at kafka.controller.ControllerContext$$anonfun$replicasForPartition$1.apply(KafkaController.scala:112)
        at kafka.controller.ControllerContext$$anonfun$replicasForPartition$1.apply(KafkaController.scala:111)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:47)
        at scala.collection.SetLike$class.map(SetLike.scala:93)
        at scala.collection.AbstractSet.map(Set.scala:47)
        at kafka.controller.ControllerContext.replicasForPartition(KafkaController.scala:111)
        at kafka.controller.KafkaController.onNewPartitionCreation(KafkaController.scala:485)
        at kafka.controller.PartitionStateMachine$AddPartitionsListener$$anonfun$handleDataChange$1.apply$mcV$sp(PartitionStateMachine.scala:530)
        at kafka.controller.PartitionStateMachine$AddPartitionsListener$$anonfun$handleDataChange$1.apply(PartitionStateMachine.scala:519)
        at kafka.controller.PartitionStateMachine$AddPartitionsListener$$anonfun$handleDataChange$1.apply(PartitionStateMachine.scala:519)
        at kafka.utils.Utils$.inLock(Utils.scala:535)
        at kafka.controller.PartitionStateMachine$AddPartitionsListener.handleDataChange(PartitionStateMachine.scala:518)
        at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:547)
        at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)

At this point, broker 2 started continually logging these messages (mentioning 
other topics, not just topic1):

ERROR [ReplicaFetcherThread-0-1], Error for partition [othertopic1,2] to broker 1:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-1], Error for partition [othertopic2,0] to broker 1:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-1], Error for partition [othertopic3,0] to broker 1:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-1], Error for partition [topic1,0] to broker 1:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)
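
For anyone triaging similar output, here is a minimal Python sketch that pulls the affected partitions out of ReplicaFetcherThread error lines like the ones above. The regular expression is based only on the lines quoted in this message; the function name and the sample text are illustrative, and other Kafka versions may format these lines differently.

```python
import re
from collections import defaultdict

# Matches the ReplicaFetcherThread error lines quoted above.
# \s+ is used between tokens so that lines wrapped by a mail client still match.
FETCH_ERROR = re.compile(
    r"ERROR \[ReplicaFetcherThread-\d+-\d+\],\s+"
    r"Error for partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\]\s+"
    r"to broker\s+(?P<broker>\d+):class\s+(?P<error>[\w.]+)"
)


def failing_partitions(log_text):
    """Group (topic, partition) pairs by the broker the fetch was sent to."""
    by_broker = defaultdict(set)
    for match in FETCH_ERROR.finditer(log_text):
        key = (match.group("topic"), int(match.group("partition")))
        by_broker[int(match.group("broker"))].add(key)
    return dict(by_broker)


# Two of the lines from this message, including the mail-client line wrap.
SAMPLE = (
    "ERROR [ReplicaFetcherThread-0-1], Error for partition [othertopic1,2] to broker\n"
    "1:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)\n"
    "ERROR [ReplicaFetcherThread-0-1], Error for partition [topic1,0] to broker\n"
    "1:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)\n"
)

if __name__ == "__main__":
    for broker, partitions in failing_partitions(SAMPLE).items():
        print(broker, sorted(partitions))
```

Grouping by target broker makes it easy to see whether every failing fetch points at the same leader, as it did here with broker 1.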

And broker 1 had these messages, but only for topic1:

ERROR [KafkaApi-1] error when handling request Name: FetchRequest; Version: 0; CorrelationId: 41182755; ClientId: ReplicaFetcherThread-0-1; ReplicaId: 2; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [topic1,0] -> PartitionFetchInfo(0,1048576) (kafka.server.KafkaApis)
kafka.common.NotAssignedReplicaException: Leader 1 failed to record follower 2's position 0 since the replica is not recognized to be one of the assigned replicas 1 for partition [topic1,0]
        at kafka.server.ReplicaManager.updateReplicaLEOAndPartitionHW(ReplicaManager.scala:574)
        at kafka.server.KafkaApis$$anonfun$recordFollowerLogEndOffsets$2.apply(KafkaApis.scala:388)
        at kafka.server.KafkaApis$$anonfun$recordFollowerLogEndOffsets$2.apply(KafkaApis.scala:386)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
        at kafka.server.KafkaApis.recordFollowerLogEndOffsets(KafkaApis.scala:386)
        at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:351)
        at kafka.server.KafkaApis.handle(KafkaApis.scala:60)
        at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:59)
        at java.lang.Thread.run(Thread.java:745)

At this time, any topics that had broker 1 as leader were not working. ZK 
reported that everything was OK and in sync.
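
A minimal sketch of the kind of consistency check that could be run against ZK's view, assuming the 0.8.x znode layout in which /brokers/topics/<topic> holds the replica assignment as JSON and /brokers/topics/<topic>/partitions/<n>/state holds the leader and ISR. The function name and the payloads below are hypothetical examples, not data from our cluster, and fetching the znodes is left out. Because this only inspects ZK, it may well pass when the inconsistency lives in broker memory, as appeared to be the case here.

```python
import json


def check_partition_state(topic_json, state_json, partition):
    """Return a list of inconsistencies between assignment and leader/ISR.

    topic_json: content of /brokers/topics/<topic> (assumed layout)
    state_json: content of /brokers/topics/<topic>/partitions/<n>/state
    """
    assigned = set(json.loads(topic_json)["partitions"][str(partition)])
    state = json.loads(state_json)
    problems = []
    if state["leader"] not in assigned:
        problems.append("leader %d is not an assigned replica" % state["leader"])
    stray = sorted(set(state["isr"]) - assigned)
    if stray:
        problems.append("ISR members not assigned: %s" % stray)
    return problems


# Hypothetical payloads: broker 2 is in the ISR but not in the assignment.
EXAMPLE_TOPIC = '{"version": 1, "partitions": {"0": [1]}}'
EXAMPLE_STATE = (
    '{"controller_epoch": 1, "leader": 1, "version": 1, '
    '"leader_epoch": 0, "isr": [1, 2]}'
)

if __name__ == "__main__":
    for problem in check_partition_state(EXAMPLE_TOPIC, EXAMPLE_STATE, 0):
        print(problem)
```

An empty result only means the two znodes agree with each other; it says nothing about what each broker has cached.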

Restarting broker 1 fixed the broken topics for a bit, until broker 1 was 
reassigned as leader of some topics, at which point it broke again.
Restarting broker 2 fixed it (!!!!).

We're using kafka-2.10.0_0.8.2.0. Could anyone explain what happened, and (most 
importantly) how we stop it happening again in the future?

Many thanks,
SimonC
