Hello, I have a three-broker Kafka setup (the broker ids are 1 and 2, both running Kafka 0.10.1.0, and 1001, running Kafka 0.10.0.0). After two of the brokers failed, many of the partitions now have the third broker (1001) as their leader. It looks like this:
Topic: userevents0.open  Partition: 5   Leader: 1     Replicas: 1,2,1001  Isr: 1,1001,2
Topic: userevents0.open  Partition: 6   Leader: 2     Replicas: 2,1,1001  Isr: 1,2,1001
Topic: userevents0.open  Partition: 7   Leader: 1001  Replicas: 1001,2,1  Isr: 1001
Topic: userevents0.open  Partition: 8   Leader: 1     Replicas: 1,1001,2  Isr: 1,1001,2
Topic: userevents0.open  Partition: 9   Leader: 1001  Replicas: 2,1001,1  Isr: 1001
Topic: userevents0.open  Partition: 10  Leader: 1001  Replicas: 1001,1,2  Isr: 1001

As you can see, only the partitions with leader 1 or 2 have successfully replicated. Brokers 1 and 2, however, are unable to fetch data from 1001. All of the partitions are available to consumers and producers, so everything works except replication. Broker 1001 is reachable from the other servers. I can't restart broker 1001, because it seems that would cause data loss (as you can see, it is the only in-sync replica for many partitions). Restarting the other brokers didn't help at all, and neither did simply waiting (this has been going on for three days now). So what do I do?
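For reference, here is the small helper I've been using to spot the stuck partitions in output like the above. It's just my own script, not anything from Kafka itself: it parses each `--describe` line and reports partitions whose ISR is smaller than the replica set, i.e. the ones that have not fully replicated.

```python
def parse_describe_line(line):
    """Parse one line of `kafka-topics.sh --describe` output into a dict.

    Input looks like:
    "Topic: userevents0.open Partition: 7 Leader: 1001 Replicas: 1001,2,1 Isr: 1001"
    """
    tokens = line.split()
    # Tokens alternate between "Key:" and its value; pair them up.
    return {key.rstrip(":"): value for key, value in zip(tokens[::2], tokens[1::2])}

def under_replicated(lines):
    """Return (partition, leader, replicas missing from the ISR) for each
    partition whose ISR does not cover the full replica set."""
    out = []
    for line in lines:
        fields = parse_describe_line(line)
        replicas = set(fields["Replicas"].split(","))
        isr = set(fields["Isr"].split(","))
        missing = replicas - isr
        if missing:
            out.append((int(fields["Partition"]), fields["Leader"], sorted(missing)))
    return out
```

Running it over the listing above flags exactly partitions 7, 9, and 10 (the ones led by 1001), each missing replicas 1 and 2 from the ISR.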
The logs of broker 2 (the one that tries to fetch data) are full of this:

[2016-12-22 16:38:52,199] WARN [ReplicaFetcherThread-0-1001], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@117a49bf (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 1001 was disconnected before the response was read
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
    at scala.Option.foreach(Option.scala:257)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
    at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
    at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
    at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
    at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)

The logs of broker 1001 are full of this:

[2016-12-22 16:38:54,226] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.nio.BufferUnderflowException
    at java.nio.Buffer.nextGetIndex(Buffer.java:506)
    at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:361)
    at kafka.api.FetchRequest$$anonfun$1$$anonfun$apply$1.apply(FetchRequest.scala:55)
    at kafka.api.FetchRequest$$anonfun$1$$anonfun$apply$1.apply(FetchRequest.scala:52)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.immutable.Range.foreach(Range.scala:160)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at kafka.api.FetchRequest$$anonfun$1.apply(FetchRequest.scala:52)
    at kafka.api.FetchRequest$$anonfun$1.apply(FetchRequest.scala:49)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.Range.foreach(Range.scala:160)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
    at kafka.api.FetchRequest$.readFrom(FetchRequest.scala:49)
    at kafka.network.RequestChannel$Request$$anonfun$2.apply(RequestChannel.scala:65)
    at kafka.network.RequestChannel$Request$$anonfun$2.apply(RequestChannel.scala:65)
    at kafka.network.RequestChannel$Request$$anonfun$4.apply(RequestChannel.scala:71)
    at kafka.network.RequestChannel$Request$$anonfun$4.apply(RequestChannel.scala:71)
    at scala.Option.map(Option.scala:146)
    at kafka.network.RequestChannel$Request.<init>(RequestChannel.scala:71)
    at kafka.network.Processor$$anonfun$processCompletedReceives$1.apply(SocketServer.scala:488)
    at kafka.network.Processor$$anonfun$processCompletedReceives$1.apply(SocketServer.scala:483)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at kafka.network.Processor.processCompletedReceives(SocketServer.scala:483)
    at kafka.network.Processor.run(SocketServer.scala:413)
    at java.lang.Thread.run(Thread.java:745)