Hello, I would like to get some help/advice on some issues I am having with my Kafka cluster.
I am running Kafka (kafka_2.11-0.10.1.0) on a 5-broker cluster (Ubuntu 16.04); the broker configuration is here: http://pastebin.com/cPch8Kd7

Today one of the 5 brokers (id: 1) appeared to disconnect from the others. Its log shows the following around that time:

[2016-12-28 16:18:30,575] INFO Partition [aki_reload5yl_5,11] on broker 1: Shrinking ISR for partition [aki_reload5yl_5,11] from 2,3,1 to 1 (kafka.cluster.Partition)
[2016-12-28 16:18:30,579] INFO Partition [ale_reload5yl_1,0] on broker 1: Shrinking ISR for partition [ale_reload5yl_1,0] from 5,1,2 to 1 (kafka.cluster.Partition)
[2016-12-28 16:18:30,580] INFO Partition [hl7_staging,17] on broker 1: Shrinking ISR for partition [hl7_staging,17] from 4,1,5 to 1 (kafka.cluster.Partition)
[2016-12-28 16:18:30,581] INFO Partition [hes_reload_5,37] on broker 1: Shrinking ISR for partition [hes_reload_5,37] from 1,2,5 to 1 (kafka.cluster.Partition)
[2016-12-28 16:18:30,582] INFO Partition [aki_live,38] on broker 1: Shrinking ISR for partition [aki_live,38] from 5,2,1 to 1 (kafka.cluster.Partition)
[2016-12-28 16:18:30,582] INFO Partition [hl7_live,51] on broker 1: Shrinking ISR for partition [hl7_live,51] from 1,3,4 to 1 (kafka.cluster.Partition)

The other brokers logged:

java.io.IOException: Connection to 1 was disconnected before the response was read
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
        at scala.Option.foreach(Option.scala:257)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
        at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
        at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
        at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
        at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)

While this was happening, the ConsumerOffsetChecker was reporting only a few of the 128 partitions configured for some of the topics, and consumers started crashing. I then used Kafka Manager to reassign partitions from broker 1 to the other brokers.
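For reference, these are roughly the checks involved (a sketch only; zk1:2181 and my_consumer_group are placeholders for our own ZooKeeper address and consumer group):

# list partitions whose ISR is smaller than their replica set
bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions

# the offset check mentioned above (ConsumerOffsetChecker; deprecated but still shipped in 0.10.1.0)
bin/kafka-consumer-offset-checker.sh --zookeeper zk1:2181 --group my_consumer_group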
I could then see the following errors in the kafka1 log:

[2016-12-28 17:23:51,816] ERROR [ReplicaFetcherThread-0-4], Error for partition [aki_live,86] to broker 4:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
[2016-12-28 17:23:51,817] ERROR [ReplicaFetcherThread-0-4], Error for partition [aki_live,21] to broker 4:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
[2016-12-28 17:23:51,817] ERROR [ReplicaFetcherThread-0-4], Error for partition [aki_live,126] to broker 4:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
[2016-12-28 17:23:51,818] ERROR [ReplicaFetcherThread-0-4], Error for partition [aki_live,6] to broker 4:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)

I thought I would restart broker 1, but as soon as I did, most of my topics ended up with some empty partitions, and their consumer offsets were wiped out completely. I understand that because of unclean.leader.election.enable = true an unclean leader would be elected, but why were the partitions wiped out if there were at least 3 replicas for each?

What do you think caused the disconnection in the first place, and how can I recover from situations like this in the future?

Regards
Alessandro

--
Alessandro De Maria
alessandro.dema...@gmail.com
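P.S. For future reference, this is the sort of safer broker configuration I am now considering, so that an out-of-sync replica cannot become leader and lose data (a sketch only; the values are illustrative and untested on our cluster):

# server.properties (sketch; values illustrative)
unclean.leader.election.enable=false   # prefer unavailability over data loss
min.insync.replicas=2                  # with producer acks=all, a write needs 2 in-sync replicas
controlled.shutdown.enable=true        # move leadership off a broker before it stops
replica.lag.time.max.ms=10000          # how long a follower may lag before it is dropped from the ISR (the default)

If I understand correctly, with unclean.leader.election.enable=false the partitions whose only in-sync replica was broker 1 would simply have stayed offline until it came back, instead of being truncated; please correct me if that is wrong.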