Hello,

I would like to get some help/advice on some issues I am having with my
Kafka cluster.

I am running Kafka (kafka_2.11-0.10.1.0) on a 5-broker cluster (Ubuntu
16.04).

The broker configuration is here: http://pastebin.com/cPch8Kd7
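
For convenience, these are the knobs that seem most relevant to what
follows, with what I believe are the 0.10.1.0 defaults (my actual values
are in the pastebin above):

replica.lag.time.max.ms=10000
zookeeper.session.timeout.ms=6000
unclean.leader.election.enable=true
min.insync.replicas=1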

Today one of the 5 brokers (id: 1) appeared to disconnect from the others.

Its log shows this around that time:
[2016-12-28 16:18:30,575] INFO Partition [aki_reload5yl_5,11] on broker 1:
Shrinking ISR for partition [aki_reload5yl_5,11] from 2,3,1 to 1
(kafka.cluster.Partition)
[2016-12-28 16:18:30,579] INFO Partition [ale_reload5yl_1,0] on broker 1:
Shrinking ISR for partition [ale_reload5yl_1,0] from 5,1,2 to 1
(kafka.cluster.Partition)
[2016-12-28 16:18:30,580] INFO Partition [hl7_staging,17] on broker 1:
Shrinking ISR for partition [hl7_staging,17] from 4,1,5 to 1
(kafka.cluster.Partition)
[2016-12-28 16:18:30,581] INFO Partition [hes_reload_5,37] on broker 1:
Shrinking ISR for partition [hes_reload_5,37] from 1,2,5 to 1
(kafka.cluster.Partition)
[2016-12-28 16:18:30,582] INFO Partition [aki_live,38] on broker 1:
Shrinking ISR for partition [aki_live,38] from 5,2,1 to 1
(kafka.cluster.Partition)
[2016-12-28 16:18:30,582] INFO Partition [hl7_live,51] on broker 1:
Shrinking ISR for partition [hl7_live,51] from 1,3,4 to 1
(kafka.cluster.Partition)
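
In hindsight, I could have checked how widespread the problem was with the
stock tooling, assuming I read the usage right (<zk-host> is a
placeholder):

bin/kafka-topics.sh --zookeeper <zk-host>:2181 --describe --under-replicated-partitions

This lists every partition whose ISR is currently smaller than its replica
set.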

The other brokers logged this:
java.io.IOException: Connection to 1 was disconnected before the response was read
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
        at scala.Option.foreach(Option.scala:257)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
        at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
        at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
        at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
        at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)


While this was happening, ConsumerOffsetChecker was reporting only a few
of the 128 partitions configured for some of the topics, and consumers
started crashing.
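
For reference, this is roughly the check I was running (<zk-host> and
<group> are placeholders):

bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zookeeper <zk-host>:2181 --group <group>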

I then used KafkaManager to reassign partitions from broker 1 to other
brokers.
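
(As far as I understand, the CLI equivalent of what KafkaManager drives is
kafka-reassign-partitions.sh with a JSON plan; the topic/replica values
below are made up just to show the shape:)

cat > reassign.json <<'EOF'
{"version":1,"partitions":[
  {"topic":"aki_live","partition":86,"replicas":[2,3,5]}
]}
EOF
bin/kafka-reassign-partitions.sh --zookeeper <zk-host>:2181 --reassignment-json-file reassign.json --execute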

I could then see the following errors in the kafka1 log:
[2016-12-28 17:23:51,816] ERROR [ReplicaFetcherThread-0-4], Error for
partition [aki_live,86] to broker
4:org.apache.kafka.common.errors.UnknownServerException: The server
experienced an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
[2016-12-28 17:23:51,817] ERROR [ReplicaFetcherThread-0-4], Error for
partition [aki_live,21] to broker
4:org.apache.kafka.common.errors.UnknownServerException: The server
experienced an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
[2016-12-28 17:23:51,817] ERROR [ReplicaFetcherThread-0-4], Error for
partition [aki_live,126] to broker
4:org.apache.kafka.common.errors.UnknownServerException: The server
experienced an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
[2016-12-28 17:23:51,818] ERROR [ReplicaFetcherThread-0-4], Error for
partition [aki_live,6] to broker
4:org.apache.kafka.common.errors.UnknownServerException: The server
experienced an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
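
(I assume the matching stack trace is in broker 4's own server.log around
the same timestamp; I plan to pull it out with something along these
lines, the log path being a placeholder:)

grep -A 20 '2016-12-28 17:23:5' /path/to/kafka/logs/server.log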


I thought I would restart broker 1, but as soon as I did, most of my
topics ended up with some empty partitions, and their consumer offsets
were wiped out completely.

I understand that because of unclean.leader.election.enable = true an
unclean leader would be elected, but why were the partitions wiped out if
there were at least 3 replicas for each?
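
To reduce the chance of this biting again, I am considering disabling
unclean election per topic, which I believe 0.10.1.0 supports via
kafka-configs.sh (<zk-host> and <topic> are placeholders):

bin/kafka-configs.sh --zookeeper <zk-host>:2181 --entity-type topics --entity-name <topic> --alter --add-config unclean.leader.election.enable=false

combined with min.insync.replicas=2 and acks=all on the producers, trading
some availability for durability. Does that sound right?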

What do you think caused the disconnection in the first place, and how can
I recover from situations like this in the future?

Regards
Alessandro

-- 
Alessandro De Maria
alessandro.dema...@gmail.com