Hey Neha,

so I removed another broker about 30 minutes ago, and since then the
producer has basically been dying with:

Event queue is full of unsent messages, could not send event:
KeyedMessage(my_topic,[B@1b71b7a6,[B@35fdd1e7)
kafka.common.QueueFullException: Event queue is full of unsent messages,
could not send event: KeyedMessage(my_topic,[B@1b71b7a6,[B@35fdd1e7)
at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Unknown Source)
~[kafka_2.10-0.8.0.jar:0.8.0]
at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Unknown Source)
~[kafka_2.10-0.8.0.jar:0.8.0]
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
~[scala-library-2.10.3.jar:na]
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
~[scala-library-2.10.3.jar:na]
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
~[scala-library-2.10.3.jar:na]
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
~[scala-library-2.10.3.jar:na]
at kafka.producer.Producer.asyncSend(Unknown Source)
~[kafka_2.10-0.8.0.jar:0.8.0]
at kafka.producer.Producer.send(Unknown Source)
~[kafka_2.10-0.8.0.jar:0.8.0]
at kafka.javaapi.producer.Producer.send(Unknown Source)
~[kafka_2.10-0.8.0.jar:0.8.0]

It seems like it cannot recover for some reason. It looks like new leaders
were elected, so the producer should have picked up the new metadata about
the partitions. Is this a known issue in 0.8.0? What should I be looking at
to debug/fix this?
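
In case it helps, the producer is the stock 0.8.0 async Java producer.
Below is a minimal sketch of how it is wired up; the broker list and
property values are illustrative, not our exact config:

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class AsyncProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Bootstrap brokers; topic metadata (leaders/partitions) is fetched from these.
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        // Async mode: messages are buffered in an in-memory queue before being sent.
        props.put("producer.type", "async");
        // Size of that queue; when it fills up and queue.enqueue.timeout.ms is 0,
        // send() throws kafka.common.QueueFullException (-1 would block instead).
        props.put("queue.buffering.max.messages", "10000");
        props.put("queue.enqueue.timeout.ms", "0");
        // Retries and backoff on a failed send; metadata is refreshed before each retry.
        props.put("message.send.max.retries", "3");
        props.put("retry.backoff.ms", "100");
        // Periodic metadata refresh, in addition to the refresh-on-failure above.
        props.put("topic.metadata.refresh.interval.ms", "600000");

        Producer<byte[], byte[]> producer =
                new Producer<byte[], byte[]>(new ProducerConfig(props));
        producer.send(new KeyedMessage<byte[], byte[]>(
                "my_topic", "key".getBytes(), "value".getBytes()));
        producer.close();
    }
}

My understanding is that queue.buffering.max.messages, queue.enqueue.timeout.ms
and the retry/metadata-refresh settings are the knobs that matter here, since a
failed send should trigger a metadata refresh before the next retry.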

Thanks,

On Tue, Oct 14, 2014 at 2:22 PM, Neha Narkhede <neha.narkh...@gmail.com>
wrote:

> Regarding (1), I am assuming that it is expected that brokers going down
> will be brought back up soon, at which point they will pick up from the
> current leader and get back into the ISR. Am I right?
>
> The broker will be added back to the ISR once it is restarted, but it never
> goes out of the replica list until the admin explicitly moves it using the
> reassign partitions tool.
>
> Regarding (2), I finally kicked off a reassign_partitions admin task adding
> broker 7 to the replicas list for partition 0, which finally fixed the
> under-replicated issue:
> Is it therefore expected that the user will fix up the under-replication
> situation?
>
> Yes. Currently, partition reassignment is purely an admin initiated task.
>
> Another thing I'd like to clarify is that for another topic Y, broker 5 was
> never removed from the ISR array. Note that Y is an unused topic so I am
> guessing that technically broker 5 is not out of sync... though it is still
> dead. Is this the expected behavior?
>
> Not really. After replica.lag.time.max.ms (which defaults to 10 seconds),
> the leader should remove the dead broker from the ISR.
>
> Thanks,
> Neha
>
> On Tue, Oct 14, 2014 at 9:27 AM, Jean-Pascal Billaud <j...@tellapart.com>
> wrote:
>
> > hey folks,
> >
> > I have been testing a Kafka cluster of 10 nodes on AWS using version
> > 2.8.0-0.8.0 and have seen some behavior on failover that I want to make
> > sure I understand.
> >
> > Initially, I have a topic X with 30 partitions and a replication factor of
> > 3. Looking at partition 0:
> > partition: 0 - leader: 5  preferred leader: 5  brokers: [5, 3, 4]  in-sync: [5, 3, 4]
> >
> > When I kill broker 5, the controller immediately grabs the next replica in
> > the ISR and assigns it as the leader:
> > partition: 0 - leader: 3  preferred leader: 5  brokers: [5, 3, 4]  in-sync: [3, 4]
> >
> > There are a couple of things at this point I would like to clarify:
> >
> > (1) Why is broker 5 still in the brokers array for partition 0? Note that
> > this broker array comes from a get of the zookeeper path
> > /brokers/topics/[topic] as documented.
> > (2) Partition 0 is now under-replicated and the controller does not seem
> > to do anything about it. Is this expected?
> >
> > Regarding (1), I am assuming that it is expected that brokers going down
> > will be brought back up soon, at which point they will pick up from the
> > current leader and get back into the ISR. Am I right?
> >
> > Regarding (2), I finally kicked off a reassign_partitions admin task adding
> > broker 7 to the replicas list for partition 0, which finally fixed the
> > under-replicated issue:
> >
> > partition: 0 - leader: 3  expected_leader: 3  brokers: [3, 4, 7]  in-sync: [3, 4, 7]
> >
> > Is it therefore expected that the user will fix up the under-replication
> > situation? Or is it again expected that broker 5 will come back soon,
> > making this whole thing a non-issue once that's true, given that
> > decommissioning brokers is not something supported as of the Kafka
> > version I am using?
> >
> > Another thing I'd like to clarify is that for another topic Y, broker 5 was
> > never removed from the ISR array. Note that Y is an unused topic so I am
> > guessing that technically broker 5 is not out of sync... though it is still
> > dead. Is this the expected behavior?
> >
> > I'd really appreciate it if somebody could confirm my understanding,
> >
> > Thanks,
> >
>
