Which version of Kafka are you using? The current stable one is 0.8.1.1.

On Tue, Oct 14, 2014 at 5:51 PM, Jean-Pascal Billaud <j...@tellapart.com> wrote:
> Hey Neha,
>
> so I removed another broker about 30 minutes ago and since then the
> Producer has been dying with:
>
> Event queue is full of unsent messages, could not send event:
> KeyedMessage(my_topic,[B@1b71b7a6,[B@35fdd1e7)
> kafka.common.QueueFullException: Event queue is full of unsent messages,
> could not send event: KeyedMessage(my_topic,[B@1b71b7a6,[B@35fdd1e7)
>         at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>         at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727) ~[scala-library-2.10.3.jar:na]
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ~[scala-library-2.10.3.jar:na]
>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) ~[scala-library-2.10.3.jar:na]
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:54) ~[scala-library-2.10.3.jar:na]
>         at kafka.producer.Producer.asyncSend(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>         at kafka.producer.Producer.send(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>         at kafka.javaapi.producer.Producer.send(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>
> It seems like it cannot recover for some reason. The new leaders were
> elected, so it should have picked up the new metadata about the
> partitions. Is this a known issue in 0.8.0? What should I be looking for
> to debug/fix this?
>
> Thanks,
>
> On Tue, Oct 14, 2014 at 2:22 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
> > Regarding (1), I am assuming that it is expected that brokers going down
> > will be brought back up soon. At which point, they will pick up from the
> > current leader and get back into the ISR. Am I right?
> >
> > The broker will be added back to the ISR once it is restarted, but it
> > never goes out of the replica list until the admin explicitly moves it
> > using the reassign partitions tool.
> >
> > Regarding (2), I finally kicked off a reassign_partitions admin task
> > adding broker 7 to the replicas list for partition 0, which finally
> > fixed the under-replicated issue:
> > Is it therefore expected that the user will fix up the
> > under-replication situation?
> >
> > Yes. Currently, partition reassignment is purely an admin-initiated task.
> >
> > Another thing I'd like to clarify is that for another topic Y, broker 5
> > was never removed from the ISR array. Note that Y is an unused topic, so
> > I am guessing that technically broker 5 is not out of sync... though it
> > is still dead. Is this the expected behavior?
> >
> > Not really. After replica.lag.time.max.ms (which defaults to 10
> > seconds), the leader should remove the dead broker from the ISR.
> >
> > Thanks,
> > Neha
> >
> > On Tue, Oct 14, 2014 at 9:27 AM, Jean-Pascal Billaud <j...@tellapart.com> wrote:
> >
> > > hey folks,
> > >
> > > I have been testing a kafka cluster of 10 nodes on AWS using version
> > > 2.8.0-0.8.0 and see some behavior on failover that I want to make
> > > sure I understand.
> > >
> > > Initially, I have a topic X with 30 partitions and a replication
> > > factor of 3. Looking at partition 0:
> > >
> > > partition: 0 - leader: 5 preferred leader: 5 brokers: [5, 3, 4] in-sync: [5, 3, 4]
> > >
> > > After killing broker 5, the controller immediately grabs the next
> > > replica in the ISR and assigns it as the leader:
> > >
> > > partition: 0 - leader: 3 preferred leader: 5 brokers: [5, 3, 4] in-sync: [3, 4]
> > >
> > > There are a couple of things at this point I would like to clarify:
> > >
> > > (1) Why is broker 5 still in the brokers array for partition 0?
> > > Note this broker array comes from a get of the zookeeper path
> > > /brokers/topics/[topic] as documented.
> > > (2) Partition 0 is now under-replicated and the controller does not
> > > seem to do anything about it. Is this expected?
> > >
> > > Regarding (1), I am assuming that it is expected that brokers going
> > > down will be brought back up soon. At which point, they will pick up
> > > from the current leader and get back into the ISR. Am I right?
> > >
> > > Regarding (2), I finally kicked off a reassign_partitions admin task
> > > adding broker 7 to the replicas list for partition 0, which finally
> > > fixed the under-replicated issue:
> > >
> > > partition: 0 - leader: 3 expected_leader: 3 brokers: [3, 4, 7] in-sync: [3, 4, 7]
> > >
> > > Is it therefore expected that the user will fix up the
> > > under-replication situation? Or maybe it is expected again that
> > > broker 5 will come back soon and this whole thing is a non-issue once
> > > that's true, given that decommissioning brokers is not something
> > > supported as of the kafka version I am using.
> > >
> > > Another thing I'd like to clarify is that for another topic Y, broker
> > > 5 was never removed from the ISR array. Note that Y is an unused
> > > topic, so I am guessing that technically broker 5 is not out of
> > > sync... though it is still dead. Is this the expected behavior?
> > >
> > > I'd really appreciate somebody confirming my understanding.
> > >
> > > Thanks,
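[Editor's note] The QueueFullException in the trace above is thrown by the old (0.8.x) async producer when its in-memory send queue fills up faster than the background sender thread can drain it, e.g. while the producer is still retrying against stale leader metadata after a broker is killed. A sketch of the producer config keys that govern this behavior; the values shown are illustrative, not the poster's actual settings:

```properties
# producer.properties -- illustrative values for the 0.8.x async producer
producer.type=async
# how many messages may buffer in the producer before send() blocks or fails
queue.buffering.max.messages=10000
# -1 blocks the caller when the queue is full; 0 throws QueueFullException
# immediately; a positive value waits that many ms, then throws
queue.enqueue.timeout.ms=-1
# per-message retries before the send is abandoned
message.send.max.retries=3
retry.backoff.ms=100
```

With queue.enqueue.timeout.ms=-1 the producer applies backpressure instead of throwing, which is often preferable during a leader failover at the cost of blocking the calling thread.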
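[Editor's note] The admin-initiated reassignment Neha refers to is driven by a JSON file describing the desired replica list per partition. Roughly along these lines in 0.8.1-era releases (topic name, broker ids, and the zk address are taken from or assumed for this thread; check the flags against your release, as the tool's options changed between 0.8.0 and 0.8.1):

```sh
cat > reassign.json <<'EOF'
{"version": 1,
 "partitions": [{"topic": "X", "partition": 0, "replicas": [3, 4, 7]}]}
EOF

bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
    --reassignment-json-file reassign.json --execute
```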
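[Editor's note] The brokers vs. in-sync lists Jean-Pascal prints can be compared mechanically. A minimal sketch in Python, assuming znode payloads shaped like what 0.8 stores under /brokers/topics/[topic] (replica assignment) and /brokers/topics/[topic]/partitions/[n]/state (leader and ISR); the values here are copied from the thread's example rather than read from a live cluster:

```python
import json

# Hypothetical znode payloads mirroring the thread's partition 0 example.
topic_znode = json.loads('{"version":1,"partitions":{"0":[5,3,4]}}')
state_znode = json.loads('{"leader":3,"leader_epoch":1,"version":1,"isr":[3,4]}')

def under_replicated(replicas, isr):
    """Brokers assigned to the partition but missing from the ISR."""
    return sorted(set(replicas) - set(isr))

missing = under_replicated(topic_znode["partitions"]["0"], state_znode["isr"])
print(missing)  # prints [5]: broker 5 is still assigned but not in sync
```

A partition is under-replicated exactly when this difference is non-empty, which matches what Jean-Pascal observed: killing broker 5 shrinks the ISR but leaves the assigned replica list untouched until an explicit reassignment.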