Is there a known issue in the 0.8.0 version that was fixed later on? What can I do to diagnose/fix the situation?
Yes, quite a few bugs related to this have been fixed since 0.8.0. I'd suggest upgrading to 0.8.1.1.

On Wed, Oct 15, 2014 at 11:09 PM, Jean-Pascal Billaud <j...@tellapart.com> wrote:

The only thing that I find very weird is the fact that brokers that are dead are still part of the ISR set for hours... and are basically not removed. Note this is not constantly the case; most of the dead brokers are properly removed, and it is really just in a few cases. I am not sure why this would happen. Is there a known issue in the 0.8.0 version that was fixed later on? What can I do to diagnose/fix the situation?

Thanks,

On Wed, Oct 15, 2014 at 9:58 AM, Jean-Pascal Billaud <j...@tellapart.com> wrote:

So I am using 0.8.0. I think I found the issue actually. It turns out that some partitions only had a single replica, and the leaders of those partitions would basically "refuse" new writes. As soon as I reassigned replicas to those partitions, things kicked off again. Not sure if that's expected... but that seemed to make the problem go away.

Thanks,

On Wed, Oct 15, 2014 at 6:46 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

Which version of Kafka are you using? The current stable one is 0.8.1.1.

On Tue, Oct 14, 2014 at 5:51 PM, Jean-Pascal Billaud <j...@tellapart.com> wrote:

Hey Neha,

so I removed another broker about 30 minutes ago, and since then the Producer is dying with:

    Event queue is full of unsent messages, could not send event:
    KeyedMessage(my_topic,[B@1b71b7a6,[B@35fdd1e7)
    kafka.common.QueueFullException: Event queue is full of unsent messages,
    could not send event: KeyedMessage(my_topic,[B@1b71b7a6,[B@35fdd1e7)
      at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
      at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
      at scala.collection.Iterator$class.foreach(Iterator.scala:727) ~[scala-library-2.10.3.jar:na]
      at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ~[scala-library-2.10.3.jar:na]
      at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) ~[scala-library-2.10.3.jar:na]
      at scala.collection.AbstractIterable.foreach(Iterable.scala:54) ~[scala-library-2.10.3.jar:na]
      at kafka.producer.Producer.asyncSend(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
      at kafka.producer.Producer.send(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
      at kafka.javaapi.producer.Producer.send(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]

It seems like it cannot recover for some reason. The new leaders were elected, it seems, so it should have picked up the new metadata information about the partitions. Is this something known from 0.8.0? What should I be looking for to debug/fix this?

Thanks,
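For context on the exception above: the 0.8 async producer buffers sends in a bounded in-memory queue and throws QueueFullException when that queue overflows, for example while partition leadership is moving. A minimal sketch of a producer configured to apply back-pressure instead of throwing; the broker addresses and topic are placeholders, and the values are illustrative rather than recommendations:

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class AsyncProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // placeholder
            props.put("serializer.class", "kafka.serializer.DefaultEncoder");
            props.put("producer.type", "async");
            // Let more messages buffer while leadership moves (default is 10000).
            props.put("queue.buffering.max.messages", "100000");
            // -1 = block the caller when the queue is full instead of
            // throwing QueueFullException; 0 = drop immediately.
            props.put("queue.enqueue.timeout.ms", "-1");
            // Retry failed sends; metadata is refreshed between retries.
            props.put("message.send.max.retries", "3");
            props.put("retry.backoff.ms", "500");

            Producer<byte[], byte[]> producer =
                new Producer<byte[], byte[]>(new ProducerConfig(props));
            producer.send(new KeyedMessage<byte[], byte[]>(
                "my_topic", "key".getBytes(), "value".getBytes()));
            producer.close();
        }
    }

Setting queue.enqueue.timeout.ms to -1 trades the exception for blocking on the calling thread until space frees up, and the retry settings give leader election a chance to finish before a send is abandoned.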
On Tue, Oct 14, 2014 at 2:22 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

> Regarding (1), I am assuming that it is expected that brokers going down will be brought back up soon. At which point, they will pick up from the current leader and get back into the ISR. Am I right?

The broker will be added back to the ISR once it is restarted, but it never goes out of the replica list until the admin explicitly moves it using the reassign partitions tool.

> Regarding (2), I finally kicked off a reassign_partitions admin task adding broker 7 to the replicas list for partition 0, which finally fixed the under-replicated issue. Is it therefore expected that the user will fix up the under-replication situation?

Yes. Currently, partition reassignment is purely an admin-initiated task.

> Another thing I'd like to clarify is that for another topic Y, broker 5 was never removed from the ISR array. Note that Y is an unused topic, so I am guessing that technically broker 5 is not out of sync... though it is still dead. Is this the expected behavior?

Not really. After replica.lag.time.max.ms (which defaults to 10 seconds), the leader should remove the dead broker from the ISR.

Thanks,
Neha
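For reference, the admin-initiated reassignment Neha mentions is driven by a JSON file passed to kafka-reassign-partitions.sh. A sketch using the 0.8.1-style flags (the 0.8.0 version of the tool takes slightly different options, and the ZooKeeper address is a placeholder), expanding partition 0 of topic X onto brokers 3, 4 and 7 as done later in this thread:

    # Describe the desired replica assignment for each partition
    cat > expand-replicas.json <<'EOF'
    {"version": 1,
     "partitions": [{"topic": "X", "partition": 0, "replicas": [3, 4, 7]}]}
    EOF

    # Hand it to the tool to start the reassignment
    bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
      --reassignment-json-file expand-replicas.json --execute

0.8.1 also has a --verify option to check whether the reassignment has completed.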
On Tue, Oct 14, 2014 at 9:27 AM, Jean-Pascal Billaud <j...@tellapart.com> wrote:

hey folks,

I have been testing a Kafka cluster of 10 nodes on AWS using version 2.8.0-0.8.0 and see some behavior on failover that I want to make sure I understand.

Initially, I have a topic X with 30 partitions and a replication factor of 3. Looking at partition 0:

    partition: 0 - leader: 5 preferred leader: 5 brokers: [5, 3, 4] in-sync: [5, 3, 4]

After killing broker 5, the controller immediately grabs the next replica in the ISR and assigns it as leader:

    partition: 0 - leader: 3 preferred leader: 5 brokers: [5, 3, 4] in-sync: [3, 4]

There are a couple of things at this point I would like to clarify:

(1) Why is broker 5 still in the brokers array for partition 0? Note this broker array comes from a get of the zookeeper path /brokers/topics/[topic] as documented.
(2) Partition 0 is now under-replicated and the controller does not seem to do anything about it. Is this expected?

Regarding (1), I am assuming that it is expected that brokers going down will be brought back up soon. At which point, they will pick up from the current leader and get back into the ISR. Am I right?

Regarding (2), I finally kicked off a reassign_partitions admin task adding broker 7 to the replicas list for partition 0, which finally fixed the under-replicated issue:

    partition: 0 - leader: 3 expected_leader: 3 brokers: [3, 4, 7] in-sync: [3, 4, 7]

Is it therefore expected that the user will fix up the under-replication situation? Or maybe it is expected again that broker 5 will come back soon, and this whole thing is a non-issue once that's true, given that decommissioning brokers is not something supported as of the Kafka version I am using.

Another thing I'd like to clarify is that for another topic Y, broker 5 was never removed from the ISR array. Note that Y is an unused topic, so I am guessing that technically broker 5 is not out of sync... though it is still dead. Is this the expected behavior?

I'd really appreciate somebody confirming my understanding.

Thanks,
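On question (1), it may help to see that the replica assignment and the ISR live in separate znodes, which is why a dead broker lingers in the brokers array: the assignment is static while leader/ISR state is dynamic. Illustrative contents from zookeeper-shell for topic X (the epoch values here are made up for the example):

    # Static assignment: written at topic creation, changed only by reassignment
    get /brokers/topics/X
    {"version":1,"partitions":{"0":[5,3,4]}}

    # Dynamic state: maintained by the controller and the partition leader
    get /brokers/topics/X/partitions/0/state
    {"controller_epoch":2,"leader":3,"leader_epoch":1,"isr":[3,4],"version":1}

After broker 5 dies it drops out of isr in the state znode, but it stays in the static assignment until an explicit reassignment, matching the behavior observed above.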