Is there a known issue in the 0.8.0 version that was fixed later on? What can I do to diagnose/fix the situation?
Yes, quite a few bugs related to this have been fixed since 0.8.0. I'd suggest upgrading to 0.8.1.1.

On Wed, Oct 15, 2014 at 11:09 PM, Jean-Pascal Billaud <j...@tellapart.com> wrote:

The only thing that I find very weird is the fact that brokers that are dead are still part of the ISR set for hours... and are basically not removed. Note this is not constantly the case; most of the dead brokers are properly removed, and it is really just in a few cases. I am not sure why this would happen. Is there a known issue in the 0.8.0 version that was fixed later on? What can I do to diagnose/fix the situation?

Thanks,

On Wed, Oct 15, 2014 at 9:58 AM, Jean-Pascal Billaud <j...@tellapart.com> wrote:

So I am using 0.8.0. I think I found the issue actually. It turns out that some partitions only had a single replica, and the leaders of those partitions would basically "refuse" new writes. As soon as I reassigned replicas to those partitions, things kicked off again. Not sure if that's expected... but that seemed to make the problem go away.

Thanks,

On Wed, Oct 15, 2014 at 6:46 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

Which version of Kafka are you using? The current stable one is 0.8.1.1.

On Tue, Oct 14, 2014 at 5:51 PM, Jean-Pascal Billaud <j...@tellapart.com> wrote:

Hey Neha,

so I removed another broker about 30 minutes ago, and since then the Producer is dying with:

    Event queue is full of unsent messages, could not send event:
    KeyedMessage(my_topic,[B@1b71b7a6,[B@35fdd1e7)
    kafka.common.QueueFullException: Event queue is full of unsent messages,
    could not send event: KeyedMessage(my_topic,[B@1b71b7a6,[B@35fdd1e7)
      at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
      at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
      at scala.collection.Iterator$class.foreach(Iterator.scala:727) ~[scala-library-2.10.3.jar:na]
      at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ~[scala-library-2.10.3.jar:na]
      at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) ~[scala-library-2.10.3.jar:na]
      at scala.collection.AbstractIterable.foreach(Iterable.scala:54) ~[scala-library-2.10.3.jar:na]
      at kafka.producer.Producer.asyncSend(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
      at kafka.producer.Producer.send(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
      at kafka.javaapi.producer.Producer.send(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]

It seems like it cannot recover for some reason. The new leaders were elected, it seems, so it should have picked up the new metadata information about the partitions. Is this something known from 0.8.0? What should I be looking for to debug/fix this?

Thanks,
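For context on the exception above: the 0.8 async producer buffers sends in a bounded in-memory queue and throws QueueFullException when that queue overflows, for example while partition leadership is moving. A minimal sketch of a producer configured to apply back-pressure instead of throwing; the broker addresses and topic are placeholders, and the values are illustrative rather than recommendations:

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class AsyncProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // placeholder
            props.put("serializer.class", "kafka.serializer.DefaultEncoder");
            props.put("producer.type", "async");
            // Let more messages buffer while leadership moves (default is 10000).
            props.put("queue.buffering.max.messages", "100000");
            // -1 = block the caller when the queue is full instead of
            // throwing QueueFullException; 0 = drop immediately.
            props.put("queue.enqueue.timeout.ms", "-1");
            // Retry failed sends; metadata is refreshed between retries.
            props.put("message.send.max.retries", "3");
            props.put("retry.backoff.ms", "500");

            Producer<byte[], byte[]> producer =
                new Producer<byte[], byte[]>(new ProducerConfig(props));
            producer.send(new KeyedMessage<byte[], byte[]>(
                "my_topic", "key".getBytes(), "value".getBytes()));
            producer.close();
        }
    }

Setting queue.enqueue.timeout.ms to -1 trades the exception for blocking on the calling thread until space frees up, and the retry settings give leader election a chance to finish before a send is abandoned.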
On Tue, Oct 14, 2014 at 2:22 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

> Regarding (1), I am assuming that it is expected that brokers going down will be brought back up soon. At which point, they will pick up from the current leader and get back into the ISR. Am I right?

The broker will be added back to the ISR once it is restarted, but it never goes out of the replica list until the admin explicitly moves it using the reassign partitions tool.

> Regarding (2), I finally kicked off a reassign_partitions admin task adding broker 7 to the replicas list for partition 0, which finally fixed the under-replicated issue. Is it therefore expected that the user will fix up the under-replication situation?

Yes. Currently, partition reassignment is purely an admin-initiated task.

> Another thing I'd like to clarify is that for another topic Y, broker 5 was never removed from the ISR array. Note that Y is an unused topic, so I am guessing that technically broker 5 is not out of sync... though it is still dead. Is this the expected behavior?

Not really. After replica.lag.time.max.ms (which defaults to 10 seconds), the leader should remove the dead broker from the ISR.

Thanks,
Neha
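For reference, the admin-initiated reassignment Neha mentions is driven by a JSON file passed to kafka-reassign-partitions.sh. A sketch using the 0.8.1-style flags (the 0.8.0 version of the tool takes slightly different options, and the ZooKeeper address is a placeholder), expanding partition 0 of topic X onto brokers 3, 4 and 7 as done later in this thread:

    # Describe the desired replica assignment for each partition
    cat > expand-replicas.json <<'EOF'
    {"version": 1,
     "partitions": [{"topic": "X", "partition": 0, "replicas": [3, 4, 7]}]}
    EOF

    # Hand it to the tool to start the reassignment
    bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
      --reassignment-json-file expand-replicas.json --execute

0.8.1 also has a --verify option to check whether the reassignment has completed.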
On Tue, Oct 14, 2014 at 9:27 AM, Jean-Pascal Billaud <j...@tellapart.com> wrote:

hey folks,

I have been testing a Kafka cluster of 10 nodes on AWS using version 2.8.0-0.8.0 and see some behavior on failover that I want to make sure I understand.

Initially, I have a topic X with 30 partitions and a replication factor of 3. Looking at partition 0:

    partition: 0 - leader: 5 preferred leader: 5 brokers: [5, 3, 4] in-sync: [5, 3, 4]

After killing broker 5, the controller immediately grabs the next replica in the ISR and assigns it as leader:

    partition: 0 - leader: 3 preferred leader: 5 brokers: [5, 3, 4] in-sync: [3, 4]

There are a couple of things at this point I would like to clarify:

(1) Why is broker 5 still in the brokers array for partition 0? Note this broker array comes from a get of the zookeeper path /brokers/topics/[topic] as documented.
(2) Partition 0 is now under-replicated and the controller does not seem to do anything about it. Is this expected?

Regarding (1), I am assuming that it is expected that brokers going down will be brought back up soon. At which point, they will pick up from the current leader and get back into the ISR. Am I right?

Regarding (2), I finally kicked off a reassign_partitions admin task adding broker 7 to the replicas list for partition 0, which finally fixed the under-replicated issue:

    partition: 0 - leader: 3 expected_leader: 3 brokers: [3, 4, 7] in-sync: [3, 4, 7]

Is it therefore expected that the user will fix up the under-replication situation? Or maybe it is expected again that broker 5 will come back soon, and this whole thing is a non-issue once that's true, given that decommissioning brokers is not something supported as of the Kafka version I am using.

Another thing I'd like to clarify is that for another topic Y, broker 5 was never removed from the ISR array. Note that Y is an unused topic, so I am guessing that technically broker 5 is not out of sync... though it is still dead. Is this the expected behavior?

I'd really appreciate somebody confirming my understanding.

Thanks,
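On question (1), it may help to see that the replica assignment and the ISR live in separate znodes, which is why a dead broker lingers in the brokers array: the assignment is static while leader/ISR state is dynamic. Illustrative contents from zookeeper-shell for topic X (the epoch values here are made up for the example):

    # Static assignment: written at topic creation, changed only by reassignment
    get /brokers/topics/X
    {"version":1,"partitions":{"0":[5,3,4]}}

    # Dynamic state: maintained by the controller and the partition leader
    get /brokers/topics/X/partitions/0/state
    {"controller_epoch":2,"leader":3,"leader_epoch":1,"isr":[3,4],"version":1}

After broker 5 dies it drops out of isr in the state znode, but it stays in the static assignment until an explicit reassignment, matching the behavior observed above.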