Which version of Kafka are you using? The current stable one is 0.8.1.1.

On Tue, Oct 14, 2014 at 5:51 PM, Jean-Pascal Billaud <j...@tellapart.com> wrote:
> Hey Neha,
>
> so I removed another broker about 30 minutes ago and since then the
> Producer has been dying with:
>
> Event queue is full of unsent messages, could not send event:
> KeyedMessage(my_topic,[B@1b71b7a6,[B@35fdd1e7)
> kafka.common.QueueFullException: Event queue is full of unsent messages,
> could not send event: KeyedMessage(my_topic,[B@1b71b7a6,[B@35fdd1e7)
>         at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>         at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727) ~[scala-library-2.10.3.jar:na]
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ~[scala-library-2.10.3.jar:na]
>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) ~[scala-library-2.10.3.jar:na]
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:54) ~[scala-library-2.10.3.jar:na]
>         at kafka.producer.Producer.asyncSend(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>         at kafka.producer.Producer.send(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>         at kafka.javaapi.producer.Producer.send(Unknown Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>
> It seems like it cannot recover for some reason. The new leaders were
> elected, so it should have picked up the new metadata about the
> partitions. Is this a known issue in 0.8.0? What should I be looking for
> to debug/fix this?
>
> Thanks,
>
> On Tue, Oct 14, 2014 at 2:22 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
> > Regarding (1), I am assuming that it is expected that brokers going down
> > will be brought back up soon. At which point, they will pick up from the
> > current leader and get back into the ISR. Am I right?
> >
> > The broker will be added back to the ISR once it is restarted, but it
> > never goes out of the replica list until the admin explicitly moves it
> > using the reassign partitions tool.
> >
> > Regarding (2), I finally kicked off a reassign_partitions admin task
> > adding broker 7 to the replicas list for partition 0, which finally
> > fixed the under-replicated issue:
> > Is it therefore expected that the user will fix up the
> > under-replication situation?
> >
> > Yes. Currently, partition reassignment is purely an admin-initiated task.
> >
> > Another thing I'd like to clarify is that for another topic Y, broker 5
> > was never removed from the ISR array. Note that Y is an unused topic, so
> > I am guessing that technically broker 5 is not out of sync... though it
> > is still dead. Is this the expected behavior?
> >
> > Not really. After replica.lag.time.max.ms (which defaults to 10
> > seconds), the leader should remove the dead broker from the ISR.
> >
> > Thanks,
> > Neha
> >
> > On Tue, Oct 14, 2014 at 9:27 AM, Jean-Pascal Billaud <j...@tellapart.com> wrote:
> >
> > > hey folks,
> > >
> > > I have been testing a kafka cluster of 10 nodes on AWS using version
> > > 2.8.0-0.8.0 and see some behavior on failover that I want to make
> > > sure I understand.
> > >
> > > Initially, I have a topic X with 30 partitions and a replication
> > > factor of 3. Looking at partition 0:
> > >
> > > partition: 0 - leader: 5 preferred leader: 5 brokers: [5, 3, 4] in-sync: [5, 3, 4]
> > >
> > > After killing broker 5, the controller immediately grabs the next
> > > replica in the ISR and assigns it as the leader:
> > >
> > > partition: 0 - leader: 3 preferred leader: 5 brokers: [5, 3, 4] in-sync: [3, 4]
> > >
> > > There are a couple of things at this point I would like to clarify:
> > >
> > > (1) Why is broker 5 still in the brokers array for partition 0?
> > > Note this broker array comes from a get of the zookeeper path
> > > /brokers/topics/[topic] as documented.
> > > (2) Partition 0 is now under-replicated and the controller does not
> > > seem to do anything about it. Is this expected?
> > >
> > > Regarding (1), I am assuming that it is expected that brokers going
> > > down will be brought back up soon. At which point, they will pick up
> > > from the current leader and get back into the ISR. Am I right?
> > >
> > > Regarding (2), I finally kicked off a reassign_partitions admin task
> > > adding broker 7 to the replicas list for partition 0, which finally
> > > fixed the under-replicated issue:
> > >
> > > partition: 0 - leader: 3 expected_leader: 3 brokers: [3, 4, 7] in-sync: [3, 4, 7]
> > >
> > > Is it therefore expected that the user will fix up the
> > > under-replication situation? Or maybe it is expected again that
> > > broker 5 will come back soon and this whole thing is a non-issue once
> > > that's true, given that decommissioning brokers is not something
> > > supported as of the kafka version I am using.
> > >
> > > Another thing I'd like to clarify is that for another topic Y, broker
> > > 5 was never removed from the ISR array. Note that Y is an unused
> > > topic, so I am guessing that technically broker 5 is not out of
> > > sync... though it is still dead. Is this the expected behavior?
> > >
> > > I'd really appreciate somebody confirming my understanding.
> > >
> > > Thanks,
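[Editor's note] The QueueFullException in the trace above is thrown by the old (0.8.x) async producer when its in-memory send queue fills up faster than the background sender thread can drain it, e.g. while the producer is still retrying against stale leader metadata after a broker is killed. A sketch of the producer config keys that govern this behavior; the values shown are illustrative, not the poster's actual settings:

```properties
# producer.properties -- illustrative values for the 0.8.x async producer
producer.type=async
# how many messages may buffer in the producer before send() blocks or fails
queue.buffering.max.messages=10000
# -1 blocks the caller when the queue is full; 0 throws QueueFullException
# immediately; a positive value waits that many ms, then throws
queue.enqueue.timeout.ms=-1
# per-message retries before the send is abandoned
message.send.max.retries=3
retry.backoff.ms=100
```

With queue.enqueue.timeout.ms=-1 the producer applies backpressure instead of throwing, which is often preferable during a leader failover at the cost of blocking the calling thread.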
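[Editor's note] The admin-initiated reassignment Neha refers to is driven by a JSON file describing the desired replica list per partition. Roughly along these lines in 0.8.1-era releases (topic name, broker ids, and the zk address are taken from or assumed for this thread; check the flags against your release, as the tool's options changed between 0.8.0 and 0.8.1):

```sh
cat > reassign.json <<'EOF'
{"version": 1,
 "partitions": [{"topic": "X", "partition": 0, "replicas": [3, 4, 7]}]}
EOF

bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
    --reassignment-json-file reassign.json --execute
```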
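[Editor's note] The brokers vs. in-sync lists Jean-Pascal prints can be compared mechanically. A minimal sketch in Python, assuming znode payloads shaped like what 0.8 stores under /brokers/topics/[topic] (replica assignment) and /brokers/topics/[topic]/partitions/[n]/state (leader and ISR); the values here are copied from the thread's example rather than read from a live cluster:

```python
import json

# Hypothetical znode payloads mirroring the thread's partition 0 example.
topic_znode = json.loads('{"version":1,"partitions":{"0":[5,3,4]}}')
state_znode = json.loads('{"leader":3,"leader_epoch":1,"version":1,"isr":[3,4]}')

def under_replicated(replicas, isr):
    """Brokers assigned to the partition but missing from the ISR."""
    return sorted(set(replicas) - set(isr))

missing = under_replicated(topic_znode["partitions"]["0"], state_znode["isr"])
print(missing)  # prints [5]: broker 5 is still assigned but not in sync
```

A partition is under-replicated exactly when this difference is non-empty, which matches what Jean-Pascal observed: killing broker 5 shrinks the ISR but leaves the assigned replica list untouched until an explicit reassignment.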