Can you paste the error log for each rebalance try? You may search for keyword "exception during rebalance".

On 2/26/15, 7:41 PM, "Ashwin Jayaprakash" <ashwin.jayaprak...@gmail.com> wrote:

>Just to give you some more debugging context, we noticed that the "consumers"
>path becomes empty after all the JVMs have exited because of this error.
>So, when we restart, there are no visible entries in ZK.
>
>On Thu, Feb 26, 2015 at 6:04 PM, Ashwin Jayaprakash <
>ashwin.jayaprak...@gmail.com> wrote:
>
>> Hello, we have a set of JVMs that consume messages from Kafka topics. Each
>> JVM creates 4 ConsumerConnectors that are used by 4 separate threads.
>> These JVMs also create and use the CuratorFramework's Path children cache
>> to watch and keep a sub-tree of the ZooKeeper in sync with other JVMs. This
>> path has several thousand children elements.
>>
>> Everything was working perfectly until one fine day we decided to restart
>> these JVMs. We restart these JVMs to roll in new code every few weeks or
>> so. We never had any problems until suddenly the Kafka consumers on these
>> JVMs were unable to rebalance partitions among themselves. We have bounced
>> these JVMs before with no issues.
>>
>> The exception:
>> Caused by: kafka.common.ConsumerRebalanceFailedException:
>> group1-system01-27422-kafka-787 can't rebalance after 12 retries
>>     at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:432)
>>     at kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(ZookeeperConsumerConnector.scala:722)
>>     at kafka.consumer.ZookeeperConsumerConnector$WildcardStreamsHandler.<init>(ZookeeperConsumerConnector.scala:756)
>>     at kafka.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:145)
>>     at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:96)
>>     at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:100)
>>
>> We then set rebalance.max.retries=16 and rebalance.backoff.ms=10000. I've
>> seen the Spark-Kafka issue
>> https://issues.apache.org/jira/browse/SPARK-5505 and Jun's recommendation
>> to increase the backoff property.
>>
>> We must've tried restarting these JVMs about 20 times now, both with and
>> without the "rebalance.xx" properties. Every time it is the same issue,
>> except for the first time we applied the "rebalance.backoff.ms=10000"
>> property, when all 4 JVMs started! We thought that solved everything and
>> then we tried restarting it just to make sure and then we were back to
>> square one.
>>
>> If we have only 1 thread create 1 ConsumerConnector instead of 4, it works.
>> This way we can have any number of JVMs running 1 ConsumerConnector and
>> they all behave well and rebalance partitions. It is only when we try to
>> start multiple ConsumerConnectors on the same JVM that this problem occurs.
>> I'd like to remind you that 4 ConsumerConnectors were working for several
>> months.
>> The ZK sub-tree for our non-Kafka part of the code was small when
>> we started.
>>
>> Does anybody have any thoughts on this? What could be causing this issue?
>> Could there be a Curator/ZK client conflict with the High level Kafka
>> consumer? Or is the number of nodes that we have on ZK from our code
>> causing problems with partition assignment in the Kafka code? Because the
>> Curator framework keeps syncing data in the background while the Kafka code
>> is creating ConsumerConnectors and rebalancing topics.
>>
>> Thanks,
>> Ashwin Jayaprakash.
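
For reference, a minimal sketch of the two rebalance settings quoted above, using only java.util.Properties so it runs without the Kafka jars. The class and method names are made up for illustration, and the zookeeper.session.timeout.ms value is an assumption (the thread does not state it; 6000 is what I recall as the old consumer's default). The check reflects the commonly cited rule of thumb that rebalance.max.retries * rebalance.backoff.ms should exceed the ZooKeeper session timeout, so it is a sanity check on the configuration, not a diagnosis of this failure.

```java
import java.util.Properties;

// Sketch only: collects the settings mentioned in the thread and estimates
// how long a consumer keeps retrying a rebalance before giving up.
class RebalanceSettings {

    static Properties consumerProps() {
        Properties props = new Properties();
        // Values quoted in the original message.
        props.setProperty("rebalance.max.retries", "16");
        props.setProperty("rebalance.backoff.ms", "10000");
        // Assumption: not stated in the thread; believed to be the default.
        props.setProperty("zookeeper.session.timeout.ms", "6000");
        return props;
    }

    // Approximate retry window: number of retries times the backoff between them.
    static long retryWindowMs(Properties props) {
        long retries = Long.parseLong(props.getProperty("rebalance.max.retries"));
        long backoffMs = Long.parseLong(props.getProperty("rebalance.backoff.ms"));
        return retries * backoffMs;
    }

    // True when the retry window is longer than the ZooKeeper session timeout.
    static boolean outlastsSessionTimeout(Properties props) {
        long sessionMs = Long.parseLong(props.getProperty("zookeeper.session.timeout.ms"));
        return retryWindowMs(props) > sessionMs;
    }

    public static void main(String[] args) {
        Properties props = consumerProps();
        System.out.println("retry window (ms): " + retryWindowMs(props));
        System.out.println("outlasts ZK session timeout: " + outlastsSessionTimeout(props));
    }
}
```

With the values from the thread the window is about 160 seconds, far longer than the assumed session timeout, which suggests the retry budget alone is unlikely to be the limiting factor here.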