Can you paste the error log for each rebalance attempt?
You can search for the keyword "exception during rebalance".
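
If grepping by hand is awkward, something like the following could pull those entries out of a consumer log. This is only a rough sketch: the class name `RebalanceLogFilter`, the log file path argument, and the choice of 20 trailing lines for the stack trace are all assumptions for illustration, not part of Kafka.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Kafka: extracts rebalance failures from a log.
class RebalanceLogFilter {
    // Keyword mentioned in this thread.
    static final String KEYWORD = "exception during rebalance";

    /** Returns every line containing KEYWORD plus up to `context` lines after it. */
    static List<String> matching(List<String> lines, int context) {
        List<String> out = new ArrayList<>();
        int remaining = 0; // trailing lines still to copy after the last match
        for (String line : lines) {
            if (line.contains(KEYWORD)) {
                out.add(line);
                remaining = context;
            } else if (remaining > 0) {
                out.add(line);
                remaining--;
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RebalanceLogFilter <path-to-consumer-log>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        // 20 trailing lines is an arbitrary guess at the stack trace length.
        for (String line : matching(lines, 20)) {
            System.out.println(line);
        }
    }
}
```

Running it once per JVM log should give one block per failed rebalance attempt, which is the part worth pasting here.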

On 2/26/15, 7:41 PM, "Ashwin Jayaprakash" <ashwin.jayaprak...@gmail.com>
wrote:

>Just to give you some more debugging context: we noticed that the
>"consumers" path becomes empty after all the JVMs have exited because of
>this error. So, when we restart, there are no visible entries in ZK.
>
>On Thu, Feb 26, 2015 at 6:04 PM, Ashwin Jayaprakash <
>ashwin.jayaprak...@gmail.com> wrote:
>
>> Hello, we have a set of JVMs that consume messages from Kafka topics.
>> Each JVM creates 4 ConsumerConnectors that are used by 4 separate
>> threads. These JVMs also create and use the CuratorFramework's
>> PathChildrenCache to watch a sub-tree of ZooKeeper and keep it in sync
>> with other JVMs. This path has several thousand child nodes.
>>
>> Everything was working perfectly until one fine day we decided to
>> restart these JVMs. We restart them to roll in new code every few weeks
>> or so, and we have bounced them before with no issues. This time,
>> suddenly, the Kafka consumers on these JVMs were unable to rebalance
>> partitions among themselves.
>>
>> The exception:
>> Caused by: kafka.common.ConsumerRebalanceFailedException: group1-system01-27422-kafka-787 can't rebalance after 12 retries
>>   at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:432)
>>   at kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(ZookeeperConsumerConnector.scala:722)
>>   at kafka.consumer.ZookeeperConsumerConnector$WildcardStreamsHandler.<init>(ZookeeperConsumerConnector.scala:756)
>>   at kafka.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:145)
>>   at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:96)
>>   at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:100)
>>
>> We then set rebalance.max.retries=16 and rebalance.backoff.ms=10000.
>> I've seen the Spark-Kafka issue
>> https://issues.apache.org/jira/browse/SPARK-5505 and Jun's
>> recommendation to increase the backoff property.
>>
>> We must have tried restarting these JVMs about 20 times now, both with
>> and without the "rebalance.xx" properties. Every time it is the same
>> issue, except for the first time we applied the
>> "rebalance.backoff.ms=10000" property, when all 4 JVMs started. We
>> thought that had solved everything, but when we restarted once more
>> just to make sure, we were back to square one.
>>
>> If we have only 1 thread create 1 ConsumerConnector instead of 4, it
>> works. This way we can have any number of JVMs running 1
>> ConsumerConnector each, and they all behave well and rebalance
>> partitions. The problem occurs only when we try to start multiple
>> ConsumerConnectors in the same JVM. I'd like to remind you that 4
>> ConsumerConnectors per JVM had been working for several months. The ZK
>> sub-tree for the non-Kafka part of our code was small when we started.
>>
>> Does anybody have any thoughts on this? What could be causing this
>> issue? Could there be a Curator/ZK client conflict with the high-level
>> Kafka consumer? Or is the number of nodes that our code has in ZK
>> causing problems with partition assignment in the Kafka code? We ask
>> because the Curator framework keeps syncing data in the background
>> while the Kafka code is creating ConsumerConnectors and rebalancing
>> topics.
>>
>> Thanks,
>> Ashwin Jayaprakash.
>>
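
For reference, the two rebalance settings quoted above could be put into consumer properties roughly as follows. This is a sketch under stated assumptions: the class `RebalanceSettings` and its methods are hypothetical, the ZooKeeper address is a placeholder, the group id is inferred from the consumer id in the stack trace, and the 6000 ms session timeout is an assumed value, not one reported in this thread. The check encodes a commonly cited rule of thumb that rebalance.max.retries * rebalance.backoff.ms should exceed zookeeper.session.timeout.ms, so that ephemeral nodes left by a stopped consumer can expire before the retries run out.

```java
import java.util.Properties;

// Hypothetical helper for illustration; only the property keys are Kafka's.
class RebalanceSettings {
    /** Builds high-level consumer properties; every value is illustrative. */
    static Properties consumerProps(int maxRetries, long backoffMs, long zkSessionTimeoutMs) {
        Properties p = new Properties();
        p.setProperty("zookeeper.connect", "localhost:2181"); // placeholder address
        p.setProperty("group.id", "group1"); // inferred from the consumer id in the trace
        p.setProperty("rebalance.max.retries", Integer.toString(maxRetries));
        p.setProperty("rebalance.backoff.ms", Long.toString(backoffMs));
        p.setProperty("zookeeper.session.timeout.ms", Long.toString(zkSessionTimeoutMs));
        return p;
    }

    /** Approximate total time spent backing off across all rebalance retries. */
    static long retryWindowMs(Properties p) {
        long retries = Long.parseLong(p.getProperty("rebalance.max.retries"));
        long backoff = Long.parseLong(p.getProperty("rebalance.backoff.ms"));
        return retries * backoff;
    }

    /** True if the retry window is longer than the ZooKeeper session timeout. */
    static boolean outlastsSessionTimeout(Properties p) {
        long timeout = Long.parseLong(p.getProperty("zookeeper.session.timeout.ms"));
        return retryWindowMs(p) > timeout;
    }

    public static void main(String[] args) {
        // 16 retries and 10000 ms backoff are the values from the thread;
        // the 6000 ms session timeout is an assumption.
        Properties p = consumerProps(16, 10000L, 6000L);
        System.out.println("retry window (ms): " + retryWindowMs(p));
        System.out.println("outlasts session timeout: " + outlastsSessionTimeout(p));
    }
}
```

With the values from the thread the retry window is already far longer than a session timeout of that size, which is consistent with the report that raising the backoff alone did not help.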
