Yep, the team here, including Ismael, pointed me in the right direction, which was much appreciated. :)
On Thu, Nov 9, 2017 at 10:02 AM, Viktor Somogyi <viktorsomo...@gmail.com> wrote:

> I'm happy that it's solved :)
>
> On Thu, Nov 9, 2017 at 3:32 PM, John Yost <hokiege...@gmail.com> wrote:
>
> > Excellent points, Viktor! Also, the reason I mistakenly went > 8 GB of
> > memory heap was the OOM errors that were thrown when I upgraded from
> > 0.9.0.1 to 0.10.0.0 and forgot to explicitly set the message format to
> > 0.9.0.1, which we needed in order to support the older clients and the
> > corresponding format. Once I set the message format to 0.9.0.1, the
> > memory requirements went WAY down, I reset the memory heap to 6 GB, and
> > our Kafka cluster has been awesome since.
> >
> > --John
> >
> > On Thu, Nov 9, 2017 at 9:09 AM, Viktor Somogyi <viktorsomo...@gmail.com>
> > wrote:
> >
> > > Hi Json,
> > >
> > > John might have a point. It is not reasonable to give the JVM that's
> > > running Kafka more than 6-8 GB of heap. One reason is GC time; the
> > > other is that Kafka relies heavily on the OS's in-memory caching of
> > > disk reads and writes.
> > > There were also a few synchronization bugs in 0.9 which caused similar
> > > problems. I would recommend upgrading to 1.0.0 if that is feasible.
> > >
> > > Viktor
> > >
> > > On Thu, Nov 9, 2017 at 2:59 PM, John Yost <hokiege...@gmail.com> wrote:
> > >
> > > > I've seen this before, and it was due to long GC pauses caused in
> > > > large part by a memory heap > 8 GB.
> > > >
> > > > --John
> > > >
> > > > On Thu, Nov 9, 2017 at 8:17 AM, Json Tu <kafka...@126.com> wrote:
> > > >
> > > > > Hi,
> > > > > We have a Kafka cluster made up of 6 brokers, each with 8 CPUs
> > > > > and 16 GB of memory, and about 1600 topics in the cluster, with
> > > > > roughly 1700 partitions' leaders and 1600 partitions' replicas on
> > > > > each broker.
> > > > > When we restart a normal broker, we find that 500+ partitions
> > > > > shrink and expand their ISR frequently during the restart; there
> > > > > are many logs like the ones below.
> > > > > [2017-11-09 17:05:51,173] INFO Partition [Yelp,5] on broker 4759726:
> > > > > Expanding ISR for partition [Yelp,5] from 4759726 to 4759726,4759750
> > > > > (kafka.cluster.Partition)
> > > > > [2017-11-09 17:06:22,047] INFO Partition [Yelp,5] on broker 4759726:
> > > > > Shrinking ISR for partition [Yelp,5] from 4759726,4759750 to 4759726
> > > > > (kafka.cluster.Partition)
> > > > > [2017-11-09 17:06:28,634] INFO Partition [Yelp,5] on broker 4759726:
> > > > > Expanding ISR for partition [Yelp,5] from 4759726 to 4759726,4759750
> > > > > (kafka.cluster.Partition)
> > > > > [2017-11-09 17:06:44,658] INFO Partition [Yelp,5] on broker 4759726:
> > > > > Shrinking ISR for partition [Yelp,5] from 4759726,4759750 to 4759726
> > > > > (kafka.cluster.Partition)
> > > > > [2017-11-09 17:06:47,611] INFO Partition [Yelp,5] on broker 4759726:
> > > > > Expanding ISR for partition [Yelp,5] from 4759726 to 4759726,4759750
> > > > > (kafka.cluster.Partition)
> > > > > [2017-11-09 17:07:19,703] INFO Partition [Yelp,5] on broker 4759726:
> > > > > Shrinking ISR for partition [Yelp,5] from 4759726,4759750 to 4759726
> > > > > (kafka.cluster.Partition)
> > > > > [2017-11-09 17:07:26,811] INFO Partition [Yelp,5] on broker 4759726:
> > > > > Expanding ISR for partition [Yelp,5] from 4759726 to 4759726,4759750
> > > > > (kafka.cluster.Partition)
> > > > > …
> > > > >
> > > > > The shrinking and expanding repeats after 30 minutes, which is the
> > > > > default value of leader.imbalance.check.interval.seconds, and at
> > > > > that time we can find the controller's auto-rebalance log, which
> > > > > can lead some partitions' leadership to move to this restarted
> > > > > broker.
> > > > > We see no shrinking and expanding while the cluster is running,
> > > > > only when we restart a broker, so replica.fetch.thread.num is 1
> > > > > and it seems to be enough.
> > > > > We can reproduce this at each restart. Can someone give some
> > > > > suggestions? Thanks in advance.
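
For readers finding this thread later: the heap and message-format advice
above boils down to two settings. A minimal sketch, assuming a broker
launched with the stock kafka-server-start.sh script; the values simply
echo the thread (6 GB heap, 0.9.0.1 on-disk message format) and are
illustrative rather than a general recommendation:

    # server.properties -- keep the on-disk message format at the old
    # version while 0.9.x clients still need to be supported
    log.message.format.version=0.9.0.1

    # shell -- heap for the broker JVM, picked up by kafka-server-start.sh;
    # the thread suggests staying in the 6-8 GB range and leaving the rest
    # of RAM to the OS page cache
    export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"

Keeping the format at the old version lets the broker serve 0.9.x consumers
without down-converting every fetch in JVM memory, which is presumably why
the memory requirements dropped so sharply once John set it.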
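
Likewise, for the restart behaviour in the original question, these are the
broker settings that govern ISR membership and the controller's periodic
preferred-leader rebalance. Again only a sketch for orientation, not a
proposed fix; the keys are standard broker configs and the values shown are
illustrative:

    # server.properties (illustrative values)

    # A follower is dropped from the ISR when it has not caught up to the
    # leader within this window; slow catch-up or GC pauses right after a
    # restart surface as the Shrinking/Expanding ISR messages quoted above
    replica.lag.time.max.ms=10000

    # Number of fetcher threads a broker uses to replicate from leaders
    # (presumably what "replica.fetch.thread.num is 1" above refers to)
    num.replica.fetchers=1

    # The controller's periodic preferred-leader rebalance, which the
    # original question ties to the shrink/expand recurring after a restart
    auto.leader.rebalance.enable=true
    leader.imbalance.check.interval.seconds=300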