Some level of performance degradation is expected for a short duration while a
broker is down. You will have to find the bottleneck; it could be IOPS,
network bandwidth, or some other resource.
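
While the broker catches up, it can help to watch disk and network utilisation
on that host and the shrinking list of under-replicated partitions. Something
along these lines (the ZooKeeper address is just a placeholder for your own):

    iostat -x 5      # per-disk utilisation and await, shows an IOPS/latency ceiling
    sar -n DEV 5     # per-interface throughput, shows a network ceiling
    kafka-topics.sh --zookeeper localhost:2181 --describe --under-replicated-partitions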

When a rebalance occurs it spikes the CPU, but you are saying the CPU usage
is dropping, which is quite strange to me. In my case, if I restart a broker
it takes 2-3 minutes to start on EBS and 1 minute on instance store. When a
broker starts it reads each and every available topic and partition on disk,
and only then does it finish starting. You can see high I/O during broker
start, and then the rebalance happens in the background, during which the
CPU spike is expected.
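
If the slow part is that log scan at startup, a couple of standard broker
settings in server.properties are worth a look; the values below are only
illustrative, not a recommendation for your workload:

    # more threads to load/recover log segments per data directory at startup (default 1)
    num.recovery.threads.per.data.dir=4
    # more replica fetcher threads so the broker catches up with leaders faster (default 1)
    num.replica.fetchers=4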

I was using instance store earlier and then switched to SSD, and it was all
fine. Same Kafka version, around 52k messages/sec at roughly 1 KB each, 8
brokers with replication factor 3. Each EC2 instance has up to 10 Gbps of
network bandwidth.
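
As a rough sanity check on those numbers: 52,000 msg/sec x 1 KB is about 51
MB/sec of producer traffic; with replication factor 3 that is roughly 150
MB/sec of total writes, or around 19 MB/sec per broker across 8 brokers, far
below a 10 Gbps (~1,250 MB/sec) NIC. In a setup like that the network is
unlikely to be the bottleneck, so the disks are the first place to look.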


On Wed, Sep 16, 2020 at 5:30 PM Miroslav Tsvetanov <tsvetanov...@gmail.com>
wrote:

> We are using EC2 EBS "Throughput Optimized HDD (st1)" volumes from AWS:
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html
> with 3 brokers and replication factor 3.
> There is no data loss; we simply accept only about 10% of the messages sent
> during this time period, while the rest are delayed and some of them can
> reach the timeout.
>
> On Wed, Sep 16, 2020 at 1:30 PM Ashutosh singh <getas...@gmail.com> wrote:
>
> > If the cluster is busy then it will have lots of data to rebalance once
> > the broker comes back online. What type is your underlying storage? Are
> > you using SSD?
> >
> > 5k msg/sec at an average size of 3 KB is about 15,000 KB/sec (~14.6
> > MB/sec). So if your broker is down for 10 minutes then roughly 8-9 GB of
> > data needs to be caught up, and again that will depend on the replication
> > factor.
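
(Expanding that arithmetic with the numbers from the question below: the
restarted broker reports about 2.2 GB/min of network-in, i.e. roughly 37
MB/sec of catch-up traffic. Fetching the ~8.8 GB missed during a 10-minute
outage at that rate takes about 240 seconds, which lines up with the ~4
minutes of degradation being reported.)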
> >
> >
> >
> >
> > On Wed, Sep 16, 2020 at 3:09 PM Miroslav Tsvetanov <tsvetanov...@gmail.com> wrote:
> >
> > > Hello All,
> > >
> > > We are running Kafka in production with 3 brokers, on Kafka version 2.1.1.
> > > We have noticed that when a Kafka broker has been stopped for more than
> > > 10 minutes and we start it again, we face a degradation of around 90%
> > > for up to 4 minutes after the start-up.
> > > During this period (of around 4 minutes) we observe CPU usage dropping
> > > from 22% to 2% on all of the brokers. Also, the broker which has just
> > > been started has network-out of 7 MB/min and network-in of 2.2 GB/min;
> > > on the other hand, the rest of the brokers have network-out of 1.1 GB/min
> > > and network-in of 55 MB/min.
> > > We assume that this is because the broker that has been stopped for more
> > > than 10 minutes must catch up with the messages that were processed
> > > during the time while it was stopped.
> > > The performance degradation persists until all 3 brokers become in-sync
> > > (we have min.insync.replicas=2 and a replication factor of 3).
> > >
> > > It is worth mentioning that we have ~5k messages per second with an
> > > average size of ~3 KB.
> > >
> > > We also tried increasing the brokers to 5 (rebalanced) and running with
> > > https://kafka.apache.org/081/documentation.html#prodconfig and still see
> > > ~35% performance degradation.
> > >
> > > Thanks in advance.
> > >
> > > Best regards,
> > > Miroslav
> > >
> >
> >
> > --
> > Thanx & Regard
> > Ashutosh Singh
> > 08151945559
> >
>


-- 
Thanx & Regard
Ashutosh Singh
08151945559
