It turned out to be an over-provisioned VM, and it was eventually solved by
moving the VM to another cluster. It was also not just a little slow, but
something on the order of 100 times slower. We are now looking for some
metrics to watch and alert on in case a broker gets slow again; a rough
sketch of the kind of check we have in mind is below.
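
For anyone running into the same thing, this is roughly the check we have in
mind, polling the broker's standard JMX beans (request time percentiles and
under-replicated partitions, the same beans the links below describe). It
assumes the broker is started with JMX enabled (JMX_PORT, 9999 here), and the
thresholds are just placeholders:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class BrokerSlownessCheck {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            // Assumes the broker was started with JMX enabled, e.g. JMX_PORT=9999.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();

                // 99th percentile of total time (ms) spent handling produce requests.
                double produceP99 = ((Number) mbs.getAttribute(
                        new ObjectName("kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce"),
                        "99thPercentile")).doubleValue();

                // Partitions whose replicas are not keeping up; staying above 0 is a bad sign.
                int underReplicated = ((Number) mbs.getAttribute(
                        new ObjectName("kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"),
                        "Value")).intValue();

                System.out.println("Produce TotalTimeMs p99 = " + produceP99
                        + ", UnderReplicatedPartitions = " + underReplicated);

                // Placeholder thresholds; tune for your own cluster.
                if (produceP99 > 1000.0 || underReplicated > 0) {
                    System.err.println("ALERT: broker " + host + " looks slow or unhealthy");
                }
            } finally {
                connector.close();
            }
        }
    }

In practice we would feed the same beans into something like jmxtrans or
Datadog rather than polling them by hand.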

On Fri, Sep 16, 2016 at 4:41 PM David Garcia <dav...@spiceworks.com> wrote:

> To remediate, you could start another broker, rebalance, and then shut
> down the busted broker.  But, you really should put some monitoring on your
> system (to help diagnose the actual problem).  Datadog has a pretty good
> set of articles for using jmx to do this:
> https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
>
> There are lots of jmx metrics gathering tools too…such as jmxtrans:
> https://github.com/jmxtrans/jmxtrans
>
> <confluent-plug>
> confluent also offers tooling (such as command center) to help with
> monitoring.
> </confluent-plug>
>
> As far as mirror maker goes, you can play with the consumer/producer
> timeout settings to make sure the process waits long enough for a slow
> machine.
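
These are the knobs I understand that to mean, sketched below as Java client
properties with placeholder values; the same keys go into the files passed to
mirror maker via --consumer.config and --producer.config:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class MirrorMakerTimeouts {
        public static void main(String[] args) {
            // Consumer side (--consumer.config): give a slow leader more time before
            // the group coordinator considers the consumer dead and rebalances.
            Properties consumer = new Properties();
            consumer.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");
            consumer.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "10000");
            // Should be larger than session.timeout.ms (newer clients enforce this).
            consumer.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "40000");

            // Producer side (--producer.config): wait longer for acks from a slow
            // broker instead of expiring the request and re-sending the batch.
            Properties producer = new Properties();
            producer.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "60000");
            producer.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "60000");

            consumer.list(System.out);
            producer.list(System.out);
        }
    }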
>
> -David
>
> On 9/16/16, 7:11 AM, "Gerard Klijs" <gerard.kl...@dizzit.com> wrote:
>
>     We just had an interesting issue; luckily it was only on our test
>     cluster. For some reason one of the machines in the cluster became
>     really slow. Because it was still alive, it was still the leader for
>     some topic-partitions. Our mirror maker reads and writes to multiple
>     topic-partitions on each thread. When committing the offsets, this will
>     fail for the topic-partitions located on the slow machine, because the
>     consumers have timed out. The data for these topic-partitions will be
>     sent over and over, causing a flood of duplicate messages.
>     What would be the best way to prevent this in the future? Is there some
>     way the broker could notice it's performing poorly and shut itself off,
>     for example?
