Hi Stanislav,

Thanks for the feedback and for sharing that discussion thread.

I read your KIP and its discussion thread too, and it seems like it would
cover the same motivation I had with the log-cleaner-thread-count metric.
That metric is supposed to report the number of threads still alive, which
might differ from the configured count (I could've used a better name :) ).
Now I'm thinking that uncleanable-bytes and uncleanable-partition-count,
together with time-since-last-run, would mostly cover the motivation of this
KIP. However, exposing the thread count (be it alive or dead) would still
add extra information about the failure, namely that a thread died during
cleanup.
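
(Just for illustration, here's a rough operator-side sketch in Scala of
reading those gauges over JMX. The JMX endpoint and the MBean names are my
assumptions and not necessarily the exact names the broker registers.)

import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

// Rough sketch: read broker-side gauges over JMX and print the cleaner
// health metrics. The URL and MBean names below are assumptions.
object CleanerHealthCheck {
  def readGauge(jmxUrl: String, mbean: String): Any = {
    val connector = JMXConnectorFactory.connect(new JMXServiceURL(jmxUrl))
    try {
      val conn = connector.getMBeanServerConnection
      conn.getAttribute(new ObjectName(mbean), "Value")
    } finally {
      connector.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical endpoint; metric names are assumptions, not verbatim.
    val url = "service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi"
    val idleMs = readGauge(url,
      "kafka.log:type=LogCleanerManager,name=time-since-last-run-ms")
    val uncleanable = readGauge(url,
      "kafka.log:type=LogCleanerManager,name=uncleanable-partition-count")
    println(s"time-since-last-run-ms=$idleMs " +
      s"uncleanable-partition-count=$uncleanable")
  }
}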

You had a very good idea: instead of the alive threads, expose the dead
ones! That way we wouldn't need log-cleaner-current-live-thread-rate, just a
"dead-log-cleaner-thread-count", and it would make it easy to trigger
warnings on it (if it's ever > 0 we can say there's a potential problem).
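
Just to make the idea concrete, something along these lines is what I have
in mind for the cleaner side (a minimal sketch only: the class, field and
metric names are placeholders, and it assumes the cleaner can hand over
references to its threads):

import com.yammer.metrics.core.Gauge
import kafka.metrics.KafkaMetricsGroup

// Minimal sketch: expose how many cleaner threads are no longer alive so
// that any value > 0 can trigger a warning. Names are placeholders.
class CleanerThreadMonitor(cleanerThreads: () => Seq[Thread])
  extends KafkaMetricsGroup {

  newGauge("DeadThreadCount", new Gauge[Int] {
    // Re-reads the current thread list every time the gauge is polled.
    override def value: Int = cleanerThreads().count(t => !t.isAlive)
  })
}
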
Doing this for the replica fetchers would be a bit harder though, as the
expected number of replica fetchers is (brokers-to-fetch-from *
fetchers-per-broker), and we don't really maintain that capacity information
or any kind of cluster information in the fetcher manager. I'm not sure we
should: it would add too much responsibility to the class and wouldn't be a
rock-solid solution. I'll have to look into it more.
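
One direction I'll check is whether the dead count could be derived purely
from the threads the manager already holds, without any capacity or cluster
information, roughly like the sketch below (the field and metric names are
illustrative, not the actual fetcher manager API, and whether dead threads
actually stay in that map is exactly what I'd need to verify):

import scala.collection.mutable
import com.yammer.metrics.core.Gauge
import kafka.metrics.KafkaMetricsGroup

// Illustrative sketch: count the fetcher threads the manager already keeps
// (keyed here by broker id and fetcher id) that are no longer alive,
// instead of tracking the expected capacity separately.
class FetcherManagerMetrics(fetcherThreads: mutable.Map[(Int, Int), Thread])
  extends KafkaMetricsGroup {

  newGauge("DeadFetcherThreadCount", new Gauge[Int] {
    override def value: Int = fetcherThreads.values.count(t => !t.isAlive)
  })
}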

I don't think that restarting the cleaner threads would be helpful, as the
problems I've seen are mostly non-recoverable and require manual user
intervention. I also partly agree with what Colin said on the KIP-346
discussion thread about the problems experienced with HDFS.

Best,
Viktor


On Fri, Feb 22, 2019 at 5:03 PM Stanislav Kozlovski <stanis...@confluent.io>
wrote:

> Hey Viktor,
>
> First off, thanks for the KIP! I think that it is almost always a good idea
> to have more metrics. Observability never hurts.
>
> In regards to the LogCleaner:
> * Do we need to know log-cleaner-thread-count? That should always be equal
> to "log.cleaner.threads" if I'm not mistaken.
> * log-cleaner-current-live-thread-rate -  We already have the
> "time-since-last-run-ms" metric which can let you know if something is
> wrong with the log cleaning
> As you said, we would like to have these two new metrics in order to
> understand when a partial failure has happened - e.g only 1/3 log cleaner
> threads are alive. I'm wondering if it may make more sense to either:
> a) restart the threads when they die
> b) add a metric which shows the dead thread count. You should probably
> always have a low-level alert in the case that any threads have died
>
> We had discussed a similar topic about thread revival and metrics in
> KIP-346. Have you had a chance to look over that discussion? Here is the
> mailing discussion for that -
>
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201807.mbox/%3ccanzzngyr_22go9swl67hedcm90xhvpyfzy_tezhz1mrizqk...@mail.gmail.com%3E
>
> Best,
> Stanislav
>
>
>
> On Fri, Feb 22, 2019 at 11:18 AM Viktor Somogyi-Vass <
> viktorsomo...@gmail.com> wrote:
>
> > Hi All,
> >
> > I'd like to start a discussion about exposing count gauge metrics for the
> > replica fetcher and log cleaner thread counts. It isn't a long KIP and
> the
> > motivation is very simple: monitoring the thread counts in these cases
> > would help with the investigation of various issues and might help in
> > preventing more serious issues when a broker is in a bad state. Such a
> > scenario that we've seen with users is that their disk fills up as the log
> > cleaner died for some reason and couldn't recover (like log corruption).
> In
> > this case an early warning would help in the root cause analysis process
> as
> > well as enable detecting and resolving the problem early on.
> >
> > The KIP is here:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-434%3A+Add+Replica+Fetcher+and+Log+Cleaner+Count+Metrics
> >
> > I'd be happy to receive any feedback on this.
> >
> > Regards,
> > Viktor
> >
>
>
> --
> Best,
> Stanislav
>
