Re: [DISCUSS] KIP-143: Controller Health Metrics

Joel Koshy Thu, 27 Apr 2017 11:39:42 -0700

Thanks for the KIP. A couple of comments:
- Do you intend to actually use yammer metrics, or to use kafka-metrics and
split the timer into an explicit rate and time? I think long term we ought
to move off yammer and use kafka-metrics only. Either is fine for now, but
we should ideally use only one in the long term, and I thought the plan
was to use kafka-metrics.
- metric #9 appears to be redundant since we already have per-API request
rate and time metrics.
- Same for metrics #4 and #5, as there are request stats for
DeleteTopicRequest (although it is possible for users to trigger deletes
via ZK).
- metrics #2 and #3 are potentially useful, but a histogram is a bit
overkill for them. The alternative is to stick to the last known value, but
that doesn't play well with alerts if a high value isn't reset/decayed.
Perhaps metric #1 would be sufficient to gauge slow start/resignation
transitions.
- metric #1 - some of the states may actually overlap
- I don't actually understand the semantics of metric #6. Is it the rate of
partition reassignment triggers? Does the number of partitions matter?
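
To make the "explicit rate and time" split concrete, here is a minimal
sketch of per-event-type stats that can be polled without blocking the
thread that processes events. It is plain JDK code, not yammer or
kafka-metrics, and every name in it (ControllerEventStats, record, the
threshold parameter, the event type strings) is hypothetical and chosen
only for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch, not Kafka code: tracks count, total time, and max time
// per event type. The event type plays the role of a metric tag.
final class ControllerEventStats {
    private static final class Stat {
        final LongAdder count = new LongAdder();      // basis for a rate
        final LongAdder totalNanos = new LongAdder(); // basis for a mean time
        final AtomicLong maxNanos = new AtomicLong(); // all-time max, never decays
    }

    private final ConcurrentHashMap<String, Stat> byType = new ConcurrentHashMap<>();
    private final long slowThresholdNanos;

    ControllerEventStats(long slowThresholdNanos) {
        this.slowThresholdNanos = slowThresholdNanos;
    }

    // Records one processed event. Returns true when the event took longer
    // than the threshold, so the caller can log at warning level.
    boolean record(String eventType, long elapsedNanos) {
        Stat s = byType.computeIfAbsent(eventType, k -> new Stat());
        s.count.increment();
        s.totalNanos.add(elapsedNanos);
        s.maxNanos.accumulateAndGet(elapsedNanos, Math::max);
        return elapsedNanos > slowThresholdNanos;
    }

    // The read methods below only read atomic counters.
    long count(String eventType) {
        Stat s = byType.get(eventType);
        return s == null ? 0L : s.count.sum();
    }

    double meanNanos(String eventType) {
        Stat s = byType.get(eventType);
        long n = s == null ? 0L : s.count.sum();
        return n == 0L ? 0.0 : (double) s.totalNanos.sum() / n;
    }

    long maxNanos(String eventType) {
        Stat s = byType.get(eventType);
        return s == null ? 0L : s.maxNanos.get();
    }
}
```

A poller would derive a rate from successive count() readings. Note that
the max here is never reset, which is exactly the alerting problem
mentioned above for last-known values, and that this sketch gives no
percentiles; a real metrics library would be needed for a windowed max or
a p99.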


Joel

On Thu, Apr 27, 2017 at 8:04 AM, Tom Crayford <[email protected]> wrote:

> Ismael,
>
> Great, that sounds lovely.
>
> I'd like a `Timer` (using yammer metrics parlance) over how long it took to
> process the event, so we can get at p99 and max times spent processing
> things. Maybe we could even log at warning level if event processing
> takes longer than some timeout?
>
> Thanks
>
> Tom
>
> On Thu, Apr 27, 2017 at 3:59 PM, Ismael Juma <[email protected]> wrote:
>
> > Hi Tom,
> >
> > Yes, the plan is to merge KAFKA-5028 first and then use a lock-free
> > approach for the new metrics. I considered mentioning that in the KIP
> > given KAFKA-5120, but didn't in the end. I'll add it to make it clear.
> >
> > Regarding locks, they are removed by KAFKA-5028, as you say. So, if I
> > understand correctly, you are suggesting an event processing rate metric
> > with event type as a tag? Onur and Jun, what do you think?
> >
> > Ismael
> >
> > On Thu, Apr 27, 2017 at 3:47 PM, Tom Crayford <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > We (Heroku) are very excited about this KIP, as we've struggled a bit
> > > with controller stability recently. Having these additional metrics
> > > would be wonderful.
> > >
> > > I'd like to ensure polling these metrics *doesn't* hold any locks etc,
> > > because, as noted in https://issues.apache.org/jira/browse/KAFKA-5120,
> > > that lock can be held for quite some time. This may no longer be an
> > > issue as of KAFKA-5028, though.
> > >
> > > Lastly, I'd love to see some metrics around how long the controller
> > > spends inside its lock. We've been tracking an issue
> > > (https://issues.apache.org/jira/browse/KAFKA-5116) where it can hold
> > > the lock for many, many minutes in a zk client listener thread when
> > > responding to a single request. I'm not sure how that plays into
> > > https://issues.apache.org/jira/browse/KAFKA-5028 (which I assume will
> > > land before this metrics patch), but it feels like there will be
> > > equivalent problems ("how long does it spend processing any individual
> > > message from the queue, broken down by message type").
> > >
> > > These are minor improvements though; the addition of more metrics to
> > > the controller is already going to be very helpful.
> > >
> > > Thanks
> > >
> > > Tom Crayford
> > > Heroku Kafka
> > >
> > > On Thu, Apr 27, 2017 at 3:10 PM, Ismael Juma <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > We've posted "KIP-143: Controller Health Metrics" for discussion:
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-143%3A+Controller+Health+Metrics
> > > >
> > > > Please take a look. Your feedback is appreciated.
> > > >
> > > > Thanks,
> > > > Ismael
> > > >
> > >
> >
>
