Thanks for the KIP - a couple of comments:

- Do you intend to actually use yammer metrics, or to use kafka-metrics and
  split the timer into an explicit rate and time? I think long term we ought
  to move off yammer and use kafka-metrics only. Actually, either is fine,
  but we should ideally use only one in the long term - and I thought the
  plan was to use kafka-metrics.
- Metric #9 appears to be redundant since we already have per-API request
  rate and time metrics.
- Same for metrics #4 and #5 (as there are request stats for
  DeleteTopicRequest - although it is possible for users to trigger deletes
  via ZK).
- Metrics #2 and #3 are potentially useful, but a histogram is a bit
  overkill. An alternative is to stick to the last known value, but that
  doesn't play well with alerts if a high value isn't reset/decayed. Perhaps
  metric #1 would be sufficient to gauge slow start/resignation transitions.
- Metric #1 - some of the states may actually overlap.
- I don't actually understand the semantics of metric #6. Is it the rate of
  partition reassignment triggers? Does the number of partitions matter?
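
To make the "split the timer into an explicit rate and time" idea concrete, here is a minimal, library-free sketch. It is not the yammer or kafka-metrics API: the class `EventTimer`, its method names, and the event type name "TopicChange" are all hypothetical, chosen only for illustration. It keeps one set of durations per event type (i.e. event type as a tag) and derives a rate from the count and p99/max from the recorded times.

```python
import math


class EventTimer:
    """Hypothetical per-event-type timer: a count (for rate) plus durations (for time)."""

    def __init__(self):
        # event type (the "tag") -> list of recorded durations in milliseconds
        self._durations_ms = {}

    def record(self, event_type, duration_ms):
        """Record how long one event of the given type took to process."""
        self._durations_ms.setdefault(event_type, []).append(duration_ms)

    def count(self, event_type):
        """Number of events recorded for this type (0 if none)."""
        return len(self._durations_ms.get(event_type, []))

    def rate_per_sec(self, event_type, window_sec):
        """The 'rate' half of the timer: events per second over a caller-supplied window."""
        return self.count(event_type) / window_sec

    def max_ms(self, event_type):
        """The slowest recorded event of this type."""
        return max(self._durations_ms[event_type])

    def percentile_ms(self, event_type, q):
        """The 'time' half of the timer: nearest-rank percentile, q in (0, 100]."""
        ordered = sorted(self._durations_ms[event_type])
        rank = max(1, math.ceil(q * len(ordered) / 100))
        return ordered[rank - 1]


if __name__ == "__main__":
    timer = EventTimer()
    for ms in range(1, 101):  # 100 made-up events taking 1..100 ms
        timer.record("TopicChange", ms)
    print(timer.rate_per_sec("TopicChange", window_sec=10))
    print(timer.percentile_ms("TopicChange", 99), timer.max_ms("TopicChange"))
```

Note that this sketch keeps every sample forever, which a real implementation would not do; a decaying or windowed reservoir is what avoids the "high value isn't reset/decayed" problem mentioned above.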

Joel

On Thu, Apr 27, 2017 at 8:04 AM, Tom Crayford <tcrayf...@heroku.com> wrote:

> Ismael,
>
> Great, that sounds lovely.
>
> I'd like a `Timer` (using yammer metrics parlance) over how long it took to
> process the event, so we can get at p99 and max times spent processing
> things. Maybe we could even do a log at warning level if event processing
> takes over some timeout?
>
> Thanks
>
> Tom
>
> On Thu, Apr 27, 2017 at 3:59 PM, Ismael Juma <ism...@juma.me.uk> wrote:
>
> > Hi Tom,
> >
> > Yes, the plan is to merge KAFKA-5028 first and then use a lock-free
> > approach for the new metrics. I considered mentioning that in the KIP
> > given KAFKA-5120, but didn't in the end. I'll add it to make it clear.
> >
> > Regarding locks, they are removed by KAFKA-5028, as you say. So, if I
> > understand correctly, you are suggesting an event processing rate metric
> > with event type as a tag? Onur and Jun, what do you think?
> >
> > Ismael
> >
> > On Thu, Apr 27, 2017 at 3:47 PM, Tom Crayford <tcrayf...@heroku.com>
> > wrote:
> >
> > > Hi,
> > >
> > > We (Heroku) are very excited about this KIP, as we've struggled a bit
> > > with controller stability recently. Having these additional metrics
> > > would be wonderful.
> > >
> > > I'd like to ensure polling these metrics *doesn't* hold any locks etc,
> > > because, as noted in https://issues.apache.org/jira/browse/KAFKA-5120,
> > > that lock can be held for quite some time. This may become not an issue
> > > as of KAFKA-5028 though.
> > >
> > > Lastly, I'd love to see some metrics around how long the controller
> > > spends inside its lock. We've been tracking an issue (
> > > https://issues.apache.org/jira/browse/KAFKA-5116) where it can hold the
> > > lock for many, many minutes in a zk client listener thread when
> > > responding to a single request.
> > > I'm not sure how that plays into
> > > https://issues.apache.org/jira/browse/KAFKA-5028 (which I assume will
> > > land before this metrics patch), but it feels like there will be
> > > equivalent problems ("how long does it spend processing any individual
> > > message from the queue, broken down by message type").
> > >
> > > These are minor improvements though, the addition of more metrics to
> > > the controller is already going to be very helpful.
> > >
> > > Thanks
> > >
> > > Tom Crayford
> > > Heroku Kafka
> > >
> > > On Thu, Apr 27, 2017 at 3:10 PM, Ismael Juma <ism...@juma.me.uk> wrote:
> > >
> > > > Hi all,
> > > >
> > > > We've posted "KIP-143: Controller Health Metrics" for discussion:
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-143%3A+Controller+Health+Metrics
> > > >
> > > > Please take a look. Your feedback is appreciated.
> > > >
> > > > Thanks,
> > > > Ismael