Re: [DISCUSS] KIP-1068: KIP-1068: New JMX Metrics for AsyncKafkaConsumer

Mickael Maison Wed, 24 Jul 2024 03:50:17 -0700

Hi,

3. Sorry I misunderstood something and got confused as the KIP
mentions AsyncKafkaConsumer as a new consumer implementation.
I now see that it's not an alternative to KafkaConsumer but just the
"backend" used with the new protocol. So yes any new metrics added are
public API and I agree AsyncKafkaConsumer should not be public.


Thanks,
Mickael

On Wed, Jul 24, 2024 at 12:27 PM Bruno Cadonna <cado...@apache.org> wrote:
>
> Hi Mickael,
>
> 1. I agree that the title is misleading. It should be something like
> "New metrics for the AsyncKafkaConsumer". Maybe it should even be
> "Metrics for the new consumer".
>
> 3. I am not sure I understand this comment. Exposed metrics are public
> as far as I understand. So adding new metrics requires a KIP. We had
> KIPs in the past that only introduced and/or removed new metrics. For
> example,
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1001%3A+Add+CurrentControllerId+Metric
> I do not think there are any plans to make the AsyncKafkaConsumer class
> public. It is the implementation of the KafkaConsumer when the CONSUMER
> protocol is used. But I might be wrong. At least, I am against making it
> public.
>
> Best,
> Bruno
>
> On 7/24/24 12:05 PM, Mickael Maison wrote:
> > Hi,
> >
> > 1. The title is a bit misleading. It's proposing to add new metrics,
> > JMX is just one of the mechanisms to export them.
> > 2. +1 to not register the new metrics when using the classic consumer,
> > instead of setting them to 0. Similarly I assume existing metrics that
> > don't apply to the new consumers are not registered?
> > 3. At the moment this KIP is not changing any public APIs. What's the
> > plan to make AsyncKafkaConsumer public?
> >
> > Thanks,
> > Mickael
> >
> >
> >
> > On Tue, Jul 23, 2024 at 6:03 PM Bruno Cadonna <cado...@apache.org> wrote:
> >>
> >> Hi Brenden,
> >>
> >> BC1. In his first e-mail Andrew wrote "I would expect that the metrics
> >> do not exist at all". I agree with him. I think it would be better to
> >> not add those metrics at all if the CLASSIC protocol is used rather than
> >> the metrics exist and are all constant 0. This should be possible by not
> >> adding the metrics to the metrics registry if the CONSUMER protocol is
> >> not used.
> >>
> >> BC2. Is there a specific reason you do not propose
> >> background-event-queue-time-max and background-event-queue-time-avg? If
> >> folk think those are not useful we do not need to add them. However, if
> >> those are not useful, why is background-event-queue-size useful. I was
> >> just wondering about the asymmetry between background-event-queue and
> >> application-event-queue.
> >>
> >> Best,
> >> Bruno
> >>
> >>
> >>
> >> On 7/19/24 9:14 PM, Brenden Deluna wrote:
> >>> Hi Apoorv,
> >>> Thank you for your comments, I will address each.
> >>>
> >>> AM1. I can see the usefulness in also having an
> >>> 'application-event-queue-age-max' to get an idea of outliers and how they
> >>> may be affecting the average metric. I will add that.
> >>>
> >>> AM2. I agree with you there, I think 'time' is a better descriptor here
> >>> than 'age'. I will update those metric names as well.
> >>>
> >>> AM3. Similar to above comments, I will change the name of that metric to 
> >>> be
> >>> more consistent. And I think a max metric would also be useful here, 
> >>> adding
> >>> that.
> >>>
> >>> AM4. Yes, good catch there. Will update that as well.
> >>>
> >>> Thank you,
> >>> Brenden
> >>>
> >>> On Fri, Jul 19, 2024 at 8:14 AM Apoorv Mittal <apoorvmitta...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Brendan,
> >>>> Thanks for the KIP. The metrics are always helpful.
> >>>>
> >>>> AM1: Is `application-event-queue-age-avg` enough or do we require `
> >>>> application-event-queue-age-max` as well to differentiate with outliers?
> >>>>
> >>>> AM2: The kafka producer defines metric `record-queue-time-avg` which
> >>>> captures the time spent in the buffer. Do you think we should have a
> >>>> similar name for `application-event-queue-age-avg` i.e. change to `
> >>>> application-event-queue-time-avg`? Moreover other than similar naming,
> >>>> `time` anyways seems more suitable than `age`, though minor. The `time`
> >>>> usage is also aligned with the description of this metric.
> >>>>
> >>>> AM3: Metric `application-event-processing-time` says "the average time,
> >>>> that the consumer network.....". Shall we have the `-avg` suffix in the
> >>>> metric as we have defined for other metrics? Also do we require the max
> >>>> metric as well for the same?
> >>>>
> >>>> AM4: Is the telemetry name for `unsent-requests-queue-size` intended
> >>>> as `org.apache.kafka.consumer.unsent.requests.size`,
> >>>> or it should be corrected to `
> >>>> org.apache.kafka.consumer.unsent.requests.queue.size`?
> >>>>
> >>>> AM2:
> >>>> Regards,
> >>>> Apoorv Mittal
> >>>> +44 7721681581
> >>>>
> >>>>
> >>>> On Mon, Jul 15, 2024 at 2:45 PM Andrew Schofield <
> >>>> andrew_schofi...@live.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Brenden,
> >>>>> Thanks for the updates.
> >>>>>
> >>>>> AS4. I see that you’ve added `.ms` to a bunch of the metrics reflecting
> >>>> the
> >>>>> fact that they’re measured in milliseconds. However, I observe that most
> >>>>> metrics
> >>>>> in Kafka that are measured in milliseconds, with some exceptions in 
> >>>>> Kafka
> >>>>> Connect
> >>>>> and MirrorMaker do not follow this convention. I would tend to err on 
> >>>>> the
> >>>>> side of
> >>>>> consistency with the existing metrics and not use `.ms`. However, that’s
> >>>>> just my
> >>>>> opinion, so I’d be interested to know what other reviewers of the KIP
> >>>>> think.
> >>>>>
> >>>>> Thanks,
> >>>>> Andrew
> >>>>>
> >>>>>> On 12 Jul 2024, at 20:11, Brenden Deluna <bdel...@confluent.io.INVALID
> >>>>>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hey Lianet,
> >>>>>>
> >>>>>> Thank you for your suggestions and feedback!
> >>>>>>
> >>>>>>
> >>>>>> LM1. This has now been addressed.
> >>>>>>
> >>>>>>
> >>>>>> LM2. I think that would be a valuable addition to the current set of
> >>>>>> metrics, I will get that added.
> >>>>>>
> >>>>>>
> >>>>>> LM3. Again great idea, that would certainly be helpful. Will add that
> >>>> as
> >>>>>> well.
> >>>>>>
> >>>>>>
> >>>>>> Let me know if you have any more suggestions!
> >>>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Brenden
> >>>>>>
> >>>>>> On Fri, Jul 12, 2024 at 2:11 PM Brenden Deluna <bdel...@confluent.io>
> >>>>> wrote:
> >>>>>>
> >>>>>>> Hi Lucas,
> >>>>>>>
> >>>>>>> Thank you for the feedback! I have addressed your comments:
> >>>>>>>
> >>>>>>>
> >>>>>>> LB1. Good catch there, I will update the names as needed.
> >>>>>>>
> >>>>>>>
> >>>>>>> LB2. Good catch again! I will update the name to be more consistent.
> >>>>>>>
> >>>>>>>
> >>>>>>> LB3. Thank you for pointing this out, I realized that all metric
> >>>> values
> >>>>>>> will actually be set to 0. I will specifiy this and explain why they
> >>>>> will
> >>>>>>> be 0.
> >>>>>>>
> >>>>>>>
> >>>>>>> Nit: This metric is referring to the queue of unsent requests in the
> >>>>>>> NetworkClientDelegate. For the metric descriptions I am trying to not
> >>>>>>> include too much of the implementation details, hence the reason that
> >>>>>>> description is quite short. I cannot think of other ways to describe
> >>>> the
> >>>>>>> metric without going deeper into the implementation, but please do let
> >>>>> me
> >>>>>>> know if you have any ideas.
> >>>>>>>
> >>>>>>>
> >>>>>>> Thank you,
> >>>>>>>
> >>>>>>> Brenden
> >>>>>>>
> >>>>>>> On Fri, Jul 12, 2024 at 1:27 PM Lianet M. <liane...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hey Brenden, thanks for the KIP! Great to get more visibility into
> >>>> the
> >>>>> new
> >>>>>>>> consumer.
> >>>>>>>>
> >>>>>>>> LM1. +1 on Lucas's suggestion for including the unit in the name,
> >>>> seems
> >>>>>>>> clearer and consistent (I do see several time metrics including ms)
> >>>>>>>>
> >>>>>>>> LM2. What about a new metric for application-event-queue-time-ms. It
> >>>>> would
> >>>>>>>> be a complement to the application-event-queue-size you're proposing,
> >>>>> and
> >>>>>>>> it will tell us how long the events sit in the queue, waiting to be
> >>>>>>>> processed (from the time the API call adds the event to the queue, to
> >>>>> the
> >>>>>>>> time it's processed in the background thread). I find it would be
> >>>> very
> >>>>>>>> interesting.
> >>>>>>>>
> >>>>>>>> LM3. Thinking about the actual usage of
> >>>>>>>> "time-between-network-thread-poll-xxx" metric, I imagine it would be
> >>>>>>>> helpful to know more about what could be impacting it. As I see it,
> >>>> the
> >>>>>>>> network thread cadence could be mainly impacted by: 1- app event
> >>>>>>>> processing
> >>>>>>>> (generate requests), 2- network client poll (actual send/receive).
> >>>> For
> >>>>> 2,
> >>>>>>>> the new consumer reuses the same component as the legacy one, but 1
> >>>> is
> >>>>>>>> specific to the new consumer, so what about a metric
> >>>>>>>> for application-event-processing-time-ms (we could consider avg I
> >>>> would
> >>>>>>>> say). It would be the time that the network thread takes to process
> >>>> all
> >>>>>>>> available events on each run.
> >>>>>>>>
> >>>>>>>> Cheers!
> >>>>>>>> Lianet
> >>>>>>>>
> >>>>>>>> On Fri, Jul 12, 2024 at 1:57 PM Lucas Brutschy
> >>>>>>>> <lbruts...@confluent.io.invalid> wrote:
> >>>>>>>>
> >>>>>>>>> Hey Brenden,
> >>>>>>>>>
> >>>>>>>>> thanks for the KIP! These will be great to observe and debug the
> >>>>>>>>> background thread of the new consumer.
> >>>>>>>>>
> >>>>>>>>> LB1. `time-between-network-thread-poll-max` → I see several similar
> >>>>>>>>> metrics including the unit in the metric name (ms or us). We could
> >>>>>>>>> consider this, although it's probably not strictly required.
> >>>> However,
> >>>>>>>>> at least the description should state the unit. Same for the `avg`
> >>>>>>>>> version.
> >>>>>>>>> LB2. `unsent-requests-size` → Naming sounds a bit like it's
> >>>> referring
> >>>>>>>>> to the size of the request. How about `unsent-request-queue-size` or
> >>>>>>>>> `unsent-request-count` or simply `unsent-requests`?
> >>>>>>>>> LB3. "the proposed metrics below will be set to null or 0." → which
> >>>>>>>>> one will be set to null and which ones will be set to 0, and why?
> >>>>>>>>>
> >>>>>>>>> nit: "The current number of unsent requests in the consumer
> >>>> network" →
> >>>>>>>>> Seems to be missing something?
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Lucas
> >>>>>>>>>
> >>>>>>>>> On Fri, Jul 12, 2024 at 7:28 PM Brenden Deluna
> >>>>>>>>> <bdel...@confluent.io.invalid> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Andrew,
> >>>>>>>>>> Thank you for the feedback and your question.
> >>>>>>>>>>
> >>>>>>>>>> AS1. Great idea, I will get that added.
> >>>>>>>>>>
> >>>>>>>>>> AS2. For unsent-events-age-max, age will be calculated once the
> >>>> event
> >>>>>>>> is
> >>>>>>>>>> sent, so you are correct.
> >>>>>>>>>>
> >>>>>>>>>> AS3. I agree, I think that would be a helpful metric to add, thank
> >>>>>>>> you! I
> >>>>>>>>>> will get that added.
> >>>>>>>>>>
> >>>>>>>>>> Please let me know if you have any additional feedback,
> >>>> suggestions,
> >>>>>>>> or
> >>>>>>>>>> questions.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Brenden
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Jul 12, 2024 at 11:45 AM Andrew Schofield <
> >>>>>>>>> andrew_schofi...@live.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Brenden,
> >>>>>>>>>>> Thanks for the KIP. It fills a gap in the metrics for the new
> >>>>>>>> consumer
> >>>>>>>>>>> nicely.
> >>>>>>>>>>>
> >>>>>>>>>>> AS1. If using the CLASSIC protocol, and thus the
> >>>>>>>> LegacyKafkaConsumer,
> >>>>>>>>>>> I would expect that the metrics do not exist at all. Maybe say
> >>>>>>>>> something
> >>>>>>>>>>> like
> >>>>>>>>>>> “These metrics are for the new consumer implementation using the
> >>>>>>>>>>> CONSUMER protocol”.
> >>>>>>>>>>>
> >>>>>>>>>>> AS2. For unsent-events-age-max, when is the age calculated? For
> >>>>>>>>> example,
> >>>>>>>>>>> is it calculated at the time that the unsent event is removed from
> >>>>>>>> the
> >>>>>>>>>>> list and sent, or does the metric reflect unsent events which are
> >>>>>>>> still
> >>>>>>>>>>> enqueued? I suspect the former, but thought I’d check.
> >>>>>>>>>>>
> >>>>>>>>>>> AS3. I think that unsent-events-age-avg would also be interesting
> >>>> to
> >>>>>>>>>>> get an idea of how long unsent events tend to sit around before
> >>>>>>>>> sending.
> >>>>>>>>>>> Of course, the question from AS2 would also apply to the average.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Andrew
> >>>>>>>>>>>
> >>>>>>>>>>>> On 10 Jul 2024, at 17:44, Philip Nee <p...@confluent.io.INVALID>
> >>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>
> >>>>>>>>>>>> This is the link to the KIP document.
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1068%3A+New+JMX+metrics+for+the+new+KafkaConsumer
> >>>>>>>>>>>>
> >>>>>>>>>>>> Any comment is appreciated,
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Jul 9, 2024 at 10:14 AM Brenden Deluna
> >>>>>>>>>>> <bdel...@confluent.io.invalid>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hello everyone,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I would like to start the discussion thread for KIP-1068. This
> >>>>>>>> is a
> >>>>>>>>>>>>> relatively small KIP, only proposing to add a couple of new
> >>>>>>>> metrics.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If you have any suggestions or feedback, let me know, it will be
> >>>>>>>>> much
> >>>>>>>>>>>>> appreciated.
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>

Re: [DISCUSS] KIP-1068: KIP-1068: New JMX Metrics for AsyncKafkaConsumer

Reply via email to