Re: [DISCUSS] KIP-1068: KIP-1068: New JMX Metrics for AsyncKafkaConsumer

Brenden Deluna Tue, 23 Jul 2024 13:01:48 -0700

Hi Bruno,
Thank you for your comments.

BC1. I think that is a great point, I was not aware that we could not add
the metrics based on which protocol is being used. I will update the KIP to
reflect that change.


BC2. There is not any specific reason for this, really it has just never
been suggested in this thread. I will add them to get some opinions on if
those would be useful and will go with the group consensus.

Thank you,
Brenden

On Tue, Jul 23, 2024 at 11:03 AM Bruno Cadonna <cado...@apache.org> wrote:

> Hi Brenden,
>
> BC1. In his first e-mail Andrew wrote "I would expect that the metrics
> do not exist at all". I agree with him. I think it would be better to
> not add those metrics at all if the CLASSIC protocol is used rather than
> the metrics exist and are all constant 0. This should be possible by not
> adding the metrics to the metrics registry if the CONSUMER protocol is
> not used.
>
> BC2. Is there a specific reason you do not propose
> background-event-queue-time-max and background-event-queue-time-avg? If
> folk think those are not useful we do not need to add them. However, if
> those are not useful, why is background-event-queue-size useful. I was
> just wondering about the asymmetry between background-event-queue and
> application-event-queue.
>
> Best,
> Bruno
>
>
>
> On 7/19/24 9:14 PM, Brenden Deluna wrote:
> > Hi Apoorv,
> > Thank you for your comments, I will address each.
> >
> > AM1. I can see the usefulness in also having an
> > 'application-event-queue-age-max' to get an idea of outliers and how they
> > may be affecting the average metric. I will add that.
> >
> > AM2. I agree with you there, I think 'time' is a better descriptor here
> > than 'age'. I will update those metric names as well.
> >
> > AM3. Similar to above comments, I will change the name of that metric to
> be
> > more consistent. And I think a max metric would also be useful here,
> adding
> > that.
> >
> > AM4. Yes, good catch there. Will update that as well.
> >
> > Thank you,
> > Brenden
> >
> > On Fri, Jul 19, 2024 at 8:14 AM Apoorv Mittal <apoorvmitta...@gmail.com>
> > wrote:
> >
> >> Hi Brendan,
> >> Thanks for the KIP. The metrics are always helpful.
> >>
> >> AM1: Is `application-event-queue-age-avg` enough or do we require `
> >> application-event-queue-age-max` as well to differentiate with outliers?
> >>
> >> AM2: The kafka producer defines metric `record-queue-time-avg` which
> >> captures the time spent in the buffer. Do you think we should have a
> >> similar name for `application-event-queue-age-avg` i.e. change to `
> >> application-event-queue-time-avg`? Moreover other than similar naming,
> >> `time` anyways seems more suitable than `age`, though minor. The `time`
> >> usage is also aligned with the description of this metric.
> >>
> >> AM3: Metric `application-event-processing-time` says "the average time,
> >> that the consumer network.....". Shall we have the `-avg` suffix in the
> >> metric as we have defined for other metrics? Also do we require the max
> >> metric as well for the same?
> >>
> >> AM4: Is the telemetry name for `unsent-requests-queue-size` intended
> >> as `org.apache.kafka.consumer.unsent.requests.size`,
> >> or it should be corrected to `
> >> org.apache.kafka.consumer.unsent.requests.queue.size`?
> >>
> >> AM2:
> >> Regards,
> >> Apoorv Mittal
> >> +44 7721681581
> >>
> >>
> >> On Mon, Jul 15, 2024 at 2:45 PM Andrew Schofield <
> >> andrew_schofi...@live.com>
> >> wrote:
> >>
> >>> Hi Brenden,
> >>> Thanks for the updates.
> >>>
> >>> AS4. I see that you’ve added `.ms` to a bunch of the metrics reflecting
> >> the
> >>> fact that they’re measured in milliseconds. However, I observe that
> most
> >>> metrics
> >>> in Kafka that are measured in milliseconds, with some exceptions in
> Kafka
> >>> Connect
> >>> and MirrorMaker do not follow this convention. I would tend to err on
> the
> >>> side of
> >>> consistency with the existing metrics and not use `.ms`. However,
> that’s
> >>> just my
> >>> opinion, so I’d be interested to know what other reviewers of the KIP
> >>> think.
> >>>
> >>> Thanks,
> >>> Andrew
> >>>
> >>>> On 12 Jul 2024, at 20:11, Brenden Deluna <bdel...@confluent.io.INVALID
> >>>
> >>> wrote:
> >>>>
> >>>> Hey Lianet,
> >>>>
> >>>> Thank you for your suggestions and feedback!
> >>>>
> >>>>
> >>>> LM1. This has now been addressed.
> >>>>
> >>>>
> >>>> LM2. I think that would be a valuable addition to the current set of
> >>>> metrics, I will get that added.
> >>>>
> >>>>
> >>>> LM3. Again great idea, that would certainly be helpful. Will add that
> >> as
> >>>> well.
> >>>>
> >>>>
> >>>> Let me know if you have any more suggestions!
> >>>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Brenden
> >>>>
> >>>> On Fri, Jul 12, 2024 at 2:11 PM Brenden Deluna <bdel...@confluent.io>
> >>> wrote:
> >>>>
> >>>>> Hi Lucas,
> >>>>>
> >>>>> Thank you for the feedback! I have addressed your comments:
> >>>>>
> >>>>>
> >>>>> LB1. Good catch there, I will update the names as needed.
> >>>>>
> >>>>>
> >>>>> LB2. Good catch again! I will update the name to be more consistent.
> >>>>>
> >>>>>
> >>>>> LB3. Thank you for pointing this out, I realized that all metric
> >> values
> >>>>> will actually be set to 0. I will specifiy this and explain why they
> >>> will
> >>>>> be 0.
> >>>>>
> >>>>>
> >>>>> Nit: This metric is referring to the queue of unsent requests in the
> >>>>> NetworkClientDelegate. For the metric descriptions I am trying to not
> >>>>> include too much of the implementation details, hence the reason that
> >>>>> description is quite short. I cannot think of other ways to describe
> >> the
> >>>>> metric without going deeper into the implementation, but please do
> let
> >>> me
> >>>>> know if you have any ideas.
> >>>>>
> >>>>>
> >>>>> Thank you,
> >>>>>
> >>>>> Brenden
> >>>>>
> >>>>> On Fri, Jul 12, 2024 at 1:27 PM Lianet M. <liane...@gmail.com>
> wrote:
> >>>>>
> >>>>>> Hey Brenden, thanks for the KIP! Great to get more visibility into
> >> the
> >>> new
> >>>>>> consumer.
> >>>>>>
> >>>>>> LM1. +1 on Lucas's suggestion for including the unit in the name,
> >> seems
> >>>>>> clearer and consistent (I do see several time metrics including ms)
> >>>>>>
> >>>>>> LM2. What about a new metric for application-event-queue-time-ms. It
> >>> would
> >>>>>> be a complement to the application-event-queue-size you're
> proposing,
> >>> and
> >>>>>> it will tell us how long the events sit in the queue, waiting to be
> >>>>>> processed (from the time the API call adds the event to the queue,
> to
> >>> the
> >>>>>> time it's processed in the background thread). I find it would be
> >> very
> >>>>>> interesting.
> >>>>>>
> >>>>>> LM3. Thinking about the actual usage of
> >>>>>> "time-between-network-thread-poll-xxx" metric, I imagine it would be
> >>>>>> helpful to know more about what could be impacting it. As I see it,
> >> the
> >>>>>> network thread cadence could be mainly impacted by: 1- app event
> >>>>>> processing
> >>>>>> (generate requests), 2- network client poll (actual send/receive).
> >> For
> >>> 2,
> >>>>>> the new consumer reuses the same component as the legacy one, but 1
> >> is
> >>>>>> specific to the new consumer, so what about a metric
> >>>>>> for application-event-processing-time-ms (we could consider avg I
> >> would
> >>>>>> say). It would be the time that the network thread takes to process
> >> all
> >>>>>> available events on each run.
> >>>>>>
> >>>>>> Cheers!
> >>>>>> Lianet
> >>>>>>
> >>>>>> On Fri, Jul 12, 2024 at 1:57 PM Lucas Brutschy
> >>>>>> <lbruts...@confluent.io.invalid> wrote:
> >>>>>>
> >>>>>>> Hey Brenden,
> >>>>>>>
> >>>>>>> thanks for the KIP! These will be great to observe and debug the
> >>>>>>> background thread of the new consumer.
> >>>>>>>
> >>>>>>> LB1. `time-between-network-thread-poll-max` → I see several similar
> >>>>>>> metrics including the unit in the metric name (ms or us). We could
> >>>>>>> consider this, although it's probably not strictly required.
> >> However,
> >>>>>>> at least the description should state the unit. Same for the `avg`
> >>>>>>> version.
> >>>>>>> LB2. `unsent-requests-size` → Naming sounds a bit like it's
> >> referring
> >>>>>>> to the size of the request. How about `unsent-request-queue-size`
> or
> >>>>>>> `unsent-request-count` or simply `unsent-requests`?
> >>>>>>> LB3. "the proposed metrics below will be set to null or 0." → which
> >>>>>>> one will be set to null and which ones will be set to 0, and why?
> >>>>>>>
> >>>>>>> nit: "The current number of unsent requests in the consumer
> >> network" →
> >>>>>>> Seems to be missing something?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Lucas
> >>>>>>>
> >>>>>>> On Fri, Jul 12, 2024 at 7:28 PM Brenden Deluna
> >>>>>>> <bdel...@confluent.io.invalid> wrote:
> >>>>>>>>
> >>>>>>>> Hi Andrew,
> >>>>>>>> Thank you for the feedback and your question.
> >>>>>>>>
> >>>>>>>> AS1. Great idea, I will get that added.
> >>>>>>>>
> >>>>>>>> AS2. For unsent-events-age-max, age will be calculated once the
> >> event
> >>>>>> is
> >>>>>>>> sent, so you are correct.
> >>>>>>>>
> >>>>>>>> AS3. I agree, I think that would be a helpful metric to add, thank
> >>>>>> you! I
> >>>>>>>> will get that added.
> >>>>>>>>
> >>>>>>>> Please let me know if you have any additional feedback,
> >> suggestions,
> >>>>>> or
> >>>>>>>> questions.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Brenden
> >>>>>>>>
> >>>>>>>> On Fri, Jul 12, 2024 at 11:45 AM Andrew Schofield <
> >>>>>>> andrew_schofi...@live.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Brenden,
> >>>>>>>>> Thanks for the KIP. It fills a gap in the metrics for the new
> >>>>>> consumer
> >>>>>>>>> nicely.
> >>>>>>>>>
> >>>>>>>>> AS1. If using the CLASSIC protocol, and thus the
> >>>>>> LegacyKafkaConsumer,
> >>>>>>>>> I would expect that the metrics do not exist at all. Maybe say
> >>>>>>> something
> >>>>>>>>> like
> >>>>>>>>> “These metrics are for the new consumer implementation using the
> >>>>>>>>> CONSUMER protocol”.
> >>>>>>>>>
> >>>>>>>>> AS2. For unsent-events-age-max, when is the age calculated? For
> >>>>>>> example,
> >>>>>>>>> is it calculated at the time that the unsent event is removed
> from
> >>>>>> the
> >>>>>>>>> list and sent, or does the metric reflect unsent events which are
> >>>>>> still
> >>>>>>>>> enqueued? I suspect the former, but thought I’d check.
> >>>>>>>>>
> >>>>>>>>> AS3. I think that unsent-events-age-avg would also be interesting
> >> to
> >>>>>>>>> get an idea of how long unsent events tend to sit around before
> >>>>>>> sending.
> >>>>>>>>> Of course, the question from AS2 would also apply to the average.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Andrew
> >>>>>>>>>
> >>>>>>>>>> On 10 Jul 2024, at 17:44, Philip Nee <p...@confluent.io.INVALID
> >
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi all,
> >>>>>>>>>>
> >>>>>>>>>> This is the link to the KIP document.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1068%3A+New+JMX+metrics+for+the+new+KafkaConsumer
> >>>>>>>>>>
> >>>>>>>>>> Any comment is appreciated,
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Jul 9, 2024 at 10:14 AM Brenden Deluna
> >>>>>>>>> <bdel...@confluent.io.invalid>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hello everyone,
> >>>>>>>>>>>
> >>>>>>>>>>> I would like to start the discussion thread for KIP-1068. This
> >>>>>> is a
> >>>>>>>>>>> relatively small KIP, only proposing to add a couple of new
> >>>>>> metrics.
> >>>>>>>>>>>
> >>>>>>>>>>> If you have any suggestions or feedback, let me know, it will
> be
> >>>>>>> much
> >>>>>>>>>>> appreciated.
> >>>
> >>>
> >>>
> >>
> >
>

Re: [DISCUSS] KIP-1068: KIP-1068: New JMX Metrics for AsyncKafkaConsumer

Reply via email to