Hi, I echo the commend about symmetry. I expect additional use of the background events in AK 4.0, so adding these metrics while we are in here makes a lot of sense for that reason too.
Thanks, Andrew > On 23 Jul 2024, at 22:18, Lianet M. <liane...@gmail.com> wrote: > > Hey all, > > Follow-up on Bruno's point BC2. I personally did not suggest > background-event-queue-time-max and background-event-queue-time-avg mainly > because in practice we only have 2 types of events flowing from the > background to the app thread: errorEvents and callbackEvents, (vs the many > api-triggered events that flow from the app thread to the background). If > we get deeper into those, the error events are actually very light (from > the processing point of view), and the callbacks events already have their > specific metrics (inherited from the legacy consumer). So that was my > reasoning for not jumping right away into metrics for the background events > queue. > > That being said, I do like the symmetry Bruno mentions, and see the value > in having visibility on both queues. The app queue would probably be the > busiest but we need to consider that the background events set may grow as > we integrate other features like Queues (ex. with its ack commits > callbacks), so ok with me if we prefer to have a view on both queues. > > Lianet > > > On Tue, Jul 23, 2024 at 4:04 PM Brenden Deluna <bdel...@confluent.io.invalid> > wrote: > >> Hi Bruno, >> Thank you for your comments. >> >> BC1. I think that is a great point, I was not aware that we could not add >> the metrics based on which protocol is being used. I will update the KIP to >> reflect that change. >> >> BC2. There is not any specific reason for this, really it has just never >> been suggested in this thread. I will add them to get some opinions on if >> those would be useful and will go with the group consensus. >> >> Thank you, >> Brenden >> >> On Tue, Jul 23, 2024 at 11:03 AM Bruno Cadonna <cado...@apache.org> wrote: >> >>> Hi Brenden, >>> >>> BC1. In his first e-mail Andrew wrote "I would expect that the metrics >>> do not exist at all". I agree with him. I think it would be better to >>> not add those metrics at all if the CLASSIC protocol is used rather than >>> the metrics exist and are all constant 0. This should be possible by not >>> adding the metrics to the metrics registry if the CONSUMER protocol is >>> not used. >>> >>> BC2. Is there a specific reason you do not propose >>> background-event-queue-time-max and background-event-queue-time-avg? If >>> folk think those are not useful we do not need to add them. However, if >>> those are not useful, why is background-event-queue-size useful. I was >>> just wondering about the asymmetry between background-event-queue and >>> application-event-queue. >>> >>> Best, >>> Bruno >>> >>> >>> >>> On 7/19/24 9:14 PM, Brenden Deluna wrote: >>>> Hi Apoorv, >>>> Thank you for your comments, I will address each. >>>> >>>> AM1. I can see the usefulness in also having an >>>> 'application-event-queue-age-max' to get an idea of outliers and how >> they >>>> may be affecting the average metric. I will add that. >>>> >>>> AM2. I agree with you there, I think 'time' is a better descriptor here >>>> than 'age'. I will update those metric names as well. >>>> >>>> AM3. Similar to above comments, I will change the name of that metric >> to >>> be >>>> more consistent. And I think a max metric would also be useful here, >>> adding >>>> that. >>>> >>>> AM4. Yes, good catch there. Will update that as well. >>>> >>>> Thank you, >>>> Brenden >>>> >>>> On Fri, Jul 19, 2024 at 8:14 AM Apoorv Mittal < >> apoorvmitta...@gmail.com> >>>> wrote: >>>> >>>>> Hi Brendan, >>>>> Thanks for the KIP. The metrics are always helpful. >>>>> >>>>> AM1: Is `application-event-queue-age-avg` enough or do we require ` >>>>> application-event-queue-age-max` as well to differentiate with >> outliers? >>>>> >>>>> AM2: The kafka producer defines metric `record-queue-time-avg` which >>>>> captures the time spent in the buffer. Do you think we should have a >>>>> similar name for `application-event-queue-age-avg` i.e. change to ` >>>>> application-event-queue-time-avg`? Moreover other than similar naming, >>>>> `time` anyways seems more suitable than `age`, though minor. The >> `time` >>>>> usage is also aligned with the description of this metric. >>>>> >>>>> AM3: Metric `application-event-processing-time` says "the average >> time, >>>>> that the consumer network.....". Shall we have the `-avg` suffix in >> the >>>>> metric as we have defined for other metrics? Also do we require the >> max >>>>> metric as well for the same? >>>>> >>>>> AM4: Is the telemetry name for `unsent-requests-queue-size` intended >>>>> as `org.apache.kafka.consumer.unsent.requests.size`, >>>>> or it should be corrected to ` >>>>> org.apache.kafka.consumer.unsent.requests.queue.size`? >>>>> >>>>> AM2: >>>>> Regards, >>>>> Apoorv Mittal >>>>> +44 7721681581 >>>>> >>>>> >>>>> On Mon, Jul 15, 2024 at 2:45 PM Andrew Schofield < >>>>> andrew_schofi...@live.com> >>>>> wrote: >>>>> >>>>>> Hi Brenden, >>>>>> Thanks for the updates. >>>>>> >>>>>> AS4. I see that you’ve added `.ms` to a bunch of the metrics >> reflecting >>>>> the >>>>>> fact that they’re measured in milliseconds. However, I observe that >>> most >>>>>> metrics >>>>>> in Kafka that are measured in milliseconds, with some exceptions in >>> Kafka >>>>>> Connect >>>>>> and MirrorMaker do not follow this convention. I would tend to err on >>> the >>>>>> side of >>>>>> consistency with the existing metrics and not use `.ms`. However, >>> that’s >>>>>> just my >>>>>> opinion, so I’d be interested to know what other reviewers of the KIP >>>>>> think. >>>>>> >>>>>> Thanks, >>>>>> Andrew >>>>>> >>>>>>> On 12 Jul 2024, at 20:11, Brenden Deluna >> <bdel...@confluent.io.INVALID >>>>>> >>>>>> wrote: >>>>>>> >>>>>>> Hey Lianet, >>>>>>> >>>>>>> Thank you for your suggestions and feedback! >>>>>>> >>>>>>> >>>>>>> LM1. This has now been addressed. >>>>>>> >>>>>>> >>>>>>> LM2. I think that would be a valuable addition to the current set of >>>>>>> metrics, I will get that added. >>>>>>> >>>>>>> >>>>>>> LM3. Again great idea, that would certainly be helpful. Will add >> that >>>>> as >>>>>>> well. >>>>>>> >>>>>>> >>>>>>> Let me know if you have any more suggestions! >>>>>>> >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Brenden >>>>>>> >>>>>>> On Fri, Jul 12, 2024 at 2:11 PM Brenden Deluna < >> bdel...@confluent.io> >>>>>> wrote: >>>>>>> >>>>>>>> Hi Lucas, >>>>>>>> >>>>>>>> Thank you for the feedback! I have addressed your comments: >>>>>>>> >>>>>>>> >>>>>>>> LB1. Good catch there, I will update the names as needed. >>>>>>>> >>>>>>>> >>>>>>>> LB2. Good catch again! I will update the name to be more >> consistent. >>>>>>>> >>>>>>>> >>>>>>>> LB3. Thank you for pointing this out, I realized that all metric >>>>> values >>>>>>>> will actually be set to 0. I will specifiy this and explain why >> they >>>>>> will >>>>>>>> be 0. >>>>>>>> >>>>>>>> >>>>>>>> Nit: This metric is referring to the queue of unsent requests in >> the >>>>>>>> NetworkClientDelegate. For the metric descriptions I am trying to >> not >>>>>>>> include too much of the implementation details, hence the reason >> that >>>>>>>> description is quite short. I cannot think of other ways to >> describe >>>>> the >>>>>>>> metric without going deeper into the implementation, but please do >>> let >>>>>> me >>>>>>>> know if you have any ideas. >>>>>>>> >>>>>>>> >>>>>>>> Thank you, >>>>>>>> >>>>>>>> Brenden >>>>>>>> >>>>>>>> On Fri, Jul 12, 2024 at 1:27 PM Lianet M. <liane...@gmail.com> >>> wrote: >>>>>>>> >>>>>>>>> Hey Brenden, thanks for the KIP! Great to get more visibility into >>>>> the >>>>>> new >>>>>>>>> consumer. >>>>>>>>> >>>>>>>>> LM1. +1 on Lucas's suggestion for including the unit in the name, >>>>> seems >>>>>>>>> clearer and consistent (I do see several time metrics including >> ms) >>>>>>>>> >>>>>>>>> LM2. What about a new metric for application-event-queue-time-ms. >> It >>>>>> would >>>>>>>>> be a complement to the application-event-queue-size you're >>> proposing, >>>>>> and >>>>>>>>> it will tell us how long the events sit in the queue, waiting to >> be >>>>>>>>> processed (from the time the API call adds the event to the queue, >>> to >>>>>> the >>>>>>>>> time it's processed in the background thread). I find it would be >>>>> very >>>>>>>>> interesting. >>>>>>>>> >>>>>>>>> LM3. Thinking about the actual usage of >>>>>>>>> "time-between-network-thread-poll-xxx" metric, I imagine it would >> be >>>>>>>>> helpful to know more about what could be impacting it. As I see >> it, >>>>> the >>>>>>>>> network thread cadence could be mainly impacted by: 1- app event >>>>>>>>> processing >>>>>>>>> (generate requests), 2- network client poll (actual send/receive). >>>>> For >>>>>> 2, >>>>>>>>> the new consumer reuses the same component as the legacy one, but >> 1 >>>>> is >>>>>>>>> specific to the new consumer, so what about a metric >>>>>>>>> for application-event-processing-time-ms (we could consider avg I >>>>> would >>>>>>>>> say). It would be the time that the network thread takes to >> process >>>>> all >>>>>>>>> available events on each run. >>>>>>>>> >>>>>>>>> Cheers! >>>>>>>>> Lianet >>>>>>>>> >>>>>>>>> On Fri, Jul 12, 2024 at 1:57 PM Lucas Brutschy >>>>>>>>> <lbruts...@confluent.io.invalid> wrote: >>>>>>>>> >>>>>>>>>> Hey Brenden, >>>>>>>>>> >>>>>>>>>> thanks for the KIP! These will be great to observe and debug the >>>>>>>>>> background thread of the new consumer. >>>>>>>>>> >>>>>>>>>> LB1. `time-between-network-thread-poll-max` → I see several >> similar >>>>>>>>>> metrics including the unit in the metric name (ms or us). We >> could >>>>>>>>>> consider this, although it's probably not strictly required. >>>>> However, >>>>>>>>>> at least the description should state the unit. Same for the >> `avg` >>>>>>>>>> version. >>>>>>>>>> LB2. `unsent-requests-size` → Naming sounds a bit like it's >>>>> referring >>>>>>>>>> to the size of the request. How about `unsent-request-queue-size` >>> or >>>>>>>>>> `unsent-request-count` or simply `unsent-requests`? >>>>>>>>>> LB3. "the proposed metrics below will be set to null or 0." → >> which >>>>>>>>>> one will be set to null and which ones will be set to 0, and why? >>>>>>>>>> >>>>>>>>>> nit: "The current number of unsent requests in the consumer >>>>> network" → >>>>>>>>>> Seems to be missing something? >>>>>>>>>> >>>>>>>>>> Cheers, >>>>>>>>>> Lucas >>>>>>>>>> >>>>>>>>>> On Fri, Jul 12, 2024 at 7:28 PM Brenden Deluna >>>>>>>>>> <bdel...@confluent.io.invalid> wrote: >>>>>>>>>>> >>>>>>>>>>> Hi Andrew, >>>>>>>>>>> Thank you for the feedback and your question. >>>>>>>>>>> >>>>>>>>>>> AS1. Great idea, I will get that added. >>>>>>>>>>> >>>>>>>>>>> AS2. For unsent-events-age-max, age will be calculated once the >>>>> event >>>>>>>>> is >>>>>>>>>>> sent, so you are correct. >>>>>>>>>>> >>>>>>>>>>> AS3. I agree, I think that would be a helpful metric to add, >> thank >>>>>>>>> you! I >>>>>>>>>>> will get that added. >>>>>>>>>>> >>>>>>>>>>> Please let me know if you have any additional feedback, >>>>> suggestions, >>>>>>>>> or >>>>>>>>>>> questions. >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Brenden >>>>>>>>>>> >>>>>>>>>>> On Fri, Jul 12, 2024 at 11:45 AM Andrew Schofield < >>>>>>>>>> andrew_schofi...@live.com> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Brenden, >>>>>>>>>>>> Thanks for the KIP. It fills a gap in the metrics for the new >>>>>>>>> consumer >>>>>>>>>>>> nicely. >>>>>>>>>>>> >>>>>>>>>>>> AS1. If using the CLASSIC protocol, and thus the >>>>>>>>> LegacyKafkaConsumer, >>>>>>>>>>>> I would expect that the metrics do not exist at all. Maybe say >>>>>>>>>> something >>>>>>>>>>>> like >>>>>>>>>>>> “These metrics are for the new consumer implementation using >> the >>>>>>>>>>>> CONSUMER protocol”. >>>>>>>>>>>> >>>>>>>>>>>> AS2. For unsent-events-age-max, when is the age calculated? For >>>>>>>>>> example, >>>>>>>>>>>> is it calculated at the time that the unsent event is removed >>> from >>>>>>>>> the >>>>>>>>>>>> list and sent, or does the metric reflect unsent events which >> are >>>>>>>>> still >>>>>>>>>>>> enqueued? I suspect the former, but thought I’d check. >>>>>>>>>>>> >>>>>>>>>>>> AS3. I think that unsent-events-age-avg would also be >> interesting >>>>> to >>>>>>>>>>>> get an idea of how long unsent events tend to sit around before >>>>>>>>>> sending. >>>>>>>>>>>> Of course, the question from AS2 would also apply to the >> average. >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Andrew >>>>>>>>>>>> >>>>>>>>>>>>> On 10 Jul 2024, at 17:44, Philip Nee >> <p...@confluent.io.INVALID >>>> >>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi all, >>>>>>>>>>>>> >>>>>>>>>>>>> This is the link to the KIP document. >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>> >>>>> >>> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1068%3A+New+JMX+metrics+for+the+new+KafkaConsumer >>>>>>>>>>>>> >>>>>>>>>>>>> Any comment is appreciated, >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, Jul 9, 2024 at 10:14 AM Brenden Deluna >>>>>>>>>>>> <bdel...@confluent.io.invalid> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hello everyone, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I would like to start the discussion thread for KIP-1068. >> This >>>>>>>>> is a >>>>>>>>>>>>>> relatively small KIP, only proposing to add a couple of new >>>>>>>>> metrics. >>>>>>>>>>>>>> >>>>>>>>>>>>>> If you have any suggestions or feedback, let me know, it will >>> be >>>>>>>>>> much >>>>>>>>>>>>>> appreciated. >>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >>