Re: [DISCUSS] KIP-1068: KIP-1068: New JMX Metrics for AsyncKafkaConsumer

Andrew Schofield Wed, 24 Jul 2024 02:01:54 -0700

Hi,
I echo the commend about symmetry. I expect additional use of the background
events in AK 4.0, so adding these metrics while we are in here makes a lot of 
sense
for that reason too.


Thanks,
Andrew

> On 23 Jul 2024, at 22:18, Lianet M. <liane...@gmail.com> wrote:
>
> Hey all,
>
> Follow-up on Bruno's point BC2. I personally did not suggest
> background-event-queue-time-max and background-event-queue-time-avg mainly
> because in practice we only have 2 types of events flowing from the
> background to the app thread: errorEvents and callbackEvents, (vs the many
> api-triggered events that flow from the app thread to the background). If
> we get deeper into those, the error events are actually very light (from
> the processing point of view), and the callbacks events already have their
> specific metrics (inherited from the legacy consumer). So that was my
> reasoning for not jumping right away into metrics for the background events
> queue.
>
> That being said, I do like the symmetry Bruno mentions, and see the value
> in having visibility on both queues. The app queue would probably be the
> busiest but we need to consider that the background events set may grow as
> we integrate other features like Queues (ex. with its ack commits
> callbacks), so ok with me if we prefer to have a view on both queues.
>
> Lianet
>
>
> On Tue, Jul 23, 2024 at 4:04 PM Brenden Deluna <bdel...@confluent.io.invalid>
> wrote:
>
>> Hi Bruno,
>> Thank you for your comments.
>>
>> BC1. I think that is a great point, I was not aware that we could not add
>> the metrics based on which protocol is being used. I will update the KIP to
>> reflect that change.
>>
>> BC2. There is not any specific reason for this, really it has just never
>> been suggested in this thread. I will add them to get some opinions on if
>> those would be useful and will go with the group consensus.
>>
>> Thank you,
>> Brenden
>>
>> On Tue, Jul 23, 2024 at 11:03 AM Bruno Cadonna <cado...@apache.org> wrote:
>>
>>> Hi Brenden,
>>>
>>> BC1. In his first e-mail Andrew wrote "I would expect that the metrics
>>> do not exist at all". I agree with him. I think it would be better to
>>> not add those metrics at all if the CLASSIC protocol is used rather than
>>> the metrics exist and are all constant 0. This should be possible by not
>>> adding the metrics to the metrics registry if the CONSUMER protocol is
>>> not used.
>>>
>>> BC2. Is there a specific reason you do not propose
>>> background-event-queue-time-max and background-event-queue-time-avg? If
>>> folk think those are not useful we do not need to add them. However, if
>>> those are not useful, why is background-event-queue-size useful. I was
>>> just wondering about the asymmetry between background-event-queue and
>>> application-event-queue.
>>>
>>> Best,
>>> Bruno
>>>
>>>
>>>
>>> On 7/19/24 9:14 PM, Brenden Deluna wrote:
>>>> Hi Apoorv,
>>>> Thank you for your comments, I will address each.
>>>>
>>>> AM1. I can see the usefulness in also having an
>>>> 'application-event-queue-age-max' to get an idea of outliers and how
>> they
>>>> may be affecting the average metric. I will add that.
>>>>
>>>> AM2. I agree with you there, I think 'time' is a better descriptor here
>>>> than 'age'. I will update those metric names as well.
>>>>
>>>> AM3. Similar to above comments, I will change the name of that metric
>> to
>>> be
>>>> more consistent. And I think a max metric would also be useful here,
>>> adding
>>>> that.
>>>>
>>>> AM4. Yes, good catch there. Will update that as well.
>>>>
>>>> Thank you,
>>>> Brenden
>>>>
>>>> On Fri, Jul 19, 2024 at 8:14 AM Apoorv Mittal <
>> apoorvmitta...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Brendan,
>>>>> Thanks for the KIP. The metrics are always helpful.
>>>>>
>>>>> AM1: Is `application-event-queue-age-avg` enough or do we require `
>>>>> application-event-queue-age-max` as well to differentiate with
>> outliers?
>>>>>
>>>>> AM2: The kafka producer defines metric `record-queue-time-avg` which
>>>>> captures the time spent in the buffer. Do you think we should have a
>>>>> similar name for `application-event-queue-age-avg` i.e. change to `
>>>>> application-event-queue-time-avg`? Moreover other than similar naming,
>>>>> `time` anyways seems more suitable than `age`, though minor. The
>> `time`
>>>>> usage is also aligned with the description of this metric.
>>>>>
>>>>> AM3: Metric `application-event-processing-time` says "the average
>> time,
>>>>> that the consumer network.....". Shall we have the `-avg` suffix in
>> the
>>>>> metric as we have defined for other metrics? Also do we require the
>> max
>>>>> metric as well for the same?
>>>>>
>>>>> AM4: Is the telemetry name for `unsent-requests-queue-size` intended
>>>>> as `org.apache.kafka.consumer.unsent.requests.size`,
>>>>> or it should be corrected to `
>>>>> org.apache.kafka.consumer.unsent.requests.queue.size`?
>>>>>
>>>>> AM2:
>>>>> Regards,
>>>>> Apoorv Mittal
>>>>> +44 7721681581
>>>>>
>>>>>
>>>>> On Mon, Jul 15, 2024 at 2:45 PM Andrew Schofield <
>>>>> andrew_schofi...@live.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Brenden,
>>>>>> Thanks for the updates.
>>>>>>
>>>>>> AS4. I see that you’ve added `.ms` to a bunch of the metrics
>> reflecting
>>>>> the
>>>>>> fact that they’re measured in milliseconds. However, I observe that
>>> most
>>>>>> metrics
>>>>>> in Kafka that are measured in milliseconds, with some exceptions in
>>> Kafka
>>>>>> Connect
>>>>>> and MirrorMaker do not follow this convention. I would tend to err on
>>> the
>>>>>> side of
>>>>>> consistency with the existing metrics and not use `.ms`. However,
>>> that’s
>>>>>> just my
>>>>>> opinion, so I’d be interested to know what other reviewers of the KIP
>>>>>> think.
>>>>>>
>>>>>> Thanks,
>>>>>> Andrew
>>>>>>
>>>>>>> On 12 Jul 2024, at 20:11, Brenden Deluna
>> <bdel...@confluent.io.INVALID
>>>>>>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hey Lianet,
>>>>>>>
>>>>>>> Thank you for your suggestions and feedback!
>>>>>>>
>>>>>>>
>>>>>>> LM1. This has now been addressed.
>>>>>>>
>>>>>>>
>>>>>>> LM2. I think that would be a valuable addition to the current set of
>>>>>>> metrics, I will get that added.
>>>>>>>
>>>>>>>
>>>>>>> LM3. Again great idea, that would certainly be helpful. Will add
>> that
>>>>> as
>>>>>>> well.
>>>>>>>
>>>>>>>
>>>>>>> Let me know if you have any more suggestions!
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Brenden
>>>>>>>
>>>>>>> On Fri, Jul 12, 2024 at 2:11 PM Brenden Deluna <
>> bdel...@confluent.io>
>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Lucas,
>>>>>>>>
>>>>>>>> Thank you for the feedback! I have addressed your comments:
>>>>>>>>
>>>>>>>>
>>>>>>>> LB1. Good catch there, I will update the names as needed.
>>>>>>>>
>>>>>>>>
>>>>>>>> LB2. Good catch again! I will update the name to be more
>> consistent.
>>>>>>>>
>>>>>>>>
>>>>>>>> LB3. Thank you for pointing this out, I realized that all metric
>>>>> values
>>>>>>>> will actually be set to 0. I will specifiy this and explain why
>> they
>>>>>> will
>>>>>>>> be 0.
>>>>>>>>
>>>>>>>>
>>>>>>>> Nit: This metric is referring to the queue of unsent requests in
>> the
>>>>>>>> NetworkClientDelegate. For the metric descriptions I am trying to
>> not
>>>>>>>> include too much of the implementation details, hence the reason
>> that
>>>>>>>> description is quite short. I cannot think of other ways to
>> describe
>>>>> the
>>>>>>>> metric without going deeper into the implementation, but please do
>>> let
>>>>>> me
>>>>>>>> know if you have any ideas.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>>
>>>>>>>> Brenden
>>>>>>>>
>>>>>>>> On Fri, Jul 12, 2024 at 1:27 PM Lianet M. <liane...@gmail.com>
>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Brenden, thanks for the KIP! Great to get more visibility into
>>>>> the
>>>>>> new
>>>>>>>>> consumer.
>>>>>>>>>
>>>>>>>>> LM1. +1 on Lucas's suggestion for including the unit in the name,
>>>>> seems
>>>>>>>>> clearer and consistent (I do see several time metrics including
>> ms)
>>>>>>>>>
>>>>>>>>> LM2. What about a new metric for application-event-queue-time-ms.
>> It
>>>>>> would
>>>>>>>>> be a complement to the application-event-queue-size you're
>>> proposing,
>>>>>> and
>>>>>>>>> it will tell us how long the events sit in the queue, waiting to
>> be
>>>>>>>>> processed (from the time the API call adds the event to the queue,
>>> to
>>>>>> the
>>>>>>>>> time it's processed in the background thread). I find it would be
>>>>> very
>>>>>>>>> interesting.
>>>>>>>>>
>>>>>>>>> LM3. Thinking about the actual usage of
>>>>>>>>> "time-between-network-thread-poll-xxx" metric, I imagine it would
>> be
>>>>>>>>> helpful to know more about what could be impacting it. As I see
>> it,
>>>>> the
>>>>>>>>> network thread cadence could be mainly impacted by: 1- app event
>>>>>>>>> processing
>>>>>>>>> (generate requests), 2- network client poll (actual send/receive).
>>>>> For
>>>>>> 2,
>>>>>>>>> the new consumer reuses the same component as the legacy one, but
>> 1
>>>>> is
>>>>>>>>> specific to the new consumer, so what about a metric
>>>>>>>>> for application-event-processing-time-ms (we could consider avg I
>>>>> would
>>>>>>>>> say). It would be the time that the network thread takes to
>> process
>>>>> all
>>>>>>>>> available events on each run.
>>>>>>>>>
>>>>>>>>> Cheers!
>>>>>>>>> Lianet
>>>>>>>>>
>>>>>>>>> On Fri, Jul 12, 2024 at 1:57 PM Lucas Brutschy
>>>>>>>>> <lbruts...@confluent.io.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Brenden,
>>>>>>>>>>
>>>>>>>>>> thanks for the KIP! These will be great to observe and debug the
>>>>>>>>>> background thread of the new consumer.
>>>>>>>>>>
>>>>>>>>>> LB1. `time-between-network-thread-poll-max` → I see several
>> similar
>>>>>>>>>> metrics including the unit in the metric name (ms or us). We
>> could
>>>>>>>>>> consider this, although it's probably not strictly required.
>>>>> However,
>>>>>>>>>> at least the description should state the unit. Same for the
>> `avg`
>>>>>>>>>> version.
>>>>>>>>>> LB2. `unsent-requests-size` → Naming sounds a bit like it's
>>>>> referring
>>>>>>>>>> to the size of the request. How about `unsent-request-queue-size`
>>> or
>>>>>>>>>> `unsent-request-count` or simply `unsent-requests`?
>>>>>>>>>> LB3. "the proposed metrics below will be set to null or 0." →
>> which
>>>>>>>>>> one will be set to null and which ones will be set to 0, and why?
>>>>>>>>>>
>>>>>>>>>> nit: "The current number of unsent requests in the consumer
>>>>> network" →
>>>>>>>>>> Seems to be missing something?
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Lucas
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 12, 2024 at 7:28 PM Brenden Deluna
>>>>>>>>>> <bdel...@confluent.io.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>> Thank you for the feedback and your question.
>>>>>>>>>>>
>>>>>>>>>>> AS1. Great idea, I will get that added.
>>>>>>>>>>>
>>>>>>>>>>> AS2. For unsent-events-age-max, age will be calculated once the
>>>>> event
>>>>>>>>> is
>>>>>>>>>>> sent, so you are correct.
>>>>>>>>>>>
>>>>>>>>>>> AS3. I agree, I think that would be a helpful metric to add,
>> thank
>>>>>>>>> you! I
>>>>>>>>>>> will get that added.
>>>>>>>>>>>
>>>>>>>>>>> Please let me know if you have any additional feedback,
>>>>> suggestions,
>>>>>>>>> or
>>>>>>>>>>> questions.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Brenden
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jul 12, 2024 at 11:45 AM Andrew Schofield <
>>>>>>>>>> andrew_schofi...@live.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Brenden,
>>>>>>>>>>>> Thanks for the KIP. It fills a gap in the metrics for the new
>>>>>>>>> consumer
>>>>>>>>>>>> nicely.
>>>>>>>>>>>>
>>>>>>>>>>>> AS1. If using the CLASSIC protocol, and thus the
>>>>>>>>> LegacyKafkaConsumer,
>>>>>>>>>>>> I would expect that the metrics do not exist at all. Maybe say
>>>>>>>>>> something
>>>>>>>>>>>> like
>>>>>>>>>>>> “These metrics are for the new consumer implementation using
>> the
>>>>>>>>>>>> CONSUMER protocol”.
>>>>>>>>>>>>
>>>>>>>>>>>> AS2. For unsent-events-age-max, when is the age calculated? For
>>>>>>>>>> example,
>>>>>>>>>>>> is it calculated at the time that the unsent event is removed
>>> from
>>>>>>>>> the
>>>>>>>>>>>> list and sent, or does the metric reflect unsent events which
>> are
>>>>>>>>> still
>>>>>>>>>>>> enqueued? I suspect the former, but thought I’d check.
>>>>>>>>>>>>
>>>>>>>>>>>> AS3. I think that unsent-events-age-avg would also be
>> interesting
>>>>> to
>>>>>>>>>>>> get an idea of how long unsent events tend to sit around before
>>>>>>>>>> sending.
>>>>>>>>>>>> Of course, the question from AS2 would also apply to the
>> average.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Andrew
>>>>>>>>>>>>
>>>>>>>>>>>>> On 10 Jul 2024, at 17:44, Philip Nee
>> <p...@confluent.io.INVALID
>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is the link to the KIP document.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>
>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1068%3A+New+JMX+metrics+for+the+new+KafkaConsumer
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any comment is appreciated,
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 9, 2024 at 10:14 AM Brenden Deluna
>>>>>>>>>>>> <bdel...@confluent.io.invalid>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to start the discussion thread for KIP-1068.
>> This
>>>>>>>>> is a
>>>>>>>>>>>>>> relatively small KIP, only proposing to add a couple of new
>>>>>>>>> metrics.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you have any suggestions or feedback, let me know, it will
>>> be
>>>>>>>>>> much
>>>>>>>>>>>>>> appreciated.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>

Re: [DISCUSS] KIP-1068: KIP-1068: New JMX Metrics for AsyncKafkaConsumer

Reply via email to