Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
Thanks Rohan, I've reviewed the new PR and had a question regarding whether we should have the new metric in ms or ns, maybe we can first discuss about that before we finalize the KIP? Guozhang On Fri, Feb 25, 2022 at 5:48 AM Bruno Cadonna wrote: > Hi Rohan, > > Thank you for the heads up! > > Yes, please update the KIP and send this message also to the VOTE thread > of the KIP. > > Best, > Bruno > > On 25.02.22 04:01, Rohan Desai wrote: > > Hello, > > > > I discovered a bug in the design of this metric. The bug is documented > > here: https://github.com/apache/kafka/pull/11805. We need to include > time > > the producer spends waiting on topic metadata into the total blocked > time. > > But to do this we would need to add a new producer metric that tracks the > > total time spent blocked on metadata. I've implemented this in a patch > > here: https://github.com/apache/kafka/pull/11805 > > > > I'm hoping I can just update this KIP to include the new producer metric. > > > > On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai > > wrote: > > > >> > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams > >> > > > -- -- Guozhang
Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
Hi Rohan, Thank you for the heads up! Yes, please update the KIP and send this message also to the VOTE thread of the KIP. Best, Bruno On 25.02.22 04:01, Rohan Desai wrote: Hello, I discovered a bug in the design of this metric. The bug is documented here: https://github.com/apache/kafka/pull/11805. We need to include time the producer spends waiting on topic metadata into the total blocked time. But to do this we would need to add a new producer metric that tracks the total time spent blocked on metadata. I've implemented this in a patch here: https://github.com/apache/kafka/pull/11805 I'm hoping I can just update this KIP to include the new producer metric. On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai wrote: https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
Hello, I discovered a bug in the design of this metric. The bug is documented here: https://github.com/apache/kafka/pull/11805. We need to include time the producer spends waiting on topic metadata into the total blocked time. But to do this we would need to add a new producer metric that tracks the total time spent blocked on metadata. I've implemented this in a patch here: https://github.com/apache/kafka/pull/11805 I'm hoping I can just update this KIP to include the new producer metric. On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai wrote: > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams >
Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
Hi, With "group", I actually meant the "type" tag. A thread-level metrics in Streams has the following tag "type=stream-thread-metrics" which we also call "group" in the code. I realized now that I mentioned "type" and "group" in my previous e-mail which are synonyms for the same concept. For example, the blocked-time-total metric in Rohan's KIP should be defined as: blocked-time-total tags: type = stream-thread-metrics thread-id = [stream thread ID] level: INFO description: the total time the Kafka Streams thread spent blocked on Kafka. Best, Bruno On 27.07.21 10:13, Sophie Blee-Goldman wrote: Thanks for the clarifications, that all makes sense. I'm ready to vote on the KIP, but can you just update the KIP first to address Bruno's feedback? Ie just fix the tags and fill in the missing fields. For example it sounds like these would be thread-level metrics. You should be able to figure out what the values should be from the KIP-444 doc, there's a chart with the type and tags for all thread-level metrics. Not 100% sure what Bruno meant by "group" but my guess would be whether it's INFO/DEBUG/TRACE. This is probably one of the most important things to include in a KIP that's introducing new metrics: how big is the potential performance impact of recording these metrics? How big is the intended audience, would these be useful to almost everyone or are they more "niche"? It sounds like maybe DEBUG would be most appropriate here -- WDYT? -Sophie On Thu, Jul 22, 2021 at 9:01 AM Bruno Cadonna wrote: Hi Rohan, Thank you for the KIP! I agree that the KIP is well-motivated. What is not very clear is the metadata like type, group, and tags of the metrics. For example, there is not application-id tag in Streams and there is also no producer-id tag. The clients, i.e., producer, admin, consumer, and also Streams have a client-id tag, that corresponds to the producer-id, consumer-id, etc you use in the KIP. For examples of metadata used in Streams you can look at the following KIPs: - https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams - https://cwiki.apache.org/confluence/display/KAFKA/KIP-471%3A+Expose+RocksDB+Metrics+in+Kafka+Streams - https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Kafka+Streams+to+Report+Properties+of+RocksDB - https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams Best, Bruno On 22.07.21 09:42, Rohan Desai wrote: re sophie: The intent here was to include all blocked time (not just `RUNNING`). The caller can window the total blocked time themselves, and that can be compared with a timeseries of the state to understand the ratio in different states. I'll update the KIP to include `committed`. The admin API calls should be accounted for by the admin client iotime/iowaittime metrics. On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai wrote: I remember now that we moved the round-trip PID's txn completion logic into init-transaction and commit/abort-transaction. So I think we'd count time as in StreamsProducer#initTransaction as well (admittedly it is in most cases a one-time thing). Makes sense - I'll update the KIP On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai wrote: I had a question - it seems like from the descriptionsof `txn-commit-time-total` and `offset-commit-time-total` that they measure similar processes for ALOS and EOS, but only `txn-commit-time-total` is included in `blocked-time-total`. Why isn't `offset-commit-time-total` also included? I've updated the KIP to include it. Aside from `flush-time-total`, `txn-commit-time-total` and `offset-commit-time-total`, which will be producer/consumer client metrics, the rest of the metrics will be streams metrics that will be thread level, is that right? Based on the feedback from Guozhang, I've updated the KIP to reflect that the lower-level metrics are all client metrics that are then summed to compute the blocked time metric, which is a Streams metric. On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai wrote: Similarly, I think "txn-commit-time-total" and "offset-commit-time-total" may better be inside producer and consumer clients respectively. I agree for offset-commit-time-total. For txn-commit-time-total I'm proposing we measure `StreamsProducer.commitTransaction`, which wraps multiple producer calls (sendOffsets, commitTransaction) For "txn-commit-time-total" specifically, besides producer.commitTxn. other txn-related calls may also be blocking, including producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" later in the doc, but did not include it as a separate metric, and similarly, should we have a `txn-abort-time-total` as well? If yes, could you update the KIP page accordingly. `beginTransaction` is not blocking - I meant to remove that from that doc. I'll add something for abort. On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai wrote: T
Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
Thanks for the clarifications, that all makes sense. I'm ready to vote on the KIP, but can you just update the KIP first to address Bruno's feedback? Ie just fix the tags and fill in the missing fields. For example it sounds like these would be thread-level metrics. You should be able to figure out what the values should be from the KIP-444 doc, there's a chart with the type and tags for all thread-level metrics. Not 100% sure what Bruno meant by "group" but my guess would be whether it's INFO/DEBUG/TRACE. This is probably one of the most important things to include in a KIP that's introducing new metrics: how big is the potential performance impact of recording these metrics? How big is the intended audience, would these be useful to almost everyone or are they more "niche"? It sounds like maybe DEBUG would be most appropriate here -- WDYT? -Sophie On Thu, Jul 22, 2021 at 9:01 AM Bruno Cadonna wrote: > Hi Rohan, > > Thank you for the KIP! > > I agree that the KIP is well-motivated. > > What is not very clear is the metadata like type, group, and tags of the > metrics. For example, there is not application-id tag in Streams and > there is also no producer-id tag. The clients, i.e., producer, admin, > consumer, and also Streams have a client-id tag, that corresponds to the > producer-id, consumer-id, etc you use in the KIP. > > For examples of metadata used in Streams you can look at the following > KIPs: > > - > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams > - > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-471%3A+Expose+RocksDB+Metrics+in+Kafka+Streams > - > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Kafka+Streams+to+Report+Properties+of+RocksDB > - > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams > > > Best, > Bruno > > On 22.07.21 09:42, Rohan Desai wrote: > > re sophie: > > > > The intent here was to include all blocked time (not just `RUNNING`). The > > caller can window the total blocked time themselves, and that can be > > compared with a timeseries of the state to understand the ratio in > > different states. I'll update the KIP to include `committed`. The admin > API > > calls should be accounted for by the admin client iotime/iowaittime > > metrics. > > > > On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai > > wrote: > > > >>> I remember now that we moved the round-trip PID's txn completion logic > >> into > >> init-transaction and commit/abort-transaction. So I think we'd count > time > >> as in StreamsProducer#initTransaction as well (admittedly it is in most > >> cases a one-time thing). > >> > >> Makes sense - I'll update the KIP > >> > >> On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai > >> wrote: > >> > >>> > I had a question - it seems like from the descriptionsof > >>> `txn-commit-time-total` and `offset-commit-time-total` that they > measure > >>> similar processes for ALOS and EOS, but only `txn-commit-time-total` is > >>> included in `blocked-time-total`. Why isn't `offset-commit-time-total` > also > >>> included? > >>> > >>> I've updated the KIP to include it. > >>> > Aside from `flush-time-total`, `txn-commit-time-total` and > >>> `offset-commit-time-total`, which will be producer/consumer client > >>> metrics, > >>> the rest of the metrics will be streams metrics that will be thread > level, > >>> is that right? > >>> > >>> Based on the feedback from Guozhang, I've updated the KIP to reflect > that > >>> the lower-level metrics are all client metrics that are then summed to > >>> compute the blocked time metric, which is a Streams metric. > >>> > >>> On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai > >>> wrote: > >>> > > Similarly, I think "txn-commit-time-total" and > "offset-commit-time-total" may better be inside producer and consumer > clients respectively. > > I agree for offset-commit-time-total. For txn-commit-time-total I'm > proposing we measure `StreamsProducer.commitTransaction`, which wraps > multiple producer calls (sendOffsets, commitTransaction) > > >> For "txn-commit-time-total" specifically, besides > producer.commitTxn. > other txn-related calls may also be blocking, including > producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" > later in the doc, but did not include it as a separate metric, and > similarly, should we have a `txn-abort-time-total` as well? If yes, > could > you update the KIP page accordingly. > > `beginTransaction` is not blocking - I meant to remove that from that > doc. I'll add something for abort. > > On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai > > wrote: > > > Thanks for the review Guozhang! responding to your feedback inline: > > > >> 1) I agree that the current ratio metrics is just "snapshot in > > point", and > > more flexib
Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
Hi Rohan, Thank you for the KIP! I agree that the KIP is well-motivated. What is not very clear is the metadata like type, group, and tags of the metrics. For example, there is not application-id tag in Streams and there is also no producer-id tag. The clients, i.e., producer, admin, consumer, and also Streams have a client-id tag, that corresponds to the producer-id, consumer-id, etc you use in the KIP. For examples of metadata used in Streams you can look at the following KIPs: - https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams - https://cwiki.apache.org/confluence/display/KAFKA/KIP-471%3A+Expose+RocksDB+Metrics+in+Kafka+Streams - https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Kafka+Streams+to+Report+Properties+of+RocksDB - https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams Best, Bruno On 22.07.21 09:42, Rohan Desai wrote: re sophie: The intent here was to include all blocked time (not just `RUNNING`). The caller can window the total blocked time themselves, and that can be compared with a timeseries of the state to understand the ratio in different states. I'll update the KIP to include `committed`. The admin API calls should be accounted for by the admin client iotime/iowaittime metrics. On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai wrote: I remember now that we moved the round-trip PID's txn completion logic into init-transaction and commit/abort-transaction. So I think we'd count time as in StreamsProducer#initTransaction as well (admittedly it is in most cases a one-time thing). Makes sense - I'll update the KIP On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai wrote: I had a question - it seems like from the descriptionsof `txn-commit-time-total` and `offset-commit-time-total` that they measure similar processes for ALOS and EOS, but only `txn-commit-time-total` is included in `blocked-time-total`. Why isn't `offset-commit-time-total` also included? I've updated the KIP to include it. Aside from `flush-time-total`, `txn-commit-time-total` and `offset-commit-time-total`, which will be producer/consumer client metrics, the rest of the metrics will be streams metrics that will be thread level, is that right? Based on the feedback from Guozhang, I've updated the KIP to reflect that the lower-level metrics are all client metrics that are then summed to compute the blocked time metric, which is a Streams metric. On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai wrote: Similarly, I think "txn-commit-time-total" and "offset-commit-time-total" may better be inside producer and consumer clients respectively. I agree for offset-commit-time-total. For txn-commit-time-total I'm proposing we measure `StreamsProducer.commitTransaction`, which wraps multiple producer calls (sendOffsets, commitTransaction) For "txn-commit-time-total" specifically, besides producer.commitTxn. other txn-related calls may also be blocking, including producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" later in the doc, but did not include it as a separate metric, and similarly, should we have a `txn-abort-time-total` as well? If yes, could you update the KIP page accordingly. `beginTransaction` is not blocking - I meant to remove that from that doc. I'll add something for abort. On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai wrote: Thanks for the review Guozhang! responding to your feedback inline: 1) I agree that the current ratio metrics is just "snapshot in point", and more flexible metrics that would allow reporters to calculate based on window intervals are better. However, the current mechanism of the proposed metrics assumes the thread->clients mapping as of today, where each thread would own exclusively one main consumer, restore consumer, producer and an admin client. But this mapping may be subject to change in the future. Have you thought about how this metric can be extended when, e.g. the embedded clients and stream threads are de-coupled? Of course this depends on how exactly we refactor the runtime - assuming that we plan to factor out consumers into an "I/O" layer that is responsible for receiving records and enqueuing them to be processed by processing threads, then I think it should be reasonable to count the time we spend blocked on this internal queue(s) as blocked. The main concern there to me is that the I/O layer would be doing something expensive like decompression that shouldn't be counted as "blocked". But if that really is so expensive that it starts to throw off our ratios then it's probably indicative of a larger problem that the "i/o layer" is a bottleneck and it would be worth refactoring so that decompression (or insert other expensive thing here) can also be done on the processing threads. 2) [This and all below are minor comments] The "flush-time-total" may better be a producer client metric, as "flush-wait-time-
Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
re sophie: The intent here was to include all blocked time (not just `RUNNING`). The caller can window the total blocked time themselves, and that can be compared with a timeseries of the state to understand the ratio in different states. I'll update the KIP to include `committed`. The admin API calls should be accounted for by the admin client iotime/iowaittime metrics. On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai wrote: > > I remember now that we moved the round-trip PID's txn completion logic > into > init-transaction and commit/abort-transaction. So I think we'd count time > as in StreamsProducer#initTransaction as well (admittedly it is in most > cases a one-time thing). > > Makes sense - I'll update the KIP > > On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai > wrote: > >> >> > I had a question - it seems like from the descriptionsof >> `txn-commit-time-total` and `offset-commit-time-total` that they measure >> similar processes for ALOS and EOS, but only `txn-commit-time-total` is >> included in `blocked-time-total`. Why isn't `offset-commit-time-total` also >> included? >> >> I've updated the KIP to include it. >> >> > Aside from `flush-time-total`, `txn-commit-time-total` and >> `offset-commit-time-total`, which will be producer/consumer client >> metrics, >> the rest of the metrics will be streams metrics that will be thread level, >> is that right? >> >> Based on the feedback from Guozhang, I've updated the KIP to reflect that >> the lower-level metrics are all client metrics that are then summed to >> compute the blocked time metric, which is a Streams metric. >> >> On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai >> wrote: >> >>> > Similarly, I think "txn-commit-time-total" and >>> "offset-commit-time-total" may better be inside producer and consumer >>> clients respectively. >>> >>> I agree for offset-commit-time-total. For txn-commit-time-total I'm >>> proposing we measure `StreamsProducer.commitTransaction`, which wraps >>> multiple producer calls (sendOffsets, commitTransaction) >>> >>> > > For "txn-commit-time-total" specifically, besides >>> producer.commitTxn. >>> other txn-related calls may also be blocking, including >>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" >>> later in the doc, but did not include it as a separate metric, and >>> similarly, should we have a `txn-abort-time-total` as well? If yes, >>> could >>> you update the KIP page accordingly. >>> >>> `beginTransaction` is not blocking - I meant to remove that from that >>> doc. I'll add something for abort. >>> >>> On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai >>> wrote: >>> Thanks for the review Guozhang! responding to your feedback inline: > 1) I agree that the current ratio metrics is just "snapshot in point", and more flexible metrics that would allow reporters to calculate based on window intervals are better. However, the current mechanism of the proposed metrics assumes the thread->clients mapping as of today, where each thread would own exclusively one main consumer, restore consumer, producer and an admin client. But this mapping may be subject to change in the future. Have you thought about how this metric can be extended when, e.g. the embedded clients and stream threads are de-coupled? Of course this depends on how exactly we refactor the runtime - assuming that we plan to factor out consumers into an "I/O" layer that is responsible for receiving records and enqueuing them to be processed by processing threads, then I think it should be reasonable to count the time we spend blocked on this internal queue(s) as blocked. The main concern there to me is that the I/O layer would be doing something expensive like decompression that shouldn't be counted as "blocked". But if that really is so expensive that it starts to throw off our ratios then it's probably indicative of a larger problem that the "i/o layer" is a bottleneck and it would be worth refactoring so that decompression (or insert other expensive thing here) can also be done on the processing threads. > 2) [This and all below are minor comments] The "flush-time-total" may better be a producer client metric, as "flush-wait-time-total", than a streams metric, though the streams-level "total-blocked" can still leverage it. Similarly, I think "txn-commit-time-total" and "offset-commit-time-total" may better be inside producer and consumer clients respectively. Good call - I'll update the KIP > 3) The doc was not very clear on how "thread-start-time" would be needed when calculating streams utilization along with total-blocked time, could you elaborate a bit more in the KIP? Yes, will do. > For "txn-commit-time-total" specifically, besides producer.commitTxn. other txn-related calls may also be bl
Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
Hey Rohan, The current metrics proposed in the KIP all LGTM, but if the goal is to include *all* time spent blocking on any client API then there are a few that might need to be added to this list. For example Consumer#committed is called in StreamTask to get the offsets during initialization, and several blocking AdminClient APIs are used by Streams such as #listOffsets (as well as #createTopics and #describeTopics, but only during a rebalance to create/validate the internal topics) One thing that wasn't quite clear to me was whether this metric is intended to reflect only the time spent blocking on Kafka *while the StreamThread is in RUNNING*, or just the total time spent inside any Kafka client API after the thread has started up. If we only care about the former, then the current proposal is sufficient, but if we're actually targeting the latter then we need to consider those additional APIs listed above. Personally, I feel that both could actually be useful -- ie, we expose one metric that only corresponds to time we spend blocked on Kafka that could otherwise be spent processing records from active tasks, and a second metric that reports all time spent blocking on Kafka, whether the thread was still initializing (ie in the PARTITIONS_ASSIGNED state) or actively processing (ie in RUNNING). WDYT? On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai wrote: > > I remember now that we moved the round-trip PID's txn completion logic > into > init-transaction and commit/abort-transaction. So I think we'd count time > as in StreamsProducer#initTransaction as well (admittedly it is in most > cases a one-time thing). > > Makes sense - I'll update the KIP > > On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai > wrote: > > > > > > I had a question - it seems like from the descriptionsof > > `txn-commit-time-total` and `offset-commit-time-total` that they measure > > similar processes for ALOS and EOS, but only `txn-commit-time-total` is > > included in `blocked-time-total`. Why isn't `offset-commit-time-total` > also > > included? > > > > I've updated the KIP to include it. > > > > > Aside from `flush-time-total`, `txn-commit-time-total` and > > `offset-commit-time-total`, which will be producer/consumer client > metrics, > > the rest of the metrics will be streams metrics that will be thread > level, > > is that right? > > > > Based on the feedback from Guozhang, I've updated the KIP to reflect that > > the lower-level metrics are all client metrics that are then summed to > > compute the blocked time metric, which is a Streams metric. > > > > On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai > > wrote: > > > >> > Similarly, I think "txn-commit-time-total" and > >> "offset-commit-time-total" may better be inside producer and consumer > >> clients respectively. > >> > >> I agree for offset-commit-time-total. For txn-commit-time-total I'm > >> proposing we measure `StreamsProducer.commitTransaction`, which wraps > >> multiple producer calls (sendOffsets, commitTransaction) > >> > >> > > For "txn-commit-time-total" specifically, besides > producer.commitTxn. > >> other txn-related calls may also be blocking, including > >> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" > >> later in the doc, but did not include it as a separate metric, and > >> similarly, should we have a `txn-abort-time-total` as well? If yes, > could > >> you update the KIP page accordingly. > >> > >> `beginTransaction` is not blocking - I meant to remove that from that > >> doc. I'll add something for abort. > >> > >> On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai > >> wrote: > >> > >>> Thanks for the review Guozhang! responding to your feedback inline: > >>> > >>> > 1) I agree that the current ratio metrics is just "snapshot in > >>> point", and > >>> more flexible metrics that would allow reporters to calculate based on > >>> window intervals are better. However, the current mechanism of the > >>> proposed > >>> metrics assumes the thread->clients mapping as of today, where each > >>> thread > >>> would own exclusively one main consumer, restore consumer, producer and > >>> an > >>> admin client. But this mapping may be subject to change in the future. > >>> Have > >>> you thought about how this metric can be extended when, e.g. the > embedded > >>> clients and stream threads are de-coupled? > >>> > >>> Of course this depends on how exactly we refactor the runtime - > assuming > >>> that we plan to factor out consumers into an "I/O" layer that is > >>> responsible for receiving records and enqueuing them to be processed by > >>> processing threads, then I think it should be reasonable to count the > time > >>> we spend blocked on this internal queue(s) as blocked. The main concern > >>> there to me is that the I/O layer would be doing something expensive > like > >>> decompression that shouldn't be counted as "blocked". But if that > really is > >>> so expensive that it starts to throw off our ratios then it's probably > >>> ind
Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
> I remember now that we moved the round-trip PID's txn completion logic into init-transaction and commit/abort-transaction. So I think we'd count time as in StreamsProducer#initTransaction as well (admittedly it is in most cases a one-time thing). Makes sense - I'll update the KIP On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai wrote: > > > I had a question - it seems like from the descriptionsof > `txn-commit-time-total` and `offset-commit-time-total` that they measure > similar processes for ALOS and EOS, but only `txn-commit-time-total` is > included in `blocked-time-total`. Why isn't `offset-commit-time-total` also > included? > > I've updated the KIP to include it. > > > Aside from `flush-time-total`, `txn-commit-time-total` and > `offset-commit-time-total`, which will be producer/consumer client metrics, > the rest of the metrics will be streams metrics that will be thread level, > is that right? > > Based on the feedback from Guozhang, I've updated the KIP to reflect that > the lower-level metrics are all client metrics that are then summed to > compute the blocked time metric, which is a Streams metric. > > On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai > wrote: > >> > Similarly, I think "txn-commit-time-total" and >> "offset-commit-time-total" may better be inside producer and consumer >> clients respectively. >> >> I agree for offset-commit-time-total. For txn-commit-time-total I'm >> proposing we measure `StreamsProducer.commitTransaction`, which wraps >> multiple producer calls (sendOffsets, commitTransaction) >> >> > > For "txn-commit-time-total" specifically, besides producer.commitTxn. >> other txn-related calls may also be blocking, including >> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" >> later in the doc, but did not include it as a separate metric, and >> similarly, should we have a `txn-abort-time-total` as well? If yes, could >> you update the KIP page accordingly. >> >> `beginTransaction` is not blocking - I meant to remove that from that >> doc. I'll add something for abort. >> >> On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai >> wrote: >> >>> Thanks for the review Guozhang! responding to your feedback inline: >>> >>> > 1) I agree that the current ratio metrics is just "snapshot in >>> point", and >>> more flexible metrics that would allow reporters to calculate based on >>> window intervals are better. However, the current mechanism of the >>> proposed >>> metrics assumes the thread->clients mapping as of today, where each >>> thread >>> would own exclusively one main consumer, restore consumer, producer and >>> an >>> admin client. But this mapping may be subject to change in the future. >>> Have >>> you thought about how this metric can be extended when, e.g. the embedded >>> clients and stream threads are de-coupled? >>> >>> Of course this depends on how exactly we refactor the runtime - assuming >>> that we plan to factor out consumers into an "I/O" layer that is >>> responsible for receiving records and enqueuing them to be processed by >>> processing threads, then I think it should be reasonable to count the time >>> we spend blocked on this internal queue(s) as blocked. The main concern >>> there to me is that the I/O layer would be doing something expensive like >>> decompression that shouldn't be counted as "blocked". But if that really is >>> so expensive that it starts to throw off our ratios then it's probably >>> indicative of a larger problem that the "i/o layer" is a bottleneck and it >>> would be worth refactoring so that decompression (or insert other expensive >>> thing here) can also be done on the processing threads. >>> >>> > 2) [This and all below are minor comments] The "flush-time-total" may >>> better be a producer client metric, as "flush-wait-time-total", than a >>> streams metric, though the streams-level "total-blocked" can still >>> leverage >>> it. Similarly, I think "txn-commit-time-total" and >>> "offset-commit-time-total" may better be inside producer and consumer >>> clients respectively. >>> >>> Good call - I'll update the KIP >>> >>> > 3) The doc was not very clear on how "thread-start-time" would be >>> needed >>> when calculating streams utilization along with total-blocked time, could >>> you elaborate a bit more in the KIP? >>> >>> Yes, will do. >>> >>> > For "txn-commit-time-total" specifically, besides producer.commitTxn. >>> other txn-related calls may also be blocking, including >>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" >>> later in the doc, but did not include it as a separate metric, and >>> similarly, should we have a `txn-abort-time-total` as well? If yes, >>> could >>> you update the KIP page accordingly. >>> >>> Ack. >>> >>> On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai >>> wrote: >>> Hello All, I'd like to start a discussion on the KIP linked above which proposes some metrics that we would find useful to help measure whether a Kafka Streams appli
Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
> I had a question - it seems like from the descriptionsof `txn-commit-time-total` and `offset-commit-time-total` that they measure similar processes for ALOS and EOS, but only `txn-commit-time-total` is included in `blocked-time-total`. Why isn't `offset-commit-time-total` also included? I've updated the KIP to include it. > Aside from `flush-time-total`, `txn-commit-time-total` and `offset-commit-time-total`, which will be producer/consumer client metrics, the rest of the metrics will be streams metrics that will be thread level, is that right? Based on the feedback from Guozhang, I've updated the KIP to reflect that the lower-level metrics are all client metrics that are then summed to compute the blocked time metric, which is a Streams metric. On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai wrote: > > Similarly, I think "txn-commit-time-total" and > "offset-commit-time-total" may better be inside producer and consumer > clients respectively. > > I agree for offset-commit-time-total. For txn-commit-time-total I'm > proposing we measure `StreamsProducer.commitTransaction`, which wraps > multiple producer calls (sendOffsets, commitTransaction) > > > > For "txn-commit-time-total" specifically, besides producer.commitTxn. > other txn-related calls may also be blocking, including > producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" > later in the doc, but did not include it as a separate metric, and > similarly, should we have a `txn-abort-time-total` as well? If yes, could > you update the KIP page accordingly. > > `beginTransaction` is not blocking - I meant to remove that from that doc. > I'll add something for abort. > > On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai > wrote: > >> Thanks for the review Guozhang! responding to your feedback inline: >> >> > 1) I agree that the current ratio metrics is just "snapshot in point", >> and >> more flexible metrics that would allow reporters to calculate based on >> window intervals are better. However, the current mechanism of the >> proposed >> metrics assumes the thread->clients mapping as of today, where each thread >> would own exclusively one main consumer, restore consumer, producer and an >> admin client. But this mapping may be subject to change in the future. >> Have >> you thought about how this metric can be extended when, e.g. the embedded >> clients and stream threads are de-coupled? >> >> Of course this depends on how exactly we refactor the runtime - assuming >> that we plan to factor out consumers into an "I/O" layer that is >> responsible for receiving records and enqueuing them to be processed by >> processing threads, then I think it should be reasonable to count the time >> we spend blocked on this internal queue(s) as blocked. The main concern >> there to me is that the I/O layer would be doing something expensive like >> decompression that shouldn't be counted as "blocked". But if that really is >> so expensive that it starts to throw off our ratios then it's probably >> indicative of a larger problem that the "i/o layer" is a bottleneck and it >> would be worth refactoring so that decompression (or insert other expensive >> thing here) can also be done on the processing threads. >> >> > 2) [This and all below are minor comments] The "flush-time-total" may >> better be a producer client metric, as "flush-wait-time-total", than a >> streams metric, though the streams-level "total-blocked" can still >> leverage >> it. Similarly, I think "txn-commit-time-total" and >> "offset-commit-time-total" may better be inside producer and consumer >> clients respectively. >> >> Good call - I'll update the KIP >> >> > 3) The doc was not very clear on how "thread-start-time" would be >> needed >> when calculating streams utilization along with total-blocked time, could >> you elaborate a bit more in the KIP? >> >> Yes, will do. >> >> > For "txn-commit-time-total" specifically, besides producer.commitTxn. >> other txn-related calls may also be blocking, including >> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" >> later in the doc, but did not include it as a separate metric, and >> similarly, should we have a `txn-abort-time-total` as well? If yes, could >> you update the KIP page accordingly. >> >> Ack. >> >> On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai >> wrote: >> >>> Hello All, >>> >>> I'd like to start a discussion on the KIP linked above which proposes >>> some metrics that we would find useful to help measure whether a Kafka >>> Streams application is saturated. The motivation section in the KIP goes >>> into some more detail on why we think this is a useful addition to the >>> metrics already implemented. Thanks in advance for your feedback! >>> >>> Best Regards, >>> >>> Rohan >>> >>> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai >>> wrote: >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams >>>
Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
Thanks Rohan. I remember now that we moved the round-trip PID's txn completion logic into init-transaction and commit/abort-transaction. So I think we'd count time as in StreamsProducer#initTransaction as well (admittedly it is in most cases a one-time thing). Other than that, I do not have any further comments. Guozhang On Tue, Jul 20, 2021 at 11:59 AM Rohan Desai wrote: > > Similarly, I think "txn-commit-time-total" and > "offset-commit-time-total" may better be inside producer and consumer > clients respectively. > > I agree for offset-commit-time-total. For txn-commit-time-total I'm > proposing we measure `StreamsProducer.commitTransaction`, which wraps > multiple producer calls (sendOffsets, commitTransaction) > > > > For "txn-commit-time-total" specifically, besides producer.commitTxn. > other txn-related calls may also be blocking, including > producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" > later in the doc, but did not include it as a separate metric, and > similarly, should we have a `txn-abort-time-total` as well? If yes, could > you update the KIP page accordingly. > > `beginTransaction` is not blocking - I meant to remove that from that doc. > I'll add something for abort. > > On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai > wrote: > > > Thanks for the review Guozhang! responding to your feedback inline: > > > > > 1) I agree that the current ratio metrics is just "snapshot in point", > > and > > more flexible metrics that would allow reporters to calculate based on > > window intervals are better. However, the current mechanism of the > proposed > > metrics assumes the thread->clients mapping as of today, where each > thread > > would own exclusively one main consumer, restore consumer, producer and > an > > admin client. But this mapping may be subject to change in the future. > Have > > you thought about how this metric can be extended when, e.g. the embedded > > clients and stream threads are de-coupled? > > > > Of course this depends on how exactly we refactor the runtime - assuming > > that we plan to factor out consumers into an "I/O" layer that is > > responsible for receiving records and enqueuing them to be processed by > > processing threads, then I think it should be reasonable to count the > time > > we spend blocked on this internal queue(s) as blocked. The main concern > > there to me is that the I/O layer would be doing something expensive like > > decompression that shouldn't be counted as "blocked". But if that really > is > > so expensive that it starts to throw off our ratios then it's probably > > indicative of a larger problem that the "i/o layer" is a bottleneck and > it > > would be worth refactoring so that decompression (or insert other > expensive > > thing here) can also be done on the processing threads. > > > > > 2) [This and all below are minor comments] The "flush-time-total" may > > better be a producer client metric, as "flush-wait-time-total", than a > > streams metric, though the streams-level "total-blocked" can still > leverage > > it. Similarly, I think "txn-commit-time-total" and > > "offset-commit-time-total" may better be inside producer and consumer > > clients respectively. > > > > Good call - I'll update the KIP > > > > > 3) The doc was not very clear on how "thread-start-time" would be > needed > > when calculating streams utilization along with total-blocked time, could > > you elaborate a bit more in the KIP? > > > > Yes, will do. > > > > > For "txn-commit-time-total" specifically, besides producer.commitTxn. > > other txn-related calls may also be blocking, including > > producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" > > later in the doc, but did not include it as a separate metric, and > > similarly, should we have a `txn-abort-time-total` as well? If yes, could > > you update the KIP page accordingly. > > > > Ack. > > > > On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai > > wrote: > > > >> Hello All, > >> > >> I'd like to start a discussion on the KIP linked above which proposes > >> some metrics that we would find useful to help measure whether a Kafka > >> Streams application is saturated. The motivation section in the KIP goes > >> into some more detail on why we think this is a useful addition to the > >> metrics already implemented. Thanks in advance for your feedback! > >> > >> Best Regards, > >> > >> Rohan > >> > >> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai > >> wrote: > >> > >>> > >>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams > >>> > >> > -- -- Guozhang
Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
> Similarly, I think "txn-commit-time-total" and "offset-commit-time-total" may better be inside producer and consumer clients respectively. I agree for offset-commit-time-total. For txn-commit-time-total I'm proposing we measure `StreamsProducer.commitTransaction`, which wraps multiple producer calls (sendOffsets, commitTransaction) > > For "txn-commit-time-total" specifically, besides producer.commitTxn. other txn-related calls may also be blocking, including producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" later in the doc, but did not include it as a separate metric, and similarly, should we have a `txn-abort-time-total` as well? If yes, could you update the KIP page accordingly. `beginTransaction` is not blocking - I meant to remove that from that doc. I'll add something for abort. On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai wrote: > Thanks for the review Guozhang! responding to your feedback inline: > > > 1) I agree that the current ratio metrics is just "snapshot in point", > and > more flexible metrics that would allow reporters to calculate based on > window intervals are better. However, the current mechanism of the proposed > metrics assumes the thread->clients mapping as of today, where each thread > would own exclusively one main consumer, restore consumer, producer and an > admin client. But this mapping may be subject to change in the future. Have > you thought about how this metric can be extended when, e.g. the embedded > clients and stream threads are de-coupled? > > Of course this depends on how exactly we refactor the runtime - assuming > that we plan to factor out consumers into an "I/O" layer that is > responsible for receiving records and enqueuing them to be processed by > processing threads, then I think it should be reasonable to count the time > we spend blocked on this internal queue(s) as blocked. The main concern > there to me is that the I/O layer would be doing something expensive like > decompression that shouldn't be counted as "blocked". But if that really is > so expensive that it starts to throw off our ratios then it's probably > indicative of a larger problem that the "i/o layer" is a bottleneck and it > would be worth refactoring so that decompression (or insert other expensive > thing here) can also be done on the processing threads. > > > 2) [This and all below are minor comments] The "flush-time-total" may > better be a producer client metric, as "flush-wait-time-total", than a > streams metric, though the streams-level "total-blocked" can still leverage > it. Similarly, I think "txn-commit-time-total" and > "offset-commit-time-total" may better be inside producer and consumer > clients respectively. > > Good call - I'll update the KIP > > > 3) The doc was not very clear on how "thread-start-time" would be needed > when calculating streams utilization along with total-blocked time, could > you elaborate a bit more in the KIP? > > Yes, will do. > > > For "txn-commit-time-total" specifically, besides producer.commitTxn. > other txn-related calls may also be blocking, including > producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" > later in the doc, but did not include it as a separate metric, and > similarly, should we have a `txn-abort-time-total` as well? If yes, could > you update the KIP page accordingly. > > Ack. > > On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai > wrote: > >> Hello All, >> >> I'd like to start a discussion on the KIP linked above which proposes >> some metrics that we would find useful to help measure whether a Kafka >> Streams application is saturated. The motivation section in the KIP goes >> into some more detail on why we think this is a useful addition to the >> metrics already implemented. Thanks in advance for your feedback! >> >> Best Regards, >> >> Rohan >> >> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai >> wrote: >> >>> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams >>> >>
Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
Thanks for the KIP Rohan. I had a question - it seems like from the descriptions of `txn-commit-time-total` and `offset-commit-time-total` that they measure similar processes for ALOS and EOS, but only `txn-commit-time-total` is included in `blocked-time-total`. Why isn't `offset-commit-time-total` also included? Aside from `flush-time-total`, `txn-commit-time-total` and `offset-commit-time-total`, which will be producer/consumer client metrics, the rest of the metrics will be streams metrics that will be thread level, is that right? Cheers, Leah On Tue, Jul 20, 2021 at 2:00 AM Rohan Desai wrote: > Thanks for the review Guozhang! responding to your feedback inline: > > > 1) I agree that the current ratio metrics is just "snapshot in point", > and > more flexible metrics that would allow reporters to calculate based on > window intervals are better. However, the current mechanism of the proposed > metrics assumes the thread->clients mapping as of today, where each thread > would own exclusively one main consumer, restore consumer, producer and an > admin client. But this mapping may be subject to change in the future. Have > you thought about how this metric can be extended when, e.g. the embedded > clients and stream threads are de-coupled? > > Of course this depends on how exactly we refactor the runtime - assuming > that we plan to factor out consumers into an "I/O" layer that is > responsible for receiving records and enqueuing them to be processed by > processing threads, then I think it should be reasonable to count the time > we spend blocked on this internal queue(s) as blocked. The main concern > there to me is that the I/O layer would be doing something expensive like > decompression that shouldn't be counted as "blocked". But if that really is > so expensive that it starts to throw off our ratios then it's probably > indicative of a larger problem that the "i/o layer" is a bottleneck and it > would be worth refactoring so that decompression (or insert other expensive > thing here) can also be done on the processing threads. > > > 2) [This and all below are minor comments] The "flush-time-total" may > better be a producer client metric, as "flush-wait-time-total", than a > streams metric, though the streams-level "total-blocked" can still leverage > it. Similarly, I think "txn-commit-time-total" and > "offset-commit-time-total" may better be inside producer and consumer > clients respectively. > > Good call - I'll update the KIP > > > 3) The doc was not very clear on how "thread-start-time" would be needed > when calculating streams utilization along with total-blocked time, could > you elaborate a bit more in the KIP? > > Yes, will do. > > > For "txn-commit-time-total" specifically, besides producer.commitTxn. > other txn-related calls may also be blocking, including > producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" > later in the doc, but did not include it as a separate metric, and > similarly, should we have a `txn-abort-time-total` as well? If yes, could > you update the KIP page accordingly. > > Ack. > > On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai > wrote: > > > Hello All, > > > > I'd like to start a discussion on the KIP linked above which proposes > some > > metrics that we would find useful to help measure whether a Kafka Streams > > application is saturated. The motivation section in the KIP goes into > some > > more detail on why we think this is a useful addition to the metrics > > already implemented. Thanks in advance for your feedback! > > > > Best Regards, > > > > Rohan > > > > On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai > > wrote: > > > >> > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams > >> > > >
Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
Thanks for the review Guozhang! responding to your feedback inline: > 1) I agree that the current ratio metrics is just "snapshot in point", and more flexible metrics that would allow reporters to calculate based on window intervals are better. However, the current mechanism of the proposed metrics assumes the thread->clients mapping as of today, where each thread would own exclusively one main consumer, restore consumer, producer and an admin client. But this mapping may be subject to change in the future. Have you thought about how this metric can be extended when, e.g. the embedded clients and stream threads are de-coupled? Of course this depends on how exactly we refactor the runtime - assuming that we plan to factor out consumers into an "I/O" layer that is responsible for receiving records and enqueuing them to be processed by processing threads, then I think it should be reasonable to count the time we spend blocked on this internal queue(s) as blocked. The main concern there to me is that the I/O layer would be doing something expensive like decompression that shouldn't be counted as "blocked". But if that really is so expensive that it starts to throw off our ratios then it's probably indicative of a larger problem that the "i/o layer" is a bottleneck and it would be worth refactoring so that decompression (or insert other expensive thing here) can also be done on the processing threads. > 2) [This and all below are minor comments] The "flush-time-total" may better be a producer client metric, as "flush-wait-time-total", than a streams metric, though the streams-level "total-blocked" can still leverage it. Similarly, I think "txn-commit-time-total" and "offset-commit-time-total" may better be inside producer and consumer clients respectively. Good call - I'll update the KIP > 3) The doc was not very clear on how "thread-start-time" would be needed when calculating streams utilization along with total-blocked time, could you elaborate a bit more in the KIP? Yes, will do. > For "txn-commit-time-total" specifically, besides producer.commitTxn. other txn-related calls may also be blocking, including producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" later in the doc, but did not include it as a separate metric, and similarly, should we have a `txn-abort-time-total` as well? If yes, could you update the KIP page accordingly. Ack. On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai wrote: > Hello All, > > I'd like to start a discussion on the KIP linked above which proposes some > metrics that we would find useful to help measure whether a Kafka Streams > application is saturated. The motivation section in the KIP goes into some > more detail on why we think this is a useful addition to the metrics > already implemented. Thanks in advance for your feedback! > > Best Regards, > > Rohan > > On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai > wrote: > >> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams >> >
Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
Hello All, I'd like to start a discussion on the KIP linked above which proposes some metrics that we would find useful to help measure whether a Kafka Streams application is saturated. The motivation section in the KIP goes into some more detail on why we think this is a useful addition to the metrics already implemented. Thanks in advance for your feedback! Best Regards, Rohan On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai wrote: > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams >
Re: [DISCUSS] KIP-761: Add total blocked time metric to streams
Hey Rohan, Thanks for putting up the KIP. Maybe you can also briefly talk about the context in the email message as well (they are very well explained inside the KIP doc though). I have a few meta and detailed thoughts after reading it. 1) I agree that the current ratio metrics is just "snapshot in point", and more flexible metrics that would allow reporters to calculate based on window intervals are better. However, the current mechanism of the proposed metrics assumes the thread->clients mapping as of today, where each thread would own exclusively one main consumer, restore consumer, producer and an admin client. But this mapping may be subject to change in the future. Have you thought about how this metric can be extended when, e.g. the embedded clients and stream threads are de-coupled? 2) [This and all below are minor comments] The "flush-time-total" may better be a producer client metric, as "flush-wait-time-total", than a streams metric, though the streams-level "total-blocked" can still leverage it. Similarly, I think "txn-commit-time-total" and "offset-commit-time-total" may better be inside producer and consumer clients respectively. 3) The doc was not very clear on how "thread-start-time" would be needed when calculating streams utilization along with total-blocked time, could you elaborate a bit more in the KIP? 4) For "txn-commit-time-total" specifically, besides producer.commitTxn, other txn-related calls may also be blocking, including producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total" later in the doc, but did not include it as a separate metric, and similarly, should we have a `txn-abort-time-total` as well? If yes, could you update the KIP page accordingly. 5) Not a suggestion, but just wanted to bring up that the producer related metrics are a bit "coarsen" compared with the consumer/admin clients since their IO mechanisms are a bit different: for producer the caller thread does not do any IOs since it's all delegated to the background sender thread, while for consumer/admin the caller thread would still need to do some IOs, and hence the selector-level metrics would make sense. On top of my head I cannot think of a better measuring mechanism for producers either, especially for txn-related ones, we may need to experiment and see if the generated ratio is relatively accurate and reasonable with the reflected "block time". Guozhang On Mon, Jul 12, 2021 at 12:01 PM Rohan Desai wrote: > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams > -- -- Guozhang
[DISCUSS] KIP-761: Add total blocked time metric to streams
https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams