Thanks for the KIP Rohan.

I had a question - it seems like from the descriptions
of `txn-commit-time-total` and `offset-commit-time-total` that they measure
similar processes for ALOS and EOS, but only `txn-commit-time-total` is
included in `blocked-time-total`. Why isn't `offset-commit-time-total` also
included?

Aside from `flush-time-total`, `txn-commit-time-total` and
`offset-commit-time-total`, which will be producer/consumer client metrics,
the rest of the metrics will be streams metrics that will be thread level,
is that right?

Cheers,
Leah

On Tue, Jul 20, 2021 at 2:00 AM Rohan Desai <desai.p.ro...@gmail.com> wrote:

> Thanks for the review Guozhang! responding to your feedback inline:
>
> > 1) I agree that the current ratio metrics is just "snapshot in point",
> and
> more flexible metrics that would allow reporters to calculate based on
> window intervals are better. However, the current mechanism of the proposed
> metrics assumes the thread->clients mapping as of today, where each thread
> would own exclusively one main consumer, restore consumer, producer and an
> admin client. But this mapping may be subject to change in the future. Have
> you thought about how this metric can be extended when, e.g. the embedded
> clients and stream threads are de-coupled?
>
> Of course this depends on how exactly we refactor the runtime - assuming
> that we plan to factor out consumers into an "I/O" layer that is
> responsible for receiving records and enqueuing them to be processed by
> processing threads, then I think it should be reasonable to count the time
> we spend blocked on this internal queue(s) as blocked. The main concern
> there to me is that the I/O layer would be doing something expensive like
> decompression that shouldn't be counted as "blocked". But if that really is
> so expensive that it starts to throw off our ratios then it's probably
> indicative of a larger problem that the "i/o layer" is a bottleneck and it
> would be worth refactoring so that decompression (or insert other expensive
> thing here) can also be done on the processing threads.
>
> > 2) [This and all below are minor comments] The "flush-time-total" may
> better be a producer client metric, as "flush-wait-time-total", than a
> streams metric, though the streams-level "total-blocked" can still leverage
> it. Similarly, I think "txn-commit-time-total" and
> "offset-commit-time-total" may better be inside producer and consumer
> clients respectively.
>
> Good call - I'll update the KIP
>
> > 3) The doc was not very clear on how "thread-start-time" would be needed
> when calculating streams utilization along with total-blocked time, could
> you elaborate a bit more in the KIP?
>
> Yes, will do.
>
> > For "txn-commit-time-total" specifically, besides producer.commitTxn.
> other txn-related calls may also be blocking, including
> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
> later in the doc, but did not include it as a separate metric, and
> similarly, should we have a `txn-abort-time-total` as well? If yes, could
> you update the KIP page accordingly.
>
> Ack.
>
> On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <desai.p.ro...@gmail.com>
> wrote:
>
> > Hello All,
> >
> > I'd like to start a discussion on the KIP linked above which proposes
> some
> > metrics that we would find useful to help measure whether a Kafka Streams
> > application is saturated. The motivation section in the KIP goes into
> some
> > more detail on why we think this is a useful addition to the metrics
> > already implemented. Thanks in advance for your feedback!
> >
> > Best Regards,
> >
> > Rohan
> >
> > On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <desai.p.ro...@gmail.com>
> > wrote:
> >
> >>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
> >>
> >
>

Reply via email to