Re: [DISCUSS] KIP-484: Expose metrics for group and transaction metadata loading duration

Anastasia Vela Sat, 13 Jul 2019 15:39:46 -0700

Hey! Thanks for looking over this KIP. These metrics cover all of the
partitions per manager, not individual partitions. I'll clear this up in the KIP.


Anastasia

On Sat, Jul 13, 2019, 2:15 PM Stanislav Kozlovski <[email protected]>
wrote:

> Hey there,
>
> Thanks for the KIP! I always smile when I see new metrics.
> I have just one quick question - are these metrics per partition or a total
> of all the partitions per manager? The KIP has conflicting sentences so I'm
> not sure what it is.
>
> Thanks,
> Stanislav
>
>
> On Wed, Jul 3, 2019 at 9:13 PM Gwen Shapira <[email protected]> wrote:
>
> > It looks great! If there are no more concerns, let's start a vote.
> >
> > On Tue, Jul 2, 2019 at 2:59 PM Anastasia Vela <[email protected]> wrote:
> > >
> > > After further discussion with Anna, we decided the following:
> > > - add the average metric, but note in the KIP that the average value may
> > > look low at times when there are many empty partitions that have a 0ms
> > > load time
> > > - set the window to 30sec, because there is no significant difference if
> > > we set the window time to 3hrs, so I will keep the default value instead.
> > >
> > > Thanks. Let me know if there are any more concerns.
> > > Anastasia
> > >
> > > On Mon, Jul 1, 2019 at 9:13 AM Anastasia Vela <[email protected]> wrote:
> > >
> > > > Hey Gwen!
> > > >
> > > > Thanks for reviewing my KIP!
> > > >
> > > > 1. I did consider adding an Avg metric as well. Anna and I decided that a
> > > > max would provide the crucial information. We just need to know if there
> > > > was a long load time, and expose what that duration was so we understand
> > > > how long the downtime lasted. However, I do agree that it may be
> > > > necessary to compute averages if we want to give the max a reference point.
> > > > I can easily add this if we believe it is necessary.
> > > > 2. The default refers to the metric configuration set when you initialize
> > > > KafkaServer. When I was running tests, the max value was computed over a
> > > > window of 30 seconds, unless I changed the metrics config. So I noted that
> > > > unless we change the config for this specific metric, it will be computed
> > > > over the default window.
> > > > 3. I proposed a 3 hour window because we have (very rarely) seen
> > > > partitions take hours to load. 3 hours was an upper bound for how long a
> > > > load could take. The way max works is that it computes the running max
> > > > until the window has lapsed. Then it starts a new window and forgets the
> > > > max value of the last window. So if a partition takes more than the window
> > > > time to load, there will be one value in that window and the next load will
> > > > be part of a new window. I guess it just depends on how we want it to be
> > > > displayed on the graph. If this behavior is acceptable, the window
> > > > can be shrunk. Regarding the rate metric, I was actually thinking about
> > > > doing this, but I was told that loads don't happen very often. But it is
> > > > true that if reloads happen very often, that may be a problem.
> > > >
> > > > Thanks,
> > > > Anastasia
> > > >
> > > > On Fri, Jun 28, 2019 at 4:27 PM Gwen Shapira <[email protected]> wrote:
> > > >
> > > >> Hey,
> > > >>
> > > > >> Thank you for proposing this! Sounds really useful - we have
> > > > >> definitely seen some difficult-to-explain pauses in consumer activity
> > > > >> and this metric will let us correlate those.
> > > >>
> > > >> Few questions:
> > > > >> 1. Did you consider adding both Max and Avg metrics? Many of our
> > > > >> metrics have both (batch-size and message-size for example) and it
> > > > >> helps put the max value in context.
> > > > >> 2. You wrote: "Lengthening or shortening the 3 hour time window is up
> > > > >> for discussion (default is 30sec)." and I'm not sure what default you
> > > > >> are referring to?
> > > > >> 3. Can you also give some background on why you are proposing 3h? I'm
> > > > >> guessing it is because loading the state from the topic happens rarely
> > > > >> enough that in 3h it will probably only happen once or not at all?
> > > > >> Perhaps we need a rate metric to see how often it actually happens (if
> > > > >> we have to reload offsets very often it is a different problem).
> > > >>
> > > >> Gwen
> > > >>
> > > >> On Tue, Jun 25, 2019 at 4:43 PM Anastasia Vela <[email protected]>
> > > >> wrote:
> > > >> >
> > > >> > Hi all,
> > > >> >
> > > >> > I'd like to discuss KIP-484:
> > > >> >
> > > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-484%3A+Expose+metrics+for+group+and+transaction+metadata+loading+duration
> > > >> >
> > > >> > Let me know what you think!
> > > >> >
> > > >> > Thanks,
> > > >> > Anastasia
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Gwen Shapira
> > > >> Product Manager | Confluent
> > > >> 650.450.2760 | @gwenshap
> > > >> Follow us: Twitter | blog
> > > >>
> > > >
> >
> >
> >
> > --
> > Gwen Shapira
> > Product Manager | Confluent
> > 650.450.2760 | @gwenshap
> > Follow us: Twitter | blog
> >
>
>
> --
> Best,
> Stanislav
>
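
For anyone following the window discussion in this thread, here is a minimal Python sketch of the windowed max as described in point 3 of the Jul 1 reply (a running max that is forgotten once the window lapses), together with the averaging effect noted in the Jul 2 follow-up. It is a simplified model of the behavior described in the thread, not Kafka's actual metrics code, and the names WindowedMax, record, measure and average_load_ms are hypothetical.

```python
class WindowedMax:
    """Simplified model: a running max that is forgotten when the window lapses.

    Illustrative only; this is not Kafka's metrics implementation.
    """

    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.window_start_ms = None  # no window is open until the first record
        self.current_max = None

    def _window_lapsed(self, now_ms):
        return (self.window_start_ms is None
                or now_ms - self.window_start_ms >= self.window_ms)

    def record(self, value_ms, now_ms):
        if self._window_lapsed(now_ms):
            # Start a new window and forget the previous window's max.
            self.window_start_ms = now_ms
            self.current_max = value_ms
        else:
            self.current_max = max(self.current_max, value_ms)

    def measure(self, now_ms):
        # Nothing to report if no window is open or the open one has lapsed.
        if self._window_lapsed(now_ms):
            return None
        return self.current_max


def average_load_ms(load_times_ms):
    """Plain mean of load times; empty partitions contribute 0 ms each."""
    if not load_times_ms:
        return 0.0
    return sum(load_times_ms) / len(load_times_ms)


# A 30 second window: the 4500 ms load is visible only until the window lapses.
metric = WindowedMax(window_ms=30_000)
metric.record(120, now_ms=0)
metric.record(4_500, now_ms=10_000)
in_window = metric.measure(now_ms=20_000)

metric.record(80, now_ms=31_000)          # lands in a new window
after_rollover = metric.measure(now_ms=31_000)

# One slow partition among nine empty ones: the max stays high, the average does not.
loads = [9_000] + [0] * 9
diluted_average = average_load_ms(loads)
```

With a short window, a long load shows up as a brief spike rather than a value held for hours, which is the display trade-off raised in the thread. The average example shows why the KIP notes that the average may look low when many partitions are empty.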
