Hey Gwen!

Thanks for reviewing my KIP!

1. I did consider adding an Avg metric as well. Anna and I decided that a
max would provide the crucial information: we just need to know whether
there was a long load time, and what that duration was, so we can see how
long the downtime lasted. However, I do agree that an average may be
necessary if we want to give the max a reference point. I can easily add
this if we believe it is necessary.
2. The default refers to the metric configuration set when you initialize
KafkaServer. When I was running tests, the max value was computed over a
window of 30 seconds, unless I changed the metrics config. So I noted that
unless we change the config for this specific metric, it will be computed
over the default window.
3. I proposed a 3 hour window because we have (very rarely) seen partitions
take hours to load, and 3 hours was an upper bound for how long a load could
take. The way max works is that it computes the running max until the
window has lapsed. Then it starts a new window and forgets the max value of
the last window. So if a partition takes longer than the window to load,
that load will be the only value in its window, and the next load will fall
into a new window. I guess it just depends on how we want it to be displayed
on the graph. If this behavior is acceptable, the window can be shrunk.
Regarding the rate metric, I was actually thinking about adding one, but I
was told that loads don't happen very often. It is true, though, that if
reloads do happen very often, that may be a problem worth exposing.
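
To illustrate the window behavior described in point 3, here is a rough
Python sketch. It is a toy model only: the class name WindowedMax, its
methods, and the sample durations are all made up for illustration, and it
is not the actual Kafka metrics implementation (which may keep more than one
sample per window). It just shows how a max is forgotten once its window
lapses.

```python
class WindowedMax:
    """Toy model (hypothetical): running max inside a fixed time window.

    The max is forgotten as soon as the window lapses.
    """

    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.window_start = None  # start time of the current window
        self.current_max = None   # running max inside the current window

    def _expired(self, now_ms):
        return (self.window_start is None
                or now_ms - self.window_start >= self.window_ms)

    def record(self, value, now_ms):
        if self._expired(now_ms):
            # Window lapsed: start a new one and drop the old max.
            self.window_start = now_ms
            self.current_max = value
        else:
            self.current_max = max(self.current_max, value)

    def measure(self, now_ms):
        # No value is reported once the window has lapsed.
        return None if self._expired(now_ms) else self.current_max


# Two made-up load durations (ms), recorded 40 seconds apart.
short = WindowedMax(window_ms=30_000)            # 30 second window
long_ = WindowedMax(window_ms=3 * 60 * 60 * 1000)  # 3 hour window
for m in (short, long_):
    m.record(45_000, now_ms=0)
    m.record(200, now_ms=40_000)

short_result = short.measure(now_ms=40_000)  # slow load already forgotten
long_result = long_.measure(now_ms=40_000)   # slow load still visible
```

With the 30 second window the slow load has already dropped out by the time
of the second load, while the 3 hour window still reports it. That is the
trade-off in choosing the window size.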

Thanks,
Anastasia

On Fri, Jun 28, 2019 at 4:27 PM Gwen Shapira <g...@confluent.io> wrote:

> Hey,
>
> Thank you for proposing this! Sounds really useful - we have
> definitely seen some difficult-to-explain pauses in consumer activity
> and this metric will let us correlate those.
>
> A few questions:
> 1. Did you consider adding both Max and Avg metrics? Many of our
> metrics have both (batch-size and message-size for example) and it
> helps put the max value in context.
> 2. You wrote: "Lengthening or shortening the 3 hour time window is up
> for discussion (default is 30sec)."  and I'm not sure what default you
> are referring to?
> 3. Can you also give some background on why you are proposing 3h? I'm
> guessing it is because loading the state from the topic happens rarely
> enough that in 3h it will probably only happen once or not at all?
> Perhaps we need a rate metric to see how often it actually happens (if
> we have to reload offsets very often it is a different problem).
>
> Gwen
>
> On Tue, Jun 25, 2019 at 4:43 PM Anastasia Vela <av...@confluent.io> wrote:
> >
> > Hi all,
> >
> > I'd like to discuss KIP-484:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-484%3A+Expose+metrics+for+group+and+transaction+metadata+loading+duration
> >
> > Let me know what you think!
> >
> > Thanks,
> > Anastasia
>
>
>
> --
> Gwen Shapira
> Product Manager | Confluent
> 650.450.2760 | @gwenshap
> Follow us: Twitter | blog
>
