Hey! Thanks for looking over this KIP. The metrics are computed across all the partitions per manager, not per partition. I'll clear this up in the KIP.
Anastasia

On Sat, Jul 13, 2019, 2:15 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:

> Hey there,
>
> Thanks for the KIP! I always smile when I see new metrics.
> I have just one quick question - are these metrics per partition or a
> total of all the partitions per manager? The KIP has conflicting
> sentences so I'm not sure what it is.
>
> Thanks,
> Stanislav
>
> On Wed, Jul 3, 2019 at 9:13 PM Gwen Shapira <g...@confluent.io> wrote:
>
> > It looks great! If there are no more concerns, let's start a vote.
> >
> > On Tue, Jul 2, 2019 at 2:59 PM Anastasia Vela <av...@confluent.io> wrote:
> >
> > > After further discussion with Anna, we decided the following:
> > > - add the average metric, but note in the KIP that the average
> > >   value may look low at times when there are many empty partitions
> > >   that have a 0ms load time
> > > - set the window to 30sec, because there is no significant
> > >   difference if we set the window time to 3hrs, so I will keep the
> > >   default value instead.
> > >
> > > Thanks. Let me know if there are any more concerns.
> > > Anastasia
> > >
> > > On Mon, Jul 1, 2019 at 9:13 AM Anastasia Vela <av...@confluent.io> wrote:
> > >
> > > > Hey Gwen!
> > > >
> > > > Thanks for reviewing my KIP!
> > > >
> > > > 1. I did consider adding an Avg metric as well. Anna and I decided
> > > > that a max would provide the crucial information. We just need to
> > > > know if there was a long load time, and expose what that duration
> > > > was so we understand there's downtime for such a long time.
> > > > However, I do agree that it may be necessary to compute averages
> > > > if we want to give the max a reference point. I can easily add
> > > > this if we believe it is necessary.
> > > > 2. The default refers to the metric configuration set when you
> > > > initialize KafkaServer. When I was running tests, the max value
> > > > was computed over a window of 30 seconds, unless I changed the
> > > > metrics config. So I noted that unless we change the config for
> > > > this specific metric, it will be computed over the default window.
> > > > 3. I proposed a 3 hour window because we have (very rarely) seen
> > > > partitions take hours to load. 3 hours was an upper bound for how
> > > > long a load could take. The way max works is that it computes the
> > > > running max until the window has lapsed. Then it starts a new
> > > > window and forgets the max value of the last window. So if a
> > > > partition takes more than the window time to load, there will be
> > > > one value in that window and the next load will be part of a new
> > > > window. I guess it just depends on how we want it to be displayed
> > > > on the graph. If it's ok for this behavior to happen, the window
> > > > can be shrunk. Regarding the rate metric, I was actually thinking
> > > > about doing this, but I was told that loads don't happen very
> > > > often. But it is true that if the reload happens very often then
> > > > that may be a problem.
> > > >
> > > > Thanks,
> > > > Anastasia
> > > >
> > > > On Fri, Jun 28, 2019 at 4:27 PM Gwen Shapira <g...@confluent.io> wrote:
> > > >
> > > >> Hey,
> > > >>
> > > >> Thank you for proposing this! Sounds really useful - we have
> > > >> definitely seen some difficult-to-explain pauses in consumer
> > > >> activity and this metric will let us correlate those.
> > > >>
> > > >> A few questions:
> > > >> 1. Did you consider adding both Max and Avg metrics? Many of our
> > > >> metrics have both (batch-size and message-size for example) and
> > > >> it helps put the max value in context.
> > > >> 2. You wrote: "Lengthening or shortening the 3 hour time window
> > > >> is up for discussion (default is 30sec)." and I'm not sure what
> > > >> default you are referring to?
> > > >> 3. Can you also give some background on why you are proposing
> > > >> 3h? I'm guessing it is because loading the state from the topic
> > > >> happens rarely enough that in 3h it will probably only happen
> > > >> once or not at all? Perhaps we need a rate metric to see how
> > > >> often it actually happens (if we have to reload offsets very
> > > >> often it is a different problem).
> > > >>
> > > >> Gwen
> > > >>
> > > >> On Tue, Jun 25, 2019 at 4:43 PM Anastasia Vela <av...@confluent.io> wrote:
> > > >> >
> > > >> > Hi all,
> > > >> >
> > > >> > I'd like to discuss KIP-484:
> > > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-484%3A+Expose+metrics+for+group+and+transaction+metadata+loading+duration
> > > >> >
> > > >> > Let me know what you think!
> > > >> >
> > > >> > Thanks,
> > > >> > Anastasia
> > > >>
> > > >> --
> > > >> Gwen Shapira
> > > >> Product Manager | Confluent
> > > >> 650.450.2760 | @gwenshap
> > > >> Follow us: Twitter | blog
> >
> > --
> > Gwen Shapira
> > Product Manager | Confluent
> > 650.450.2760 | @gwenshap
> > Follow us: Twitter | blog
>
> --
> Best,
> Stanislav
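
For reference, two points from the thread can be sketched in a few lines: the window behavior described for the max (a running max that is forgotten once the window lapses) and the way empty partitions with a 0ms load time pull the average down. This is a simplified Python illustration that follows the description in the thread, not Kafka's actual metrics code; `WindowedMax`, `average`, and all the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WindowedMax:
    """Hypothetical running max that resets when its window lapses."""
    window_ms: int
    window_start_ms: Optional[int] = None
    current_max: float = 0.0

    def _lapsed(self, now_ms: int) -> bool:
        # No window yet, or the current one is older than window_ms.
        return (self.window_start_ms is None
                or now_ms - self.window_start_ms >= self.window_ms)

    def record(self, value: float, now_ms: int) -> None:
        if self._lapsed(now_ms):
            # Start a new window and forget the previous window's max.
            self.window_start_ms = now_ms
            self.current_max = value
        else:
            self.current_max = max(self.current_max, value)

    def value(self, now_ms: int) -> Optional[float]:
        # Assumption: report None when no load fell in the current window.
        return None if self._lapsed(now_ms) else self.current_max


def average(load_times_ms: List[float]) -> float:
    """Plain mean of the load times; empty partitions contribute 0ms."""
    return sum(load_times_ms) / len(load_times_ms)


if __name__ == "__main__":
    m = WindowedMax(window_ms=30_000)   # 30 second window, as in the thread
    m.record(500.0, now_ms=0)
    m.record(12_000.0, now_ms=10_000)
    print(m.value(now_ms=20_000))       # max within the first window
    m.record(200.0, now_ms=40_000)      # first window lapsed, new one starts
    print(m.value(now_ms=41_000))       # earlier max is no longer visible

    # Four empty partitions (0ms) and one slow load.
    print(average([0, 0, 0, 0, 12_000]))
```

With a short window, a single slow load is visible only until its window lapses, which is the graph behavior discussed in point 3; the last call shows why the average alone can look low when most partitions are empty.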