Hi Venkata, Unfortunately the GC CPU Usage , percentage or otherwise, is not something that the JVM GC beans expose. So I am afraid we cannot easily add that.
Regards Gyula On Wed, Sep 13, 2023 at 7:59 PM Venkatakrishnan Sowrirajan <vsowr...@asu.edu> wrote: > Hi Gyula, > > Thanks for driving this FLIP. > > The proposal looks good to me. Only one minor suggestion I have is, can we > also include the % GC time spent wrt the overall CPU time especially useful > in the cases of TM which helps in easily identifying issues related to GC. > Thoughts? > > Regards > Venkata krishnan > > > On Wed, Sep 13, 2023 at 6:13 AM Gyula Fóra <gyula.f...@gmail.com> wrote: > > > Thanks for all the feedback, I will start the vote on this. > > > > Gyula > > > > On Wed, Sep 6, 2023 at 11:11 AM Xintong Song <tonysong...@gmail.com> > > wrote: > > > > > > > > > > I added the average time metric to the FLIP document. I also included > > it > > > > for the aggregate (total) across all collectors. But maybe it doesn't > > > make > > > > too much sense as collection times usually differ greatly depending > on > > > the > > > > collector. > > > > > > > > > > LGTM > > > > > > > > > Best, > > > > > > Xintong > > > > > > > > > > > > On Wed, Sep 6, 2023 at 4:31 PM Gyula Fóra <gyula.f...@gmail.com> > wrote: > > > > > > > I added the average time metric to the FLIP document. I also included > > it > > > > for the aggregate (total) across all collectors. But maybe it doesn't > > > make > > > > too much sense as collection times usually differ greatly depending > on > > > the > > > > collector. > > > > > > > > Gyula > > > > > > > > On Wed, Sep 6, 2023 at 10:21 AM Xintong Song <tonysong...@gmail.com> > > > > wrote: > > > > > > > > > Thank you :) > > > > > > > > > > Best, > > > > > > > > > > Xintong > > > > > > > > > > > > > > > > > > > > On Wed, Sep 6, 2023 at 4:17 PM Gyula Fóra <gyula.f...@gmail.com> > > > wrote: > > > > > > > > > > > Makes sense Xintong, I am happy to extend the proposal with the > > > average > > > > > gc > > > > > > time metric +1 > > > > > > > > > > > > Gyula > > > > > > > > > > > > On Wed, Sep 6, 2023 at 10:09 AM Xintong Song < > > tonysong...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > Just so I understand correctly, do you suggest adding a > metric > > > for > > > > > > > > delta(Time) / delta(Count) since the last reporting ? > > > > > > > > <Collector>.TimePerGc or <Collector>.AverageTime would make > > > sense. > > > > > > > > AverageTime may be a bit nicer :) > > > > > > > > > > > > > > > > > > > > > > Yes, that's what I mean. > > > > > > > > > > > > > > My only concern is how useful this will be in reality. If there > > are > > > > > only > > > > > > > > (or several) long pauses then the msPerSec metrics will show > it > > > > > > already, > > > > > > > > and if there is a single long pause that may not be shown at > > all > > > if > > > > > > there > > > > > > > > are several shorter pauses as well with this metric. > > > > > > > > > > > > > > > > > > > > > Let's say we measure this for every minute and see a 900 > msPerSec > > > > > (which > > > > > > > means 54s within the minute are spent on GC). This may come > from > > a > > > > > single > > > > > > > GC that lasts for 54s, or 2 GCs each lasting for ~27s, or more > > GCs > > > > with > > > > > > > less time each. As the default heartbeat timeout is 50s, the > > former > > > > > means > > > > > > > there's likely a heartbeat timeout due to the GC pause, while > the > > > > > latter > > > > > > > means otherwise. > > > > > > > > > > > > > > > > > > > > > Mathematically, it is possible that there's 1 long pause > together > > > > with > > > > > > > several short pauses within the same measurement period, making > > the > > > > > long > > > > > > > pause not observable with AverageTime. However, from my > > experience, > > > > > such > > > > > > a > > > > > > > pattern is not normal in reality. My observation is that GCs > > happen > > > > at > > > > > a > > > > > > > similar time usually take a similar length of time. Admittedly, > > > this > > > > is > > > > > > not > > > > > > > a hard guarantee. > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > > > Xintong > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Sep 6, 2023 at 3:59 PM Gyula Fóra < > gyula.f...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > > > Matt Wang, > > > > > > > > > > > > > > > > I think the currently exposed info is all that is available > > > through > > > > > > > > GarbageCollectorMXBean. This FLIP does not aim to introduce a > > new > > > > > more > > > > > > > > granular way of reporting the per collector metrics, that > would > > > > > > require a > > > > > > > > new mechanism and may be a breaking change. > > > > > > > > > > > > > > > > We basically want to simply extend the current reporting here > > > with > > > > > the > > > > > > > rate > > > > > > > > metrics and the total metrics. > > > > > > > > > > > > > > > > Gyula > > > > > > > > > > > > > > > > On Wed, Sep 6, 2023 at 9:24 AM Matt Wang <wang...@163.com> > > > wrote: > > > > > > > > > > > > > > > > > Hi Gyula, > > > > > > > > > > > > > > > > > > +1 for this proposal. > > > > > > > > > > > > > > > > > > Do we need to add a metric to record the count of different > > > > > > > > > collectors? Now there is only a total count. For example, > > > > > > > > > for G1, there is no way to distinguish whether it is the > > > > > > > > > young generation or the old generation. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > Matt Wang > > > > > > > > > > > > > > > > > > > > > > > > > > > ---- Replied Message ---- > > > > > > > > > | From | Gyula Fóra<gyula.f...@gmail.com> | > > > > > > > > > | Date | 09/6/2023 15:03 | > > > > > > > > > | To | <dev@flink.apache.org> | > > > > > > > > > | Subject | Re: [DISCUSS] FLIP-361: Improve GC Metrics | > > > > > > > > > Thanks Xintong! > > > > > > > > > > > > > > > > > > Just so I understand correctly, do you suggest adding a > > metric > > > > for > > > > > > > > > delta(Time) / delta(Count) since the last reporting ? > > > > > > > > > <Collector>.TimePerGc or <Collector>.AverageTime would make > > > > sense. > > > > > > > > > AverageTime may be a bit nicer :) > > > > > > > > > > > > > > > > > > My only concern is how useful this will be in reality. If > > there > > > > are > > > > > > > only > > > > > > > > > (or several) long pauses then the msPerSec metrics will > show > > it > > > > > > > already, > > > > > > > > > and if there is a single long pause that may not be shown > at > > > all > > > > if > > > > > > > there > > > > > > > > > are several shorter pauses as well with this metric. > > > > > > > > > > > > > > > > > > Gyula > > > > > > > > > > > > > > > > > > On Wed, Sep 6, 2023 at 8:46 AM Xintong Song < > > > > tonysong...@gmail.com > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > Thanks for bringing this up, Gyula. > > > > > > > > > > > > > > > > > > The proposed changes make sense to me. +1 for them. > > > > > > > > > > > > > > > > > > In addition to the proposed changes, I wonder if we should > > also > > > > add > > > > > > > > > something like timePerGc? This would help understand > whether > > > > there > > > > > > are > > > > > > > > long > > > > > > > > > pauses, due to GC STW, that may lead to rpc > unresponsiveness > > > and > > > > > > > > heartbeat > > > > > > > > > timeouts. Ideally, we'd like to understand the max pause > time > > > per > > > > > STW > > > > > > > in > > > > > > > > a > > > > > > > > > recent time window. However, I don't see an easy way to > > > separate > > > > > the > > > > > > > > pause > > > > > > > > > time of each STW. Deriving the overall time per GC from the > > > > > existing > > > > > > > > > metrics (time-increment / count-increment) seems to be a > good > > > > > > > > alternative. > > > > > > > > > WDYT? > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > > > > > > > Xintong > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Sep 6, 2023 at 2:16 PM Rui Fan < > 1996fan...@gmail.com > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > Thanks for the clarification! > > > > > > > > > > > > > > > > > > By default the meterview measures for 1 minute sounds good > to > > > me! > > > > > > > > > > > > > > > > > > +1 for this proposal. > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > Rui > > > > > > > > > > > > > > > > > > On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra < > > > gyula.f...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > Thanks for the feedback Rui, > > > > > > > > > > > > > > > > > > The rates would be computed using the MeterView class (like > > for > > > > any > > > > > > > > > other > > > > > > > > > rate metric), just because we report the value per second > it > > > > > doesn't > > > > > > > > > mean > > > > > > > > > that we measure in a second granularity. > > > > > > > > > By default the meterview measures for 1 minute and then we > > > > > calculate > > > > > > > > > the > > > > > > > > > per second rates, but we can increase the timespan if > > > necessary. > > > > > > > > > > > > > > > > > > So I don't think we run into this problem in practice and > we > > > can > > > > > keep > > > > > > > > > the > > > > > > > > > metric aligned with other time rate metrics like > > > busyTimeMsPerSec > > > > > > etc. > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > Gyula > > > > > > > > > > > > > > > > > > On Wed, Sep 6, 2023 at 4:55 AM Rui Fan < > 1996fan...@gmail.com > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > Hi Gyula, > > > > > > > > > > > > > > > > > > +1 for this proposal. The current GC metric is really > > > unfriendly. > > > > > > > > > > > > > > > > > > I have a concern with your proposed rate metric: the rate > is > > > > > > > > > perSecond > > > > > > > > > instead of per minute. I'm unsure whether it's suitable for > > GC > > > > > > > > > metric. > > > > > > > > > > > > > > > > > > There are two reasons why I suspect perSecond may not be > well > > > > > > > > > compatible with GC metric: > > > > > > > > > > > > > > > > > > 1. GCs are usually infrequent and may only occur for a > small > > > > number > > > > > > > > > of time periods within a minute. > > > > > > > > > > > > > > > > > > Metrics are collected periodically, for example, reported > > every > > > > > > > > > minute. > > > > > > > > > If the result reported by the GC metric is 1s/perSecond, it > > > does > > > > > not > > > > > > > > > mean that the GC of the TM is serious, because there may be > > no > > > GC > > > > > > > > > in the remaining 59s. > > > > > > > > > > > > > > > > > > On the contrary, the GC metric reports 0s/perSecond, which > > does > > > > not > > > > > > > > > mean that the GC of the TM is not serious, and the GC may > be > > > very > > > > > > > > > serious in the remaining 59s. > > > > > > > > > > > > > > > > > > 2. Stop-the-world may cause the metric to fail(delay) to > > report > > > > > > > > > > > > > > > > > > The TM will stop the world during GC, especially full GC. > It > > > > means > > > > > > > > > the metric cannot be collected or reported during full GC. > > > > > > > > > > > > > > > > > > So the collected GC metric may never be 1s/perSecond. This > > > metric > > > > > > > > > may always be good because the metric will only be reported > > > when > > > > > > > > > the GC is not severe. > > > > > > > > > > > > > > > > > > > > > > > > > > > If these concerns make sense, how about updating the GC > rate > > > > > > > > > at minute level? > > > > > > > > > > > > > > > > > > We can define the type to Gauge for TimeMsPerMiunte, and > > > updating > > > > > > > > > this Gauge every second, it is: > > > > > > > > > GC Total.Time of current time - GC total time of one miunte > > > ago. > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > Rui > > > > > > > > > > > > > > > > > > On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels < > > > > m...@apache.org > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > Hi Gyula, > > > > > > > > > > > > > > > > > > +1 The proposed changes make sense and are in line with > what > > is > > > > > > > > > available for other metrics, e.g. number of records > > processed. > > > > > > > > > > > > > > > > > > -Max > > > > > > > > > > > > > > > > > > On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra < > > > gyula.f...@gmail.com> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > Hi Devs, > > > > > > > > > > > > > > > > > > I would like to start a discussion on FLIP-361: Improve GC > > > > > > > > > Metrics > > > > > > > > > [1]. > > > > > > > > > > > > > > > > > > The current Flink GC metrics [2] are not very useful for > > > > > > > > > monitoring > > > > > > > > > purposes as they require post processing logic that is also > > > > > > > > > dependent > > > > > > > > > on > > > > > > > > > the current runtime environment. > > > > > > > > > > > > > > > > > > Problems: > > > > > > > > > - Total time is not very relevant for long running > > > applications, > > > > > > > > > only > > > > > > > > > the > > > > > > > > > rate of change (msPerSec) > > > > > > > > > - In most cases it's best to simply aggregate the > time/count > > > > > > > > > across > > > > > > > > > the > > > > > > > > > different GabrageCollectors, however the specific > collectors > > > are > > > > > > > > > dependent > > > > > > > > > on the current Java runtime > > > > > > > > > > > > > > > > > > We propose to improve the current situation by: > > > > > > > > > - Exposing rate metrics per GarbageCollector > > > > > > > > > - Exposing aggregated Total time/count/rate metrics > > > > > > > > > > > > > > > > > > These new metrics are all derived from the existing ones > with > > > > > > > > > minimal > > > > > > > > > overhead. > > > > > > > > > > > > > > > > > > Looking forward to your feedback. > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > Gyula > > > > > > > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-361*3A*Improve*GC*Metrics__;JSsrKw!!IKRxdwAv5BmarQ!cw2nH8ALLzxCrHdoE6Uw58Nk5dcPvUM-Y_hcjiIszqPGjDAKi-pJUhwEnQ5uBjC_m73XC-yu6183qznWNCA$ > > > > > > > > > [2] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://urldefense.com/v3/__https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/*garbagecollection__;Iw!!IKRxdwAv5BmarQ!cw2nH8ALLzxCrHdoE6Uw58Nk5dcPvUM-Y_hcjiIszqPGjDAKi-pJUhwEnQ5uBjC_m73XC-yu61837rPCNDo$ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >