Re: [DISCUSS] FLIP-361: Improve GC Metrics

Gyula Fóra Wed, 13 Sep 2023 11:28:58 -0700

Hi Venkata,

Unfortunately the GC CPU Usage , percentage or otherwise, is not something
that the JVM GC beans expose. So I am afraid we cannot easily add that.


Regards
Gyula

On Wed, Sep 13, 2023 at 7:59 PM Venkatakrishnan Sowrirajan <[email protected]>
wrote:

> Hi Gyula,
>
> Thanks for driving this FLIP.
>
> The proposal looks good to me. Only one minor suggestion I have is, can we
> also include the % GC time spent wrt the overall CPU time especially useful
> in the cases of TM which helps in easily identifying issues related to GC.
> Thoughts?
>
> Regards
> Venkata krishnan
>
>
> On Wed, Sep 13, 2023 at 6:13 AM Gyula Fóra <[email protected]> wrote:
>
> > Thanks for all the feedback, I will start the vote on this.
> >
> > Gyula
> >
> > On Wed, Sep 6, 2023 at 11:11 AM Xintong Song <[email protected]>
> > wrote:
> >
> > > >
> > > > I added the average time metric to the FLIP document. I also included
> > it
> > > > for the aggregate (total) across all collectors. But maybe it doesn't
> > > make
> > > > too much sense as collection times usually differ greatly depending
> on
> > > the
> > > > collector.
> > > >
> > >
> > > LGTM
> > >
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Wed, Sep 6, 2023 at 4:31 PM Gyula Fóra <[email protected]>
> wrote:
> > >
> > > > I added the average time metric to the FLIP document. I also included
> > it
> > > > for the aggregate (total) across all collectors. But maybe it doesn't
> > > make
> > > > too much sense as collection times usually differ greatly depending
> on
> > > the
> > > > collector.
> > > >
> > > > Gyula
> > > >
> > > > On Wed, Sep 6, 2023 at 10:21 AM Xintong Song <[email protected]>
> > > > wrote:
> > > >
> > > > > Thank you :)
> > > > >
> > > > > Best,
> > > > >
> > > > > Xintong
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Sep 6, 2023 at 4:17 PM Gyula Fóra <[email protected]>
> > > wrote:
> > > > >
> > > > > > Makes sense Xintong, I am happy to extend the proposal with the
> > > average
> > > > > gc
> > > > > > time metric +1
> > > > > >
> > > > > > Gyula
> > > > > >
> > > > > > On Wed, Sep 6, 2023 at 10:09 AM Xintong Song <
> > [email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > >
> > > > > > > > Just so I understand correctly, do you suggest adding a
> metric
> > > for
> > > > > > > > delta(Time) / delta(Count) since the last reporting ?
> > > > > > > > <Collector>.TimePerGc or <Collector>.AverageTime would make
> > > sense.
> > > > > > > > AverageTime may be a bit nicer :)
> > > > > > > >
> > > > > > >
> > > > > > > Yes, that's what I mean.
> > > > > > >
> > > > > > > My only concern is how useful this will be in reality. If there
> > are
> > > > > only
> > > > > > > > (or several) long pauses then the msPerSec metrics will show
> it
> > > > > > already,
> > > > > > > > and if there is a single long pause that may not be shown at
> > all
> > > if
> > > > > > there
> > > > > > > > are several shorter pauses as well with this metric.
> > > > > > >
> > > > > > >
> > > > > > > Let's say we measure this for every minute and see a 900
> msPerSec
> > > > > (which
> > > > > > > means 54s within the minute are spent on GC). This may come
> from
> > a
> > > > > single
> > > > > > > GC that lasts for 54s, or 2 GCs each lasting for ~27s, or more
> > GCs
> > > > with
> > > > > > > less time each. As the default heartbeat timeout is 50s, the
> > former
> > > > > means
> > > > > > > there's likely a heartbeat timeout due to the GC pause, while
> the
> > > > > latter
> > > > > > > means otherwise.
> > > > > > >
> > > > > > >
> > > > > > > Mathematically, it is possible that there's 1 long pause
> together
> > > > with
> > > > > > > several short pauses within the same measurement period, making
> > the
> > > > > long
> > > > > > > pause not observable with AverageTime. However, from my
> > experience,
> > > > > such
> > > > > > a
> > > > > > > pattern is not normal in reality. My observation is that GCs
> > happen
> > > > at
> > > > > a
> > > > > > > similar time usually take a similar length of time. Admittedly,
> > > this
> > > > is
> > > > > > not
> > > > > > > a hard guarantee.
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Xintong
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Sep 6, 2023 at 3:59 PM Gyula Fóra <
> [email protected]>
> > > > > wrote:
> > > > > > >
> > > > > > > > Matt Wang,
> > > > > > > >
> > > > > > > > I think the currently exposed info is all that is available
> > > through
> > > > > > > > GarbageCollectorMXBean. This FLIP does not aim to introduce a
> > new
> > > > > more
> > > > > > > > granular way of reporting the per collector metrics, that
> would
> > > > > > require a
> > > > > > > > new mechanism and may be a breaking change.
> > > > > > > >
> > > > > > > > We basically want to simply extend the current reporting here
> > > with
> > > > > the
> > > > > > > rate
> > > > > > > > metrics and the total metrics.
> > > > > > > >
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > On Wed, Sep 6, 2023 at 9:24 AM Matt Wang <[email protected]>
> > > wrote:
> > > > > > > >
> > > > > > > > > Hi Gyula,
> > > > > > > > >
> > > > > > > > > +1 for this proposal.
> > > > > > > > >
> > > > > > > > > Do we need to add a metric to record the count of different
> > > > > > > > > collectors? Now there is only a total count. For example,
> > > > > > > > > for G1, there is no way to distinguish whether it is the
> > > > > > > > > young generation or the old generation.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Matt Wang
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ---- Replied Message ----
> > > > > > > > > | From | Gyula Fóra<[email protected]> |
> > > > > > > > > | Date | 09/6/2023 15:03 |
> > > > > > > > > | To | <[email protected]> |
> > > > > > > > > | Subject | Re: [DISCUSS] FLIP-361: Improve GC Metrics |
> > > > > > > > > Thanks Xintong!
> > > > > > > > >
> > > > > > > > > Just so I understand correctly, do you suggest adding a
> > metric
> > > > for
> > > > > > > > > delta(Time) / delta(Count) since the last reporting ?
> > > > > > > > > <Collector>.TimePerGc or <Collector>.AverageTime would make
> > > > sense.
> > > > > > > > > AverageTime may be a bit nicer :)
> > > > > > > > >
> > > > > > > > > My only concern is how useful this will be in reality. If
> > there
> > > > are
> > > > > > > only
> > > > > > > > > (or several) long pauses then the msPerSec metrics will
> show
> > it
> > > > > > > already,
> > > > > > > > > and if there is a single long pause that may not be shown
> at
> > > all
> > > > if
> > > > > > > there
> > > > > > > > > are several shorter pauses as well with this metric.
> > > > > > > > >
> > > > > > > > > Gyula
> > > > > > > > >
> > > > > > > > > On Wed, Sep 6, 2023 at 8:46 AM Xintong Song <
> > > > [email protected]
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Thanks for bringing this up, Gyula.
> > > > > > > > >
> > > > > > > > > The proposed changes make sense to me. +1 for them.
> > > > > > > > >
> > > > > > > > > In addition to the proposed changes, I wonder if we should
> > also
> > > > add
> > > > > > > > > something like timePerGc? This would help understand
> whether
> > > > there
> > > > > > are
> > > > > > > > long
> > > > > > > > > pauses, due to GC STW, that may lead to rpc
> unresponsiveness
> > > and
> > > > > > > > heartbeat
> > > > > > > > > timeouts. Ideally, we'd like to understand the max pause
> time
> > > per
> > > > > STW
> > > > > > > in
> > > > > > > > a
> > > > > > > > > recent time window. However, I don't see an easy way to
> > > separate
> > > > > the
> > > > > > > > pause
> > > > > > > > > time of each STW. Deriving the overall time per GC from the
> > > > > existing
> > > > > > > > > metrics (time-increment / count-increment) seems to be a
> good
> > > > > > > > alternative.
> > > > > > > > > WDYT?
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > >
> > > > > > > > > Xintong
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Sep 6, 2023 at 2:16 PM Rui Fan <
> [email protected]
> > >
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > Thanks for the clarification!
> > > > > > > > >
> > > > > > > > > By default the meterview measures for 1 minute sounds good
> to
> > > me!
> > > > > > > > >
> > > > > > > > > +1 for this proposal.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Rui
> > > > > > > > >
> > > > > > > > > On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra <
> > > [email protected]>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Thanks for the feedback Rui,
> > > > > > > > >
> > > > > > > > > The rates would be computed using the MeterView class (like
> > for
> > > > any
> > > > > > > > > other
> > > > > > > > > rate metric), just because we report the value per second
> it
> > > > > doesn't
> > > > > > > > > mean
> > > > > > > > > that we measure in a second granularity.
> > > > > > > > > By default the meterview measures for 1 minute and then we
> > > > > calculate
> > > > > > > > > the
> > > > > > > > > per second rates, but we can increase the timespan if
> > > necessary.
> > > > > > > > >
> > > > > > > > > So I don't think we run into this problem in practice and
> we
> > > can
> > > > > keep
> > > > > > > > > the
> > > > > > > > > metric aligned with other time rate metrics like
> > > busyTimeMsPerSec
> > > > > > etc.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Gyula
> > > > > > > > >
> > > > > > > > > On Wed, Sep 6, 2023 at 4:55 AM Rui Fan <
> [email protected]
> > >
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Gyula,
> > > > > > > > >
> > > > > > > > > +1 for this proposal. The current GC metric is really
> > > unfriendly.
> > > > > > > > >
> > > > > > > > > I have a concern with your proposed rate metric: the rate
> is
> > > > > > > > > perSecond
> > > > > > > > > instead of per minute. I'm unsure whether it's suitable for
> > GC
> > > > > > > > > metric.
> > > > > > > > >
> > > > > > > > > There are two reasons why I suspect perSecond may not be
> well
> > > > > > > > > compatible with GC metric:
> > > > > > > > >
> > > > > > > > > 1. GCs are usually infrequent and may only occur for a
> small
> > > > number
> > > > > > > > > of time periods within a minute.
> > > > > > > > >
> > > > > > > > > Metrics are collected periodically, for example, reported
> > every
> > > > > > > > > minute.
> > > > > > > > > If the result reported by the GC metric is 1s/perSecond, it
> > > does
> > > > > not
> > > > > > > > > mean that the GC of the TM is serious, because there may be
> > no
> > > GC
> > > > > > > > > in the remaining 59s.
> > > > > > > > >
> > > > > > > > > On the contrary, the GC metric reports 0s/perSecond, which
> > does
> > > > not
> > > > > > > > > mean that the GC of the TM is not serious, and the GC may
> be
> > > very
> > > > > > > > > serious in the remaining 59s.
> > > > > > > > >
> > > > > > > > > 2. Stop-the-world may cause the metric to fail(delay) to
> > report
> > > > > > > > >
> > > > > > > > > The TM will stop the world during GC, especially full GC.
> It
> > > > means
> > > > > > > > > the metric cannot be collected or reported during full GC.
> > > > > > > > >
> > > > > > > > > So the collected GC metric may never be 1s/perSecond. This
> > > metric
> > > > > > > > > may always be good because the metric will only be reported
> > > when
> > > > > > > > > the GC is not severe.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > If these concerns make sense, how about updating the GC
> rate
> > > > > > > > > at minute level?
> > > > > > > > >
> > > > > > > > > We can define the type to Gauge for TimeMsPerMiunte, and
> > > updating
> > > > > > > > > this Gauge every second, it is:
> > > > > > > > > GC Total.Time of current time - GC total time of one miunte
> > > ago.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Rui
> > > > > > > > >
> > > > > > > > > On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels <
> > > > [email protected]
> > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Gyula,
> > > > > > > > >
> > > > > > > > > +1 The proposed changes make sense and are in line with
> what
> > is
> > > > > > > > > available for other metrics, e.g. number of records
> > processed.
> > > > > > > > >
> > > > > > > > > -Max
> > > > > > > > >
> > > > > > > > > On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra <
> > > [email protected]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Devs,
> > > > > > > > >
> > > > > > > > > I would like to start a discussion on FLIP-361: Improve GC
> > > > > > > > > Metrics
> > > > > > > > > [1].
> > > > > > > > >
> > > > > > > > > The current Flink GC metrics [2] are not very useful for
> > > > > > > > > monitoring
> > > > > > > > > purposes as they require post processing logic that is also
> > > > > > > > > dependent
> > > > > > > > > on
> > > > > > > > > the current runtime environment.
> > > > > > > > >
> > > > > > > > > Problems:
> > > > > > > > > - Total time is not very relevant for long running
> > > applications,
> > > > > > > > > only
> > > > > > > > > the
> > > > > > > > > rate of change (msPerSec)
> > > > > > > > > - In most cases it's best to simply aggregate the
> time/count
> > > > > > > > > across
> > > > > > > > > the
> > > > > > > > > different GabrageCollectors, however the specific
> collectors
> > > are
> > > > > > > > > dependent
> > > > > > > > > on the current Java runtime
> > > > > > > > >
> > > > > > > > > We propose to improve the current situation by:
> > > > > > > > > - Exposing rate metrics per GarbageCollector
> > > > > > > > > - Exposing aggregated Total time/count/rate metrics
> > > > > > > > >
> > > > > > > > > These new metrics are all derived from the existing ones
> with
> > > > > > > > > minimal
> > > > > > > > > overhead.
> > > > > > > > >
> > > > > > > > > Looking forward to your feedback.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Gyula
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-361*3A*Improve*GC*Metrics__;JSsrKw!!IKRxdwAv5BmarQ!cw2nH8ALLzxCrHdoE6Uw58Nk5dcPvUM-Y_hcjiIszqPGjDAKi-pJUhwEnQ5uBjC_m73XC-yu6183qznWNCA$
> > > > > > > > > [2]
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.com/v3/__https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/*garbagecollection__;Iw!!IKRxdwAv5BmarQ!cw2nH8ALLzxCrHdoE6Uw58Nk5dcPvUM-Y_hcjiIszqPGjDAKi-pJUhwEnQ5uBjC_m73XC-yu61837rPCNDo$
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-361: Improve GC Metrics

Reply via email to