Thanks for bringing this up, Gyula.

The proposed changes make sense to me. +1 for them.

In addition to the proposed changes, I wonder if we should also add
something like timePerGc? This would help understand whether there are long
pauses, due to GC STW, that may lead to rpc unresponsiveness and heartbeat
timeouts. Ideally, we'd like to understand the max pause time per STW in a
recent time window. However, I don't see an easy way to separate the pause
time of each STW. Deriving the overall time per GC from the existing
metrics (time-increment / count-increment) seems to be a good alternative.
WDYT?

Best,

Xintong



On Wed, Sep 6, 2023 at 2:16 PM Rui Fan <1996fan...@gmail.com> wrote:

> Thanks for the clarification!
>
> By default the meterview measures for 1 minute sounds good to me!
>
> +1 for this proposal.
>
> Best,
> Rui
>
> On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
> > Thanks for the feedback Rui,
> >
> > The rates would be computed using the MeterView class (like for any other
> > rate metric), just because we report the value per second it doesn't mean
> > that we measure in a second granularity.
> > By default the meterview measures for 1 minute and then we calculate the
> > per second rates, but we can increase the timespan if necessary.
> >
> > So I don't think we run into this problem in practice and we can keep the
> > metric aligned with other time rate metrics like busyTimeMsPerSec etc.
> >
> > Cheers,
> > Gyula
> >
> > On Wed, Sep 6, 2023 at 4:55 AM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > Hi Gyula,
> > >
> > > +1 for this proposal. The current GC metric is really unfriendly.
> > >
> > > I have a concern with your proposed rate metric: the rate is perSecond
> > > instead of per minute. I'm unsure whether it's suitable for GC metric.
> > >
> > > There are two reasons why I suspect perSecond may not be well
> > > compatible with GC metric:
> > >
> > > 1. GCs are usually infrequent and may only occur for a small number
> > > of time periods within a minute.
> > >
> > > Metrics are collected periodically, for example, reported every minute.
> > > If the result reported by the GC metric is 1s/perSecond, it does not
> > > mean that the GC of the TM is serious, because there may be no GC
> > > in the remaining 59s.
> > >
> > > On the contrary, the GC metric reports 0s/perSecond, which does not
> > > mean that the GC of the TM is not serious, and the GC may be very
> > > serious in the remaining 59s.
> > >
> > > 2. Stop-the-world may cause the metric to fail(delay) to report
> > >
> > > The TM will stop the world during GC, especially full GC. It means
> > > the metric cannot be collected or reported during full GC.
> > >
> > > So the collected GC metric may never be 1s/perSecond. This metric
> > > may always be good because the metric will only be reported when
> > > the GC is not severe.
> > >
> > >
> > > If these concerns make sense, how about updating the GC rate
> > > at minute level?
> > >
> > > We can define the type to Gauge for TimeMsPerMiunte, and updating
> > > this Gauge every second, it is:
> > > GC Total.Time of current time - GC total time of one miunte ago.
> > >
> > > Best,
> > > Rui
> > >
> > > On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels <m...@apache.org>
> > wrote:
> > >
> > > > Hi Gyula,
> > > >
> > > > +1 The proposed changes make sense and are in line with what is
> > > > available for other metrics, e.g. number of records processed.
> > > >
> > > > -Max
> > > >
> > > > On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra <gyula.f...@gmail.com>
> > wrote:
> > > > >
> > > > > Hi Devs,
> > > > >
> > > > > I would like to start a discussion on FLIP-361: Improve GC Metrics
> > [1].
> > > > >
> > > > > The current Flink GC metrics [2] are not very useful for monitoring
> > > > > purposes as they require post processing logic that is also
> dependent
> > > on
> > > > > the current runtime environment.
> > > > >
> > > > > Problems:
> > > > >  - Total time is not very relevant for long running applications,
> > only
> > > > the
> > > > > rate of change (msPerSec)
> > > > >  - In most cases it's best to simply aggregate the time/count
> across
> > > the
> > > > > different GabrageCollectors, however the specific collectors are
> > > > dependent
> > > > > on the current Java runtime
> > > > >
> > > > > We propose to improve the current situation by:
> > > > >  - Exposing rate metrics per GarbageCollector
> > > > >  - Exposing aggregated Total time/count/rate metrics
> > > > >
> > > > > These new metrics are all derived from the existing ones with
> minimal
> > > > > overhead.
> > > > >
> > > > > Looking forward to your feedback.
> > > > >
> > > > > Cheers,
> > > > > Gyula
> > > > >
> > > > > [1]
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
> > > > > [2]
> > > > >
> > > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#garbagecollection
> > > >
> > >
> >
>

Reply via email to