Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-13 Thread Gyula Fóra
cally, it is possible that there's 1 long pause
> together
> > > > with
> > > > > > > several short pauses within the same measurement period, making
> > the
> > > > > long
> > > > > > > pause not observable with AverageTime. However, from my
> > experience,
> > > > > such
> > > > > > a
> > > > > > > pattern is not normal in reality. My observation is that GCs
> > happen
> > > > at
> > > > > a
> > > > > > > similar time usually take a similar length of time. Admittedly,
> > > this
> > > > is
> > > > > > not
> > > > > > > a hard guarantee.
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Xintong
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Sep 6, 2023 at 3:59 PM Gyula Fóra <
> gyula.f...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Matt Wang,
> > > > > > > >
> > > > > > > > I think the currently exposed info is all that is available
> > > through
> > > > > > > > GarbageCollectorMXBean. This FLIP does not aim to introduce a
> > new
> > > > > more
> > > > > > > > granular way of reporting the per collector metrics, that
> would
> > > > > > require a
> > > > > > > > new mechanism and may be a breaking change.
> > > > > > > >
> > > > > > > > We basically want to simply extend the current reporting here
> > > with
> > > > > the
> > > > > > > rate
> > > > > > > > metrics and the total metrics.
> > > > > > > >
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > On Wed, Sep 6, 2023 at 9:24 AM Matt Wang 
> > > wrote:
> > > > > > > >
> > > > > > > > > Hi Gyula,
> > > > > > > > >
> > > > > > > > > +1 for this proposal.
> > > > > > > > >
> > > > > > > > > Do we need to add a metric to record the count of different
> > > > > > > > > collectors? Now there is only a total count. For example,
> > > > > > > > > for G1, there is no way to distinguish whether it is the
> > > > > > > > > young generation or the old generation.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Matt Wang
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  Replied Message 
> > > > > > > > > | From | Gyula Fóra |
> > > > > > > > > | Date | 09/6/2023 15:03 |
> > > > > > > > > | To |  |
> > > > > > > > > | Subject | Re: [DISCUSS] FLIP-361: Improve GC Metrics |
> > > > > > > > > Thanks Xintong!
> > > > > > > > >
> > > > > > > > > Just so I understand correctly, do you suggest adding a
> > metric
> > > > for
> > > > > > > > > delta(Time) / delta(Count) since the last reporting ?
> > > > > > > > > .TimePerGc or .AverageTime would make
> > > > sense.
> > > > > > > > > AverageTime may be a bit nicer :)
> > > > > > > > >
> > > > > > > > > My only concern is how useful this will be in reality. If
> > there
> > > > are
> > > > > > > only
> > > > > > > > > (or several) long pauses then the msPerSec metrics will
> show
> > it
> > > > > > > already,
> > > > > > > > > and if there is a single long pause that may not be shown
> at
> > > all
> > > > if
> > > > > > > there
> > > > > > > > > ar

Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-13 Thread Venkatakrishnan Sowrirajan
t; > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Sep 6, 2023 at 3:59 PM Gyula Fóra 
> > > > wrote:
> > > > > >
> > > > > > > Matt Wang,
> > > > > > >
> > > > > > > I think the currently exposed info is all that is available
> > through
> > > > > > > GarbageCollectorMXBean. This FLIP does not aim to introduce a
> new
> > > > more
> > > > > > > granular way of reporting the per collector metrics, that would
> > > > > require a
> > > > > > > new mechanism and may be a breaking change.
> > > > > > >
> > > > > > > We basically want to simply extend the current reporting here
> > with
> > > > the
> > > > > > rate
> > > > > > > metrics and the total metrics.
> > > > > > >
> > > > > > > Gyula
> > > > > > >
> > > > > > > On Wed, Sep 6, 2023 at 9:24 AM Matt Wang 
> > wrote:
> > > > > > >
> > > > > > > > Hi Gyula,
> > > > > > > >
> > > > > > > > +1 for this proposal.
> > > > > > > >
> > > > > > > > Do we need to add a metric to record the count of different
> > > > > > > > collectors? Now there is only a total count. For example,
> > > > > > > > for G1, there is no way to distinguish whether it is the
> > > > > > > > young generation or the old generation.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Matt Wang
> > > > > > > >
> > > > > > > >
> > > > > > > >  Replied Message 
> > > > > > > > | From | Gyula Fóra |
> > > > > > > > | Date | 09/6/2023 15:03 |
> > > > > > > > | To |  |
> > > > > > > > | Subject | Re: [DISCUSS] FLIP-361: Improve GC Metrics |
> > > > > > > > Thanks Xintong!
> > > > > > > >
> > > > > > > > Just so I understand correctly, do you suggest adding a
> metric
> > > for
> > > > > > > > delta(Time) / delta(Count) since the last reporting ?
> > > > > > > > .TimePerGc or .AverageTime would make
> > > sense.
> > > > > > > > AverageTime may be a bit nicer :)
> > > > > > > >
> > > > > > > > My only concern is how useful this will be in reality. If
> there
> > > are
> > > > > > only
> > > > > > > > (or several) long pauses then the msPerSec metrics will show
> it
> > > > > > already,
> > > > > > > > and if there is a single long pause that may not be shown at
> > all
> > > if
> > > > > > there
> > > > > > > > are several shorter pauses as well with this metric.
> > > > > > > >
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > On Wed, Sep 6, 2023 at 8:46 AM Xintong Song <
> > > tonysong...@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Thanks for bringing this up, Gyula.
> > > > > > > >
> > > > > > > > The proposed changes make sense to me. +1 for them.
> > > > > > > >
> > > > > > > > In addition to the proposed changes, I wonder if we should
> also
> > > add
> > > > > > > > something like timePerGc? This would help understand whether
> > > there
> > > > > are
> > > > > > > long
> > > > > > > > pauses, due to GC STW, that may lead to rpc unresponsiveness
> > and
> > > > > > > heartbeat
> > > > > > > > timeouts. Ideally, we'd like to understand the max pause time
> > per
> > > > STW
> > > > > > in
> > > > > > > a
> > > > > > >

Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-13 Thread Gyula Fóra
Thanks for all the feedback, I will start the vote on this.

Gyula

On Wed, Sep 6, 2023 at 11:11 AM Xintong Song  wrote:

> >
> > I added the average time metric to the FLIP document. I also included it
> > for the aggregate (total) across all collectors. But maybe it doesn't
> make
> > too much sense as collection times usually differ greatly depending on
> the
> > collector.
> >
>
> LGTM
>
>
> Best,
>
> Xintong
>
>
>
> On Wed, Sep 6, 2023 at 4:31 PM Gyula Fóra  wrote:
>
> > I added the average time metric to the FLIP document. I also included it
> > for the aggregate (total) across all collectors. But maybe it doesn't
> make
> > too much sense as collection times usually differ greatly depending on
> the
> > collector.
> >
> > Gyula
> >
> > On Wed, Sep 6, 2023 at 10:21 AM Xintong Song 
> > wrote:
> >
> > > Thank you :)
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Wed, Sep 6, 2023 at 4:17 PM Gyula Fóra 
> wrote:
> > >
> > > > Makes sense Xintong, I am happy to extend the proposal with the
> average
> > > gc
> > > > time metric +1
> > > >
> > > > Gyula
> > > >
> > > > On Wed, Sep 6, 2023 at 10:09 AM Xintong Song 
> > > > wrote:
> > > >
> > > > > >
> > > > > > Just so I understand correctly, do you suggest adding a metric
> for
> > > > > > delta(Time) / delta(Count) since the last reporting ?
> > > > > > .TimePerGc or .AverageTime would make
> sense.
> > > > > > AverageTime may be a bit nicer :)
> > > > > >
> > > > >
> > > > > Yes, that's what I mean.
> > > > >
> > > > > My only concern is how useful this will be in reality. If there are
> > > only
> > > > > > (or several) long pauses then the msPerSec metrics will show it
> > > > already,
> > > > > > and if there is a single long pause that may not be shown at all
> if
> > > > there
> > > > > > are several shorter pauses as well with this metric.
> > > > >
> > > > >
> > > > > Let's say we measure this for every minute and see a 900 msPerSec
> > > (which
> > > > > means 54s within the minute are spent on GC). This may come from a
> > > single
> > > > > GC that lasts for 54s, or 2 GCs each lasting for ~27s, or more GCs
> > with
> > > > > less time each. As the default heartbeat timeout is 50s, the former
> > > means
> > > > > there's likely a heartbeat timeout due to the GC pause, while the
> > > latter
> > > > > means otherwise.
> > > > >
> > > > >
> > > > > Mathematically, it is possible that there's 1 long pause together
> > with
> > > > > several short pauses within the same measurement period, making the
> > > long
> > > > > pause not observable with AverageTime. However, from my experience,
> > > such
> > > > a
> > > > > pattern is not normal in reality. My observation is that GCs happen
> > at
> > > a
> > > > > similar time usually take a similar length of time. Admittedly,
> this
> > is
> > > > not
> > > > > a hard guarantee.
> > > > >
> > > > >
> > > > > Best,
> > > > >
> > > > > Xintong
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Sep 6, 2023 at 3:59 PM Gyula Fóra 
> > > wrote:
> > > > >
> > > > > > Matt Wang,
> > > > > >
> > > > > > I think the currently exposed info is all that is available
> through
> > > > > > GarbageCollectorMXBean. This FLIP does not aim to introduce a new
> > > more
> > > > > > granular way of reporting the per collector metrics, that would
> > > > require a
> > > > > > new mechanism and may be a breaking change.
> > > > > >
> > > > > > We basically want to simply extend the current reporting here
> with
> > > the
> > > > > rate
> > > > > > metrics and the total metrics.
> > > > > >
> > > > > > Gyula
> > > > > 

Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-06 Thread Xintong Song
>
> I added the average time metric to the FLIP document. I also included it
> for the aggregate (total) across all collectors. But maybe it doesn't make
> too much sense as collection times usually differ greatly depending on the
> collector.
>

LGTM


Best,

Xintong



On Wed, Sep 6, 2023 at 4:31 PM Gyula Fóra  wrote:

> I added the average time metric to the FLIP document. I also included it
> for the aggregate (total) across all collectors. But maybe it doesn't make
> too much sense as collection times usually differ greatly depending on the
> collector.
>
> Gyula
>
> On Wed, Sep 6, 2023 at 10:21 AM Xintong Song 
> wrote:
>
> > Thank you :)
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Wed, Sep 6, 2023 at 4:17 PM Gyula Fóra  wrote:
> >
> > > Makes sense Xintong, I am happy to extend the proposal with the average
> > gc
> > > time metric +1
> > >
> > > Gyula
> > >
> > > On Wed, Sep 6, 2023 at 10:09 AM Xintong Song 
> > > wrote:
> > >
> > > > >
> > > > > Just so I understand correctly, do you suggest adding a metric for
> > > > > delta(Time) / delta(Count) since the last reporting ?
> > > > > .TimePerGc or .AverageTime would make sense.
> > > > > AverageTime may be a bit nicer :)
> > > > >
> > > >
> > > > Yes, that's what I mean.
> > > >
> > > > My only concern is how useful this will be in reality. If there are
> > only
> > > > > (or several) long pauses then the msPerSec metrics will show it
> > > already,
> > > > > and if there is a single long pause that may not be shown at all if
> > > there
> > > > > are several shorter pauses as well with this metric.
> > > >
> > > >
> > > > Let's say we measure this for every minute and see a 900 msPerSec
> > (which
> > > > means 54s within the minute are spent on GC). This may come from a
> > single
> > > > GC that lasts for 54s, or 2 GCs each lasting for ~27s, or more GCs
> with
> > > > less time each. As the default heartbeat timeout is 50s, the former
> > means
> > > > there's likely a heartbeat timeout due to the GC pause, while the
> > latter
> > > > means otherwise.
> > > >
> > > >
> > > > Mathematically, it is possible that there's 1 long pause together
> with
> > > > several short pauses within the same measurement period, making the
> > long
> > > > pause not observable with AverageTime. However, from my experience,
> > such
> > > a
> > > > pattern is not normal in reality. My observation is that GCs happen
> at
> > a
> > > > similar time usually take a similar length of time. Admittedly, this
> is
> > > not
> > > > a hard guarantee.
> > > >
> > > >
> > > > Best,
> > > >
> > > > Xintong
> > > >
> > > >
> > > >
> > > > On Wed, Sep 6, 2023 at 3:59 PM Gyula Fóra 
> > wrote:
> > > >
> > > > > Matt Wang,
> > > > >
> > > > > I think the currently exposed info is all that is available through
> > > > > GarbageCollectorMXBean. This FLIP does not aim to introduce a new
> > more
> > > > > granular way of reporting the per collector metrics, that would
> > > require a
> > > > > new mechanism and may be a breaking change.
> > > > >
> > > > > We basically want to simply extend the current reporting here with
> > the
> > > > rate
> > > > > metrics and the total metrics.
> > > > >
> > > > > Gyula
> > > > >
> > > > > On Wed, Sep 6, 2023 at 9:24 AM Matt Wang  wrote:
> > > > >
> > > > > > Hi Gyula,
> > > > > >
> > > > > > +1 for this proposal.
> > > > > >
> > > > > > Do we need to add a metric to record the count of different
> > > > > > collectors? Now there is only a total count. For example,
> > > > > > for G1, there is no way to distinguish whether it is the
> > > > > > young generation or the old generation.
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > &g

Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-06 Thread Gyula Fóra
I added the average time metric to the FLIP document. I also included it
for the aggregate (total) across all collectors. But maybe it doesn't make
too much sense as collection times usually differ greatly depending on the
collector.

Gyula

On Wed, Sep 6, 2023 at 10:21 AM Xintong Song  wrote:

> Thank you :)
>
> Best,
>
> Xintong
>
>
>
> On Wed, Sep 6, 2023 at 4:17 PM Gyula Fóra  wrote:
>
> > Makes sense Xintong, I am happy to extend the proposal with the average
> gc
> > time metric +1
> >
> > Gyula
> >
> > On Wed, Sep 6, 2023 at 10:09 AM Xintong Song 
> > wrote:
> >
> > > >
> > > > Just so I understand correctly, do you suggest adding a metric for
> > > > delta(Time) / delta(Count) since the last reporting ?
> > > > .TimePerGc or .AverageTime would make sense.
> > > > AverageTime may be a bit nicer :)
> > > >
> > >
> > > Yes, that's what I mean.
> > >
> > > My only concern is how useful this will be in reality. If there are
> only
> > > > (or several) long pauses then the msPerSec metrics will show it
> > already,
> > > > and if there is a single long pause that may not be shown at all if
> > there
> > > > are several shorter pauses as well with this metric.
> > >
> > >
> > > Let's say we measure this for every minute and see a 900 msPerSec
> (which
> > > means 54s within the minute are spent on GC). This may come from a
> single
> > > GC that lasts for 54s, or 2 GCs each lasting for ~27s, or more GCs with
> > > less time each. As the default heartbeat timeout is 50s, the former
> means
> > > there's likely a heartbeat timeout due to the GC pause, while the
> latter
> > > means otherwise.
> > >
> > >
> > > Mathematically, it is possible that there's 1 long pause together with
> > > several short pauses within the same measurement period, making the
> long
> > > pause not observable with AverageTime. However, from my experience,
> such
> > a
> > > pattern is not normal in reality. My observation is that GCs happen at
> a
> > > similar time usually take a similar length of time. Admittedly, this is
> > not
> > > a hard guarantee.
> > >
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Wed, Sep 6, 2023 at 3:59 PM Gyula Fóra 
> wrote:
> > >
> > > > Matt Wang,
> > > >
> > > > I think the currently exposed info is all that is available through
> > > > GarbageCollectorMXBean. This FLIP does not aim to introduce a new
> more
> > > > granular way of reporting the per collector metrics, that would
> > require a
> > > > new mechanism and may be a breaking change.
> > > >
> > > > We basically want to simply extend the current reporting here with
> the
> > > rate
> > > > metrics and the total metrics.
> > > >
> > > > Gyula
> > > >
> > > > On Wed, Sep 6, 2023 at 9:24 AM Matt Wang  wrote:
> > > >
> > > > > Hi Gyula,
> > > > >
> > > > > +1 for this proposal.
> > > > >
> > > > > Do we need to add a metric to record the count of different
> > > > > collectors? Now there is only a total count. For example,
> > > > > for G1, there is no way to distinguish whether it is the
> > > > > young generation or the old generation.
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Best,
> > > > > Matt Wang
> > > > >
> > > > >
> > > > >  Replied Message 
> > > > > | From | Gyula Fóra |
> > > > > | Date | 09/6/2023 15:03 |
> > > > > | To |  |
> > > > > | Subject | Re: [DISCUSS] FLIP-361: Improve GC Metrics |
> > > > > Thanks Xintong!
> > > > >
> > > > > Just so I understand correctly, do you suggest adding a metric for
> > > > > delta(Time) / delta(Count) since the last reporting ?
> > > > > .TimePerGc or .AverageTime would make sense.
> > > > > AverageTime may be a bit nicer :)
> > > > >
> > > > > My only concern is how useful this will be in reality. If there are
> > > only
> > > > > (or several) long pauses then the m

Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-06 Thread Xintong Song
Thank you :)

Best,

Xintong



On Wed, Sep 6, 2023 at 4:17 PM Gyula Fóra  wrote:

> Makes sense Xintong, I am happy to extend the proposal with the average gc
> time metric +1
>
> Gyula
>
> On Wed, Sep 6, 2023 at 10:09 AM Xintong Song 
> wrote:
>
> > >
> > > Just so I understand correctly, do you suggest adding a metric for
> > > delta(Time) / delta(Count) since the last reporting ?
> > > .TimePerGc or .AverageTime would make sense.
> > > AverageTime may be a bit nicer :)
> > >
> >
> > Yes, that's what I mean.
> >
> > My only concern is how useful this will be in reality. If there are only
> > > (or several) long pauses then the msPerSec metrics will show it
> already,
> > > and if there is a single long pause that may not be shown at all if
> there
> > > are several shorter pauses as well with this metric.
> >
> >
> > Let's say we measure this for every minute and see a 900 msPerSec (which
> > means 54s within the minute are spent on GC). This may come from a single
> > GC that lasts for 54s, or 2 GCs each lasting for ~27s, or more GCs with
> > less time each. As the default heartbeat timeout is 50s, the former means
> > there's likely a heartbeat timeout due to the GC pause, while the latter
> > means otherwise.
> >
> >
> > Mathematically, it is possible that there's 1 long pause together with
> > several short pauses within the same measurement period, making the long
> > pause not observable with AverageTime. However, from my experience, such
> a
> > pattern is not normal in reality. My observation is that GCs happen at a
> > similar time usually take a similar length of time. Admittedly, this is
> not
> > a hard guarantee.
> >
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Wed, Sep 6, 2023 at 3:59 PM Gyula Fóra  wrote:
> >
> > > Matt Wang,
> > >
> > > I think the currently exposed info is all that is available through
> > > GarbageCollectorMXBean. This FLIP does not aim to introduce a new more
> > > granular way of reporting the per collector metrics, that would
> require a
> > > new mechanism and may be a breaking change.
> > >
> > > We basically want to simply extend the current reporting here with the
> > rate
> > > metrics and the total metrics.
> > >
> > > Gyula
> > >
> > > On Wed, Sep 6, 2023 at 9:24 AM Matt Wang  wrote:
> > >
> > > > Hi Gyula,
> > > >
> > > > +1 for this proposal.
> > > >
> > > > Do we need to add a metric to record the count of different
> > > > collectors? Now there is only a total count. For example,
> > > > for G1, there is no way to distinguish whether it is the
> > > > young generation or the old generation.
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Best,
> > > > Matt Wang
> > > >
> > > >
> > > >  Replied Message 
> > > > | From | Gyula Fóra |
> > > > | Date | 09/6/2023 15:03 |
> > > > | To |  |
> > > > | Subject | Re: [DISCUSS] FLIP-361: Improve GC Metrics |
> > > > Thanks Xintong!
> > > >
> > > > Just so I understand correctly, do you suggest adding a metric for
> > > > delta(Time) / delta(Count) since the last reporting ?
> > > > .TimePerGc or .AverageTime would make sense.
> > > > AverageTime may be a bit nicer :)
> > > >
> > > > My only concern is how useful this will be in reality. If there are
> > only
> > > > (or several) long pauses then the msPerSec metrics will show it
> > already,
> > > > and if there is a single long pause that may not be shown at all if
> > there
> > > > are several shorter pauses as well with this metric.
> > > >
> > > > Gyula
> > > >
> > > > On Wed, Sep 6, 2023 at 8:46 AM Xintong Song 
> > > wrote:
> > > >
> > > > Thanks for bringing this up, Gyula.
> > > >
> > > > The proposed changes make sense to me. +1 for them.
> > > >
> > > > In addition to the proposed changes, I wonder if we should also add
> > > > something like timePerGc? This would help understand whether there
> are
> > > long
> > > > pauses, due to GC STW, that may lead to rpc unresponsiveness and
> > > heartbeat
> > > >

Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-06 Thread Gyula Fóra
Makes sense Xintong, I am happy to extend the proposal with the average gc
time metric +1

Gyula

On Wed, Sep 6, 2023 at 10:09 AM Xintong Song  wrote:

> >
> > Just so I understand correctly, do you suggest adding a metric for
> > delta(Time) / delta(Count) since the last reporting ?
> > .TimePerGc or .AverageTime would make sense.
> > AverageTime may be a bit nicer :)
> >
>
> Yes, that's what I mean.
>
> My only concern is how useful this will be in reality. If there are only
> > (or several) long pauses then the msPerSec metrics will show it already,
> > and if there is a single long pause that may not be shown at all if there
> > are several shorter pauses as well with this metric.
>
>
> Let's say we measure this for every minute and see a 900 msPerSec (which
> means 54s within the minute are spent on GC). This may come from a single
> GC that lasts for 54s, or 2 GCs each lasting for ~27s, or more GCs with
> less time each. As the default heartbeat timeout is 50s, the former means
> there's likely a heartbeat timeout due to the GC pause, while the latter
> means otherwise.
>
>
> Mathematically, it is possible that there's 1 long pause together with
> several short pauses within the same measurement period, making the long
> pause not observable with AverageTime. However, from my experience, such a
> pattern is not normal in reality. My observation is that GCs happen at a
> similar time usually take a similar length of time. Admittedly, this is not
> a hard guarantee.
>
>
> Best,
>
> Xintong
>
>
>
> On Wed, Sep 6, 2023 at 3:59 PM Gyula Fóra  wrote:
>
> > Matt Wang,
> >
> > I think the currently exposed info is all that is available through
> > GarbageCollectorMXBean. This FLIP does not aim to introduce a new more
> > granular way of reporting the per collector metrics, that would require a
> > new mechanism and may be a breaking change.
> >
> > We basically want to simply extend the current reporting here with the
> rate
> > metrics and the total metrics.
> >
> > Gyula
> >
> > On Wed, Sep 6, 2023 at 9:24 AM Matt Wang  wrote:
> >
> > > Hi Gyula,
> > >
> > > +1 for this proposal.
> > >
> > > Do we need to add a metric to record the count of different
> > > collectors? Now there is only a total count. For example,
> > > for G1, there is no way to distinguish whether it is the
> > > young generation or the old generation.
> > >
> > >
> > >
> > > --
> > >
> > > Best,
> > > Matt Wang
> > >
> > >
> > >  Replied Message 
> > > | From | Gyula Fóra |
> > > | Date | 09/6/2023 15:03 |
> > > | To |  |
> > > | Subject | Re: [DISCUSS] FLIP-361: Improve GC Metrics |
> > > Thanks Xintong!
> > >
> > > Just so I understand correctly, do you suggest adding a metric for
> > > delta(Time) / delta(Count) since the last reporting ?
> > > .TimePerGc or .AverageTime would make sense.
> > > AverageTime may be a bit nicer :)
> > >
> > > My only concern is how useful this will be in reality. If there are
> only
> > > (or several) long pauses then the msPerSec metrics will show it
> already,
> > > and if there is a single long pause that may not be shown at all if
> there
> > > are several shorter pauses as well with this metric.
> > >
> > > Gyula
> > >
> > > On Wed, Sep 6, 2023 at 8:46 AM Xintong Song 
> > wrote:
> > >
> > > Thanks for bringing this up, Gyula.
> > >
> > > The proposed changes make sense to me. +1 for them.
> > >
> > > In addition to the proposed changes, I wonder if we should also add
> > > something like timePerGc? This would help understand whether there are
> > long
> > > pauses, due to GC STW, that may lead to rpc unresponsiveness and
> > heartbeat
> > > timeouts. Ideally, we'd like to understand the max pause time per STW
> in
> > a
> > > recent time window. However, I don't see an easy way to separate the
> > pause
> > > time of each STW. Deriving the overall time per GC from the existing
> > > metrics (time-increment / count-increment) seems to be a good
> > alternative.
> > > WDYT?
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Wed, Sep 6, 2023 at 2:16 PM Rui Fan <1996fan...@gmail.com> wrote:
> > >
> > > Thanks for the clarification!
> >

Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-06 Thread Xintong Song
>
> Just so I understand correctly, do you suggest adding a metric for
> delta(Time) / delta(Count) since the last reporting ?
> .TimePerGc or .AverageTime would make sense.
> AverageTime may be a bit nicer :)
>

Yes, that's what I mean.

My only concern is how useful this will be in reality. If there are only
> (or several) long pauses then the msPerSec metrics will show it already,
> and if there is a single long pause that may not be shown at all if there
> are several shorter pauses as well with this metric.


Let's say we measure this for every minute and see a 900 msPerSec (which
means 54s within the minute are spent on GC). This may come from a single
GC that lasts for 54s, or 2 GCs each lasting for ~27s, or more GCs with
less time each. As the default heartbeat timeout is 50s, the former means
there's likely a heartbeat timeout due to the GC pause, while the latter
means otherwise.


Mathematically, it is possible that there's 1 long pause together with
several short pauses within the same measurement period, making the long
pause not observable with AverageTime. However, from my experience, such a
pattern is not normal in reality. My observation is that GCs happen at a
similar time usually take a similar length of time. Admittedly, this is not
a hard guarantee.


Best,

Xintong



On Wed, Sep 6, 2023 at 3:59 PM Gyula Fóra  wrote:

> Matt Wang,
>
> I think the currently exposed info is all that is available through
> GarbageCollectorMXBean. This FLIP does not aim to introduce a new more
> granular way of reporting the per collector metrics, that would require a
> new mechanism and may be a breaking change.
>
> We basically want to simply extend the current reporting here with the rate
> metrics and the total metrics.
>
> Gyula
>
> On Wed, Sep 6, 2023 at 9:24 AM Matt Wang  wrote:
>
> > Hi Gyula,
> >
> > +1 for this proposal.
> >
> > Do we need to add a metric to record the count of different
> > collectors? Now there is only a total count. For example,
> > for G1, there is no way to distinguish whether it is the
> > young generation or the old generation.
> >
> >
> >
> > --
> >
> > Best,
> > Matt Wang
> >
> >
> >  Replied Message 
> > | From | Gyula Fóra |
> > | Date | 09/6/2023 15:03 |
> > | To |  |
> > | Subject | Re: [DISCUSS] FLIP-361: Improve GC Metrics |
> > Thanks Xintong!
> >
> > Just so I understand correctly, do you suggest adding a metric for
> > delta(Time) / delta(Count) since the last reporting ?
> > .TimePerGc or .AverageTime would make sense.
> > AverageTime may be a bit nicer :)
> >
> > My only concern is how useful this will be in reality. If there are only
> > (or several) long pauses then the msPerSec metrics will show it already,
> > and if there is a single long pause that may not be shown at all if there
> > are several shorter pauses as well with this metric.
> >
> > Gyula
> >
> > On Wed, Sep 6, 2023 at 8:46 AM Xintong Song 
> wrote:
> >
> > Thanks for bringing this up, Gyula.
> >
> > The proposed changes make sense to me. +1 for them.
> >
> > In addition to the proposed changes, I wonder if we should also add
> > something like timePerGc? This would help understand whether there are
> long
> > pauses, due to GC STW, that may lead to rpc unresponsiveness and
> heartbeat
> > timeouts. Ideally, we'd like to understand the max pause time per STW in
> a
> > recent time window. However, I don't see an easy way to separate the
> pause
> > time of each STW. Deriving the overall time per GC from the existing
> > metrics (time-increment / count-increment) seems to be a good
> alternative.
> > WDYT?
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Wed, Sep 6, 2023 at 2:16 PM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > Thanks for the clarification!
> >
> > By default the meterview measures for 1 minute sounds good to me!
> >
> > +1 for this proposal.
> >
> > Best,
> > Rui
> >
> > On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra  wrote:
> >
> > Thanks for the feedback Rui,
> >
> > The rates would be computed using the MeterView class (like for any
> > other
> > rate metric), just because we report the value per second it doesn't
> > mean
> > that we measure in a second granularity.
> > By default the meterview measures for 1 minute and then we calculate
> > the
> > per second rates, but we can increase the timespan if necessary.
> >
> > So I don't think we run into th

Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-06 Thread Gyula Fóra
Matt Wang,

I think the currently exposed info is all that is available through
GarbageCollectorMXBean. This FLIP does not aim to introduce a new more
granular way of reporting the per collector metrics, that would require a
new mechanism and may be a breaking change.

We basically want to simply extend the current reporting here with the rate
metrics and the total metrics.

Gyula

On Wed, Sep 6, 2023 at 9:24 AM Matt Wang  wrote:

> Hi Gyula,
>
> +1 for this proposal.
>
> Do we need to add a metric to record the count of different
> collectors? Now there is only a total count. For example,
> for G1, there is no way to distinguish whether it is the
> young generation or the old generation.
>
>
>
> --
>
> Best,
> Matt Wang
>
>
>  Replied Message 
> | From | Gyula Fóra |
> | Date | 09/6/2023 15:03 |
> | To |  |
> | Subject | Re: [DISCUSS] FLIP-361: Improve GC Metrics |
> Thanks Xintong!
>
> Just so I understand correctly, do you suggest adding a metric for
> delta(Time) / delta(Count) since the last reporting ?
> .TimePerGc or .AverageTime would make sense.
> AverageTime may be a bit nicer :)
>
> My only concern is how useful this will be in reality. If there are only
> (or several) long pauses then the msPerSec metrics will show it already,
> and if there is a single long pause that may not be shown at all if there
> are several shorter pauses as well with this metric.
>
> Gyula
>
> On Wed, Sep 6, 2023 at 8:46 AM Xintong Song  wrote:
>
> Thanks for bringing this up, Gyula.
>
> The proposed changes make sense to me. +1 for them.
>
> In addition to the proposed changes, I wonder if we should also add
> something like timePerGc? This would help understand whether there are long
> pauses, due to GC STW, that may lead to rpc unresponsiveness and heartbeat
> timeouts. Ideally, we'd like to understand the max pause time per STW in a
> recent time window. However, I don't see an easy way to separate the pause
> time of each STW. Deriving the overall time per GC from the existing
> metrics (time-increment / count-increment) seems to be a good alternative.
> WDYT?
>
> Best,
>
> Xintong
>
>
>
> On Wed, Sep 6, 2023 at 2:16 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> Thanks for the clarification!
>
> By default the meterview measures for 1 minute sounds good to me!
>
> +1 for this proposal.
>
> Best,
> Rui
>
> On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra  wrote:
>
> Thanks for the feedback Rui,
>
> The rates would be computed using the MeterView class (like for any
> other
> rate metric), just because we report the value per second it doesn't
> mean
> that we measure in a second granularity.
> By default the meterview measures for 1 minute and then we calculate
> the
> per second rates, but we can increase the timespan if necessary.
>
> So I don't think we run into this problem in practice and we can keep
> the
> metric aligned with other time rate metrics like busyTimeMsPerSec etc.
>
> Cheers,
> Gyula
>
> On Wed, Sep 6, 2023 at 4:55 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> Hi Gyula,
>
> +1 for this proposal. The current GC metric is really unfriendly.
>
> I have a concern with your proposed rate metric: the rate is
> perSecond
> instead of per minute. I'm unsure whether it's suitable for GC
> metric.
>
> There are two reasons why I suspect perSecond may not be well
> compatible with GC metric:
>
> 1. GCs are usually infrequent and may only occur for a small number
> of time periods within a minute.
>
> Metrics are collected periodically, for example, reported every
> minute.
> If the result reported by the GC metric is 1s/perSecond, it does not
> mean that the GC of the TM is serious, because there may be no GC
> in the remaining 59s.
>
> On the contrary, the GC metric reports 0s/perSecond, which does not
> mean that the GC of the TM is not serious, and the GC may be very
> serious in the remaining 59s.
>
> 2. Stop-the-world may cause the metric to fail(delay) to report
>
> The TM will stop the world during GC, especially full GC. It means
> the metric cannot be collected or reported during full GC.
>
> So the collected GC metric may never be 1s/perSecond. This metric
> may always be good because the metric will only be reported when
> the GC is not severe.
>
>
> If these concerns make sense, how about updating the GC rate
> at minute level?
>
> We can define the type to Gauge for TimeMsPerMiunte, and updating
> this Gauge every second, it is:
> GC Total.Time of current time - GC total time of one miunte ago.
>
> Best,
> Rui
>
> On Tue, Sep 5, 2023 at 11:05 PM Maximi

Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-06 Thread liu ron
Hi, Gyula

Thanks for driving this proposal, GC-related metrics are beneficial for us
to profile the root cause, +1 for this proposal.

Best,
Ron

Matt Wang  于2023年9月6日周三 15:24写道:

> Hi Gyula,
>
> +1 for this proposal.
>
> Do we need to add a metric to record the count of different
> collectors? Now there is only a total count. For example,
> for G1, there is no way to distinguish whether it is the
> young generation or the old generation.
>
>
>
> --
>
> Best,
> Matt Wang
>
>
>  Replied Message 
> | From | Gyula Fóra |
> | Date | 09/6/2023 15:03 |
> | To |  |
> | Subject | Re: [DISCUSS] FLIP-361: Improve GC Metrics |
> Thanks Xintong!
>
> Just so I understand correctly, do you suggest adding a metric for
> delta(Time) / delta(Count) since the last reporting ?
> .TimePerGc or .AverageTime would make sense.
> AverageTime may be a bit nicer :)
>
> My only concern is how useful this will be in reality. If there are only
> (or several) long pauses then the msPerSec metrics will show it already,
> and if there is a single long pause that may not be shown at all if there
> are several shorter pauses as well with this metric.
>
> Gyula
>
> On Wed, Sep 6, 2023 at 8:46 AM Xintong Song  wrote:
>
> Thanks for bringing this up, Gyula.
>
> The proposed changes make sense to me. +1 for them.
>
> In addition to the proposed changes, I wonder if we should also add
> something like timePerGc? This would help understand whether there are long
> pauses, due to GC STW, that may lead to rpc unresponsiveness and heartbeat
> timeouts. Ideally, we'd like to understand the max pause time per STW in a
> recent time window. However, I don't see an easy way to separate the pause
> time of each STW. Deriving the overall time per GC from the existing
> metrics (time-increment / count-increment) seems to be a good alternative.
> WDYT?
>
> Best,
>
> Xintong
>
>
>
> On Wed, Sep 6, 2023 at 2:16 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> Thanks for the clarification!
>
> By default the meterview measures for 1 minute sounds good to me!
>
> +1 for this proposal.
>
> Best,
> Rui
>
> On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra  wrote:
>
> Thanks for the feedback Rui,
>
> The rates would be computed using the MeterView class (like for any
> other
> rate metric), just because we report the value per second it doesn't
> mean
> that we measure in a second granularity.
> By default the meterview measures for 1 minute and then we calculate
> the
> per second rates, but we can increase the timespan if necessary.
>
> So I don't think we run into this problem in practice and we can keep
> the
> metric aligned with other time rate metrics like busyTimeMsPerSec etc.
>
> Cheers,
> Gyula
>
> On Wed, Sep 6, 2023 at 4:55 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> Hi Gyula,
>
> +1 for this proposal. The current GC metric is really unfriendly.
>
> I have a concern with your proposed rate metric: the rate is
> perSecond
> instead of per minute. I'm unsure whether it's suitable for GC
> metric.
>
> There are two reasons why I suspect perSecond may not be well
> compatible with GC metric:
>
> 1. GCs are usually infrequent and may only occur for a small number
> of time periods within a minute.
>
> Metrics are collected periodically, for example, reported every
> minute.
> If the result reported by the GC metric is 1s/perSecond, it does not
> mean that the GC of the TM is serious, because there may be no GC
> in the remaining 59s.
>
> On the contrary, the GC metric reports 0s/perSecond, which does not
> mean that the GC of the TM is not serious, and the GC may be very
> serious in the remaining 59s.
>
> 2. Stop-the-world may cause the metric to fail(delay) to report
>
> The TM will stop the world during GC, especially full GC. It means
> the metric cannot be collected or reported during full GC.
>
> So the collected GC metric may never be 1s/perSecond. This metric
> may always be good because the metric will only be reported when
> the GC is not severe.
>
>
> If these concerns make sense, how about updating the GC rate
> at minute level?
>
> We can define the type to Gauge for TimeMsPerMiunte, and updating
> this Gauge every second, it is:
> GC Total.Time of current time - GC total time of one miunte ago.
>
> Best,
> Rui
>
> On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels 
> wrote:
>
> Hi Gyula,
>
> +1 The proposed changes make sense and are in line with what is
> available for other metrics, e.g. number of records processed.
>
> -Max
>
> On Tue, Sep 5, 2023 at 2:43 PM Gyula 

Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-06 Thread Matt Wang
Hi Gyula,

+1 for this proposal.

Do we need to add a metric to record the count of different 
collectors? Now there is only a total count. For example, 
for G1, there is no way to distinguish whether it is the 
young generation or the old generation.



--

Best,
Matt Wang


 Replied Message 
| From | Gyula Fóra |
| Date | 09/6/2023 15:03 |
| To |  |
| Subject | Re: [DISCUSS] FLIP-361: Improve GC Metrics |
Thanks Xintong!

Just so I understand correctly, do you suggest adding a metric for
delta(Time) / delta(Count) since the last reporting ?
.TimePerGc or .AverageTime would make sense.
AverageTime may be a bit nicer :)

My only concern is how useful this will be in reality. If there are only
(or several) long pauses then the msPerSec metrics will show it already,
and if there is a single long pause that may not be shown at all if there
are several shorter pauses as well with this metric.

Gyula

On Wed, Sep 6, 2023 at 8:46 AM Xintong Song  wrote:

Thanks for bringing this up, Gyula.

The proposed changes make sense to me. +1 for them.

In addition to the proposed changes, I wonder if we should also add
something like timePerGc? This would help understand whether there are long
pauses, due to GC STW, that may lead to rpc unresponsiveness and heartbeat
timeouts. Ideally, we'd like to understand the max pause time per STW in a
recent time window. However, I don't see an easy way to separate the pause
time of each STW. Deriving the overall time per GC from the existing
metrics (time-increment / count-increment) seems to be a good alternative.
WDYT?

Best,

Xintong



On Wed, Sep 6, 2023 at 2:16 PM Rui Fan <1996fan...@gmail.com> wrote:

Thanks for the clarification!

By default the meterview measures for 1 minute sounds good to me!

+1 for this proposal.

Best,
Rui

On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra  wrote:

Thanks for the feedback Rui,

The rates would be computed using the MeterView class (like for any
other
rate metric), just because we report the value per second it doesn't
mean
that we measure in a second granularity.
By default the meterview measures for 1 minute and then we calculate
the
per second rates, but we can increase the timespan if necessary.

So I don't think we run into this problem in practice and we can keep
the
metric aligned with other time rate metrics like busyTimeMsPerSec etc.

Cheers,
Gyula

On Wed, Sep 6, 2023 at 4:55 AM Rui Fan <1996fan...@gmail.com> wrote:

Hi Gyula,

+1 for this proposal. The current GC metric is really unfriendly.

I have a concern with your proposed rate metric: the rate is
perSecond
instead of per minute. I'm unsure whether it's suitable for GC
metric.

There are two reasons why I suspect perSecond may not be well
compatible with GC metric:

1. GCs are usually infrequent and may only occur for a small number
of time periods within a minute.

Metrics are collected periodically, for example, reported every
minute.
If the result reported by the GC metric is 1s/perSecond, it does not
mean that the GC of the TM is serious, because there may be no GC
in the remaining 59s.

On the contrary, the GC metric reports 0s/perSecond, which does not
mean that the GC of the TM is not serious, and the GC may be very
serious in the remaining 59s.

2. Stop-the-world may cause the metric to fail(delay) to report

The TM will stop the world during GC, especially full GC. It means
the metric cannot be collected or reported during full GC.

So the collected GC metric may never be 1s/perSecond. This metric
may always be good because the metric will only be reported when
the GC is not severe.


If these concerns make sense, how about updating the GC rate
at minute level?

We can define the type to Gauge for TimeMsPerMiunte, and updating
this Gauge every second, it is:
GC Total.Time of current time - GC total time of one miunte ago.

Best,
Rui

On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels 
wrote:

Hi Gyula,

+1 The proposed changes make sense and are in line with what is
available for other metrics, e.g. number of records processed.

-Max

On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra 
wrote:

Hi Devs,

I would like to start a discussion on FLIP-361: Improve GC
Metrics
[1].

The current Flink GC metrics [2] are not very useful for
monitoring
purposes as they require post processing logic that is also
dependent
on
the current runtime environment.

Problems:
- Total time is not very relevant for long running applications,
only
the
rate of change (msPerSec)
- In most cases it's best to simply aggregate the time/count
across
the
different GabrageCollectors, however the specific collectors are
dependent
on the current Java runtime

We propose to improve the current situation by:
- Exposing rate metrics per GarbageCollector
- Exposing aggregated Total time/count/rate metrics

These new metrics are all derived from the existing ones with
minimal
overhead.

Looking forward to your feedback.

Cheers,
Gyula

[1]





https://cwiki.

Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-06 Thread Gyula Fóra
Thanks Xintong!

Just so I understand correctly, do you suggest adding a metric for
delta(Time) / delta(Count) since the last reporting ?
.TimePerGc or .AverageTime would make sense.
AverageTime may be a bit nicer :)

My only concern is how useful this will be in reality. If there are only
(or several) long pauses then the msPerSec metrics will show it already,
and if there is a single long pause that may not be shown at all if there
are several shorter pauses as well with this metric.

Gyula

On Wed, Sep 6, 2023 at 8:46 AM Xintong Song  wrote:

> Thanks for bringing this up, Gyula.
>
> The proposed changes make sense to me. +1 for them.
>
> In addition to the proposed changes, I wonder if we should also add
> something like timePerGc? This would help understand whether there are long
> pauses, due to GC STW, that may lead to rpc unresponsiveness and heartbeat
> timeouts. Ideally, we'd like to understand the max pause time per STW in a
> recent time window. However, I don't see an easy way to separate the pause
> time of each STW. Deriving the overall time per GC from the existing
> metrics (time-increment / count-increment) seems to be a good alternative.
> WDYT?
>
> Best,
>
> Xintong
>
>
>
> On Wed, Sep 6, 2023 at 2:16 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Thanks for the clarification!
> >
> > By default the meterview measures for 1 minute sounds good to me!
> >
> > +1 for this proposal.
> >
> > Best,
> > Rui
> >
> > On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra  wrote:
> >
> > > Thanks for the feedback Rui,
> > >
> > > The rates would be computed using the MeterView class (like for any
> other
> > > rate metric), just because we report the value per second it doesn't
> mean
> > > that we measure in a second granularity.
> > > By default the meterview measures for 1 minute and then we calculate
> the
> > > per second rates, but we can increase the timespan if necessary.
> > >
> > > So I don't think we run into this problem in practice and we can keep
> the
> > > metric aligned with other time rate metrics like busyTimeMsPerSec etc.
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Wed, Sep 6, 2023 at 4:55 AM Rui Fan <1996fan...@gmail.com> wrote:
> > >
> > > > Hi Gyula,
> > > >
> > > > +1 for this proposal. The current GC metric is really unfriendly.
> > > >
> > > > I have a concern with your proposed rate metric: the rate is
> perSecond
> > > > instead of per minute. I'm unsure whether it's suitable for GC
> metric.
> > > >
> > > > There are two reasons why I suspect perSecond may not be well
> > > > compatible with GC metric:
> > > >
> > > > 1. GCs are usually infrequent and may only occur for a small number
> > > > of time periods within a minute.
> > > >
> > > > Metrics are collected periodically, for example, reported every
> minute.
> > > > If the result reported by the GC metric is 1s/perSecond, it does not
> > > > mean that the GC of the TM is serious, because there may be no GC
> > > > in the remaining 59s.
> > > >
> > > > On the contrary, the GC metric reports 0s/perSecond, which does not
> > > > mean that the GC of the TM is not serious, and the GC may be very
> > > > serious in the remaining 59s.
> > > >
> > > > 2. Stop-the-world may cause the metric to fail(delay) to report
> > > >
> > > > The TM will stop the world during GC, especially full GC. It means
> > > > the metric cannot be collected or reported during full GC.
> > > >
> > > > So the collected GC metric may never be 1s/perSecond. This metric
> > > > may always be good because the metric will only be reported when
> > > > the GC is not severe.
> > > >
> > > >
> > > > If these concerns make sense, how about updating the GC rate
> > > > at minute level?
> > > >
> > > > We can define the type to Gauge for TimeMsPerMiunte, and updating
> > > > this Gauge every second, it is:
> > > > GC Total.Time of current time - GC total time of one miunte ago.
> > > >
> > > > Best,
> > > > Rui
> > > >
> > > > On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels 
> > > wrote:
> > > >
> > > > > Hi Gyula,
> > > > >
> > > > > +1 The proposed changes make sense and are in line with what is
> > > > > available for other metrics, e.g. number of records processed.
> > > > >
> > > > > -Max
> > > > >
> > > > > On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra 
> > > wrote:
> > > > > >
> > > > > > Hi Devs,
> > > > > >
> > > > > > I would like to start a discussion on FLIP-361: Improve GC
> Metrics
> > > [1].
> > > > > >
> > > > > > The current Flink GC metrics [2] are not very useful for
> monitoring
> > > > > > purposes as they require post processing logic that is also
> > dependent
> > > > on
> > > > > > the current runtime environment.
> > > > > >
> > > > > > Problems:
> > > > > >  - Total time is not very relevant for long running applications,
> > > only
> > > > > the
> > > > > > rate of change (msPerSec)
> > > > > >  - In most cases it's best to simply aggregate the time/count
> > across
> > > > the
> > > > > > different GabrageCollectors, however the specific co

Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-05 Thread Xintong Song
Thanks for bringing this up, Gyula.

The proposed changes make sense to me. +1 for them.

In addition to the proposed changes, I wonder if we should also add
something like timePerGc? This would help understand whether there are long
pauses, due to GC STW, that may lead to rpc unresponsiveness and heartbeat
timeouts. Ideally, we'd like to understand the max pause time per STW in a
recent time window. However, I don't see an easy way to separate the pause
time of each STW. Deriving the overall time per GC from the existing
metrics (time-increment / count-increment) seems to be a good alternative.
WDYT?

Best,

Xintong



On Wed, Sep 6, 2023 at 2:16 PM Rui Fan <1996fan...@gmail.com> wrote:

> Thanks for the clarification!
>
> By default the meterview measures for 1 minute sounds good to me!
>
> +1 for this proposal.
>
> Best,
> Rui
>
> On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra  wrote:
>
> > Thanks for the feedback Rui,
> >
> > The rates would be computed using the MeterView class (like for any other
> > rate metric), just because we report the value per second it doesn't mean
> > that we measure in a second granularity.
> > By default the meterview measures for 1 minute and then we calculate the
> > per second rates, but we can increase the timespan if necessary.
> >
> > So I don't think we run into this problem in practice and we can keep the
> > metric aligned with other time rate metrics like busyTimeMsPerSec etc.
> >
> > Cheers,
> > Gyula
> >
> > On Wed, Sep 6, 2023 at 4:55 AM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > Hi Gyula,
> > >
> > > +1 for this proposal. The current GC metric is really unfriendly.
> > >
> > > I have a concern with your proposed rate metric: the rate is perSecond
> > > instead of per minute. I'm unsure whether it's suitable for GC metric.
> > >
> > > There are two reasons why I suspect perSecond may not be well
> > > compatible with GC metric:
> > >
> > > 1. GCs are usually infrequent and may only occur for a small number
> > > of time periods within a minute.
> > >
> > > Metrics are collected periodically, for example, reported every minute.
> > > If the result reported by the GC metric is 1s/perSecond, it does not
> > > mean that the GC of the TM is serious, because there may be no GC
> > > in the remaining 59s.
> > >
> > > On the contrary, the GC metric reports 0s/perSecond, which does not
> > > mean that the GC of the TM is not serious, and the GC may be very
> > > serious in the remaining 59s.
> > >
> > > 2. Stop-the-world may cause the metric to fail(delay) to report
> > >
> > > The TM will stop the world during GC, especially full GC. It means
> > > the metric cannot be collected or reported during full GC.
> > >
> > > So the collected GC metric may never be 1s/perSecond. This metric
> > > may always be good because the metric will only be reported when
> > > the GC is not severe.
> > >
> > >
> > > If these concerns make sense, how about updating the GC rate
> > > at minute level?
> > >
> > > We can define the type to Gauge for TimeMsPerMiunte, and updating
> > > this Gauge every second, it is:
> > > GC Total.Time of current time - GC total time of one miunte ago.
> > >
> > > Best,
> > > Rui
> > >
> > > On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels 
> > wrote:
> > >
> > > > Hi Gyula,
> > > >
> > > > +1 The proposed changes make sense and are in line with what is
> > > > available for other metrics, e.g. number of records processed.
> > > >
> > > > -Max
> > > >
> > > > On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra 
> > wrote:
> > > > >
> > > > > Hi Devs,
> > > > >
> > > > > I would like to start a discussion on FLIP-361: Improve GC Metrics
> > [1].
> > > > >
> > > > > The current Flink GC metrics [2] are not very useful for monitoring
> > > > > purposes as they require post processing logic that is also
> dependent
> > > on
> > > > > the current runtime environment.
> > > > >
> > > > > Problems:
> > > > >  - Total time is not very relevant for long running applications,
> > only
> > > > the
> > > > > rate of change (msPerSec)
> > > > >  - In most cases it's best to simply aggregate the time/count
> across
> > > the
> > > > > different GabrageCollectors, however the specific collectors are
> > > > dependent
> > > > > on the current Java runtime
> > > > >
> > > > > We propose to improve the current situation by:
> > > > >  - Exposing rate metrics per GarbageCollector
> > > > >  - Exposing aggregated Total time/count/rate metrics
> > > > >
> > > > > These new metrics are all derived from the existing ones with
> minimal
> > > > > overhead.
> > > > >
> > > > > Looking forward to your feedback.
> > > > >
> > > > > Cheers,
> > > > > Gyula
> > > > >
> > > > > [1]
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
> > > > > [2]
> > > > >
> > > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#garbagecollection
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-05 Thread Rui Fan
Thanks for the clarification!

By default the meterview measures for 1 minute sounds good to me!

+1 for this proposal.

Best,
Rui

On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra  wrote:

> Thanks for the feedback Rui,
>
> The rates would be computed using the MeterView class (like for any other
> rate metric), just because we report the value per second it doesn't mean
> that we measure in a second granularity.
> By default the meterview measures for 1 minute and then we calculate the
> per second rates, but we can increase the timespan if necessary.
>
> So I don't think we run into this problem in practice and we can keep the
> metric aligned with other time rate metrics like busyTimeMsPerSec etc.
>
> Cheers,
> Gyula
>
> On Wed, Sep 6, 2023 at 4:55 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi Gyula,
> >
> > +1 for this proposal. The current GC metric is really unfriendly.
> >
> > I have a concern with your proposed rate metric: the rate is perSecond
> > instead of per minute. I'm unsure whether it's suitable for GC metric.
> >
> > There are two reasons why I suspect perSecond may not be well
> > compatible with GC metric:
> >
> > 1. GCs are usually infrequent and may only occur for a small number
> > of time periods within a minute.
> >
> > Metrics are collected periodically, for example, reported every minute.
> > If the result reported by the GC metric is 1s/perSecond, it does not
> > mean that the GC of the TM is serious, because there may be no GC
> > in the remaining 59s.
> >
> > On the contrary, the GC metric reports 0s/perSecond, which does not
> > mean that the GC of the TM is not serious, and the GC may be very
> > serious in the remaining 59s.
> >
> > 2. Stop-the-world may cause the metric to fail(delay) to report
> >
> > The TM will stop the world during GC, especially full GC. It means
> > the metric cannot be collected or reported during full GC.
> >
> > So the collected GC metric may never be 1s/perSecond. This metric
> > may always be good because the metric will only be reported when
> > the GC is not severe.
> >
> >
> > If these concerns make sense, how about updating the GC rate
> > at minute level?
> >
> > We can define the type to Gauge for TimeMsPerMiunte, and updating
> > this Gauge every second, it is:
> > GC Total.Time of current time - GC total time of one miunte ago.
> >
> > Best,
> > Rui
> >
> > On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels 
> wrote:
> >
> > > Hi Gyula,
> > >
> > > +1 The proposed changes make sense and are in line with what is
> > > available for other metrics, e.g. number of records processed.
> > >
> > > -Max
> > >
> > > On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra 
> wrote:
> > > >
> > > > Hi Devs,
> > > >
> > > > I would like to start a discussion on FLIP-361: Improve GC Metrics
> [1].
> > > >
> > > > The current Flink GC metrics [2] are not very useful for monitoring
> > > > purposes as they require post processing logic that is also dependent
> > on
> > > > the current runtime environment.
> > > >
> > > > Problems:
> > > >  - Total time is not very relevant for long running applications,
> only
> > > the
> > > > rate of change (msPerSec)
> > > >  - In most cases it's best to simply aggregate the time/count across
> > the
> > > > different GabrageCollectors, however the specific collectors are
> > > dependent
> > > > on the current Java runtime
> > > >
> > > > We propose to improve the current situation by:
> > > >  - Exposing rate metrics per GarbageCollector
> > > >  - Exposing aggregated Total time/count/rate metrics
> > > >
> > > > These new metrics are all derived from the existing ones with minimal
> > > > overhead.
> > > >
> > > > Looking forward to your feedback.
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > [1]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
> > > > [2]
> > > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#garbagecollection
> > >
> >
>


Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-05 Thread Gyula Fóra
Thanks for the feedback Rui,

The rates would be computed using the MeterView class (like for any other
rate metric), just because we report the value per second it doesn't mean
that we measure in a second granularity.
By default the meterview measures for 1 minute and then we calculate the
per second rates, but we can increase the timespan if necessary.

So I don't think we run into this problem in practice and we can keep the
metric aligned with other time rate metrics like busyTimeMsPerSec etc.

Cheers,
Gyula

On Wed, Sep 6, 2023 at 4:55 AM Rui Fan <1996fan...@gmail.com> wrote:

> Hi Gyula,
>
> +1 for this proposal. The current GC metric is really unfriendly.
>
> I have a concern with your proposed rate metric: the rate is perSecond
> instead of per minute. I'm unsure whether it's suitable for GC metric.
>
> There are two reasons why I suspect perSecond may not be well
> compatible with GC metric:
>
> 1. GCs are usually infrequent and may only occur for a small number
> of time periods within a minute.
>
> Metrics are collected periodically, for example, reported every minute.
> If the result reported by the GC metric is 1s/perSecond, it does not
> mean that the GC of the TM is serious, because there may be no GC
> in the remaining 59s.
>
> On the contrary, the GC metric reports 0s/perSecond, which does not
> mean that the GC of the TM is not serious, and the GC may be very
> serious in the remaining 59s.
>
> 2. Stop-the-world may cause the metric to fail(delay) to report
>
> The TM will stop the world during GC, especially full GC. It means
> the metric cannot be collected or reported during full GC.
>
> So the collected GC metric may never be 1s/perSecond. This metric
> may always be good because the metric will only be reported when
> the GC is not severe.
>
>
> If these concerns make sense, how about updating the GC rate
> at minute level?
>
> We can define the type to Gauge for TimeMsPerMiunte, and updating
> this Gauge every second, it is:
> GC Total.Time of current time - GC total time of one miunte ago.
>
> Best,
> Rui
>
> On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels  wrote:
>
> > Hi Gyula,
> >
> > +1 The proposed changes make sense and are in line with what is
> > available for other metrics, e.g. number of records processed.
> >
> > -Max
> >
> > On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra  wrote:
> > >
> > > Hi Devs,
> > >
> > > I would like to start a discussion on FLIP-361: Improve GC Metrics [1].
> > >
> > > The current Flink GC metrics [2] are not very useful for monitoring
> > > purposes as they require post processing logic that is also dependent
> on
> > > the current runtime environment.
> > >
> > > Problems:
> > >  - Total time is not very relevant for long running applications, only
> > the
> > > rate of change (msPerSec)
> > >  - In most cases it's best to simply aggregate the time/count across
> the
> > > different GabrageCollectors, however the specific collectors are
> > dependent
> > > on the current Java runtime
> > >
> > > We propose to improve the current situation by:
> > >  - Exposing rate metrics per GarbageCollector
> > >  - Exposing aggregated Total time/count/rate metrics
> > >
> > > These new metrics are all derived from the existing ones with minimal
> > > overhead.
> > >
> > > Looking forward to your feedback.
> > >
> > > Cheers,
> > > Gyula
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
> > > [2]
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#garbagecollection
> >
>


Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-05 Thread Rui Fan
Hi Gyula,

+1 for this proposal. The current GC metric is really unfriendly.

I have a concern with your proposed rate metric: the rate is perSecond
instead of per minute. I'm unsure whether it's suitable for GC metric.

There are two reasons why I suspect perSecond may not be well
compatible with GC metric:

1. GCs are usually infrequent and may only occur for a small number
of time periods within a minute.

Metrics are collected periodically, for example, reported every minute.
If the result reported by the GC metric is 1s/perSecond, it does not
mean that the GC of the TM is serious, because there may be no GC
in the remaining 59s.

On the contrary, the GC metric reports 0s/perSecond, which does not
mean that the GC of the TM is not serious, and the GC may be very
serious in the remaining 59s.

2. Stop-the-world may cause the metric to fail(delay) to report

The TM will stop the world during GC, especially full GC. It means
the metric cannot be collected or reported during full GC.

So the collected GC metric may never be 1s/perSecond. This metric
may always be good because the metric will only be reported when
the GC is not severe.


If these concerns make sense, how about updating the GC rate
at minute level?

We can define the type to Gauge for TimeMsPerMiunte, and updating
this Gauge every second, it is:
GC Total.Time of current time - GC total time of one miunte ago.

Best,
Rui

On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels  wrote:

> Hi Gyula,
>
> +1 The proposed changes make sense and are in line with what is
> available for other metrics, e.g. number of records processed.
>
> -Max
>
> On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra  wrote:
> >
> > Hi Devs,
> >
> > I would like to start a discussion on FLIP-361: Improve GC Metrics [1].
> >
> > The current Flink GC metrics [2] are not very useful for monitoring
> > purposes as they require post processing logic that is also dependent on
> > the current runtime environment.
> >
> > Problems:
> >  - Total time is not very relevant for long running applications, only
> the
> > rate of change (msPerSec)
> >  - In most cases it's best to simply aggregate the time/count across the
> > different GabrageCollectors, however the specific collectors are
> dependent
> > on the current Java runtime
> >
> > We propose to improve the current situation by:
> >  - Exposing rate metrics per GarbageCollector
> >  - Exposing aggregated Total time/count/rate metrics
> >
> > These new metrics are all derived from the existing ones with minimal
> > overhead.
> >
> > Looking forward to your feedback.
> >
> > Cheers,
> > Gyula
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
> > [2]
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#garbagecollection
>


Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-05 Thread Maximilian Michels
Hi Gyula,

+1 The proposed changes make sense and are in line with what is
available for other metrics, e.g. number of records processed.

-Max

On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra  wrote:
>
> Hi Devs,
>
> I would like to start a discussion on FLIP-361: Improve GC Metrics [1].
>
> The current Flink GC metrics [2] are not very useful for monitoring
> purposes as they require post processing logic that is also dependent on
> the current runtime environment.
>
> Problems:
>  - Total time is not very relevant for long running applications, only the
> rate of change (msPerSec)
>  - In most cases it's best to simply aggregate the time/count across the
> different GabrageCollectors, however the specific collectors are dependent
> on the current Java runtime
>
> We propose to improve the current situation by:
>  - Exposing rate metrics per GarbageCollector
>  - Exposing aggregated Total time/count/rate metrics
>
> These new metrics are all derived from the existing ones with minimal
> overhead.
>
> Looking forward to your feedback.
>
> Cheers,
> Gyula
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
> [2]
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#garbagecollection


[DISCUSS] FLIP-361: Improve GC Metrics

2023-09-05 Thread Gyula Fóra
Hi Devs,

I would like to start a discussion on FLIP-361: Improve GC Metrics [1].

The current Flink GC metrics [2] are not very useful for monitoring
purposes as they require post processing logic that is also dependent on
the current runtime environment.

Problems:
 - Total time is not very relevant for long running applications, only the
rate of change (msPerSec)
 - In most cases it's best to simply aggregate the time/count across the
different GabrageCollectors, however the specific collectors are dependent
on the current Java runtime

We propose to improve the current situation by:
 - Exposing rate metrics per GarbageCollector
 - Exposing aggregated Total time/count/rate metrics

These new metrics are all derived from the existing ones with minimal
overhead.

Looking forward to your feedback.

Cheers,
Gyula

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#garbagecollection