Vova,

1) We can gain collections info from GarbageCollectorMXBean
But it provides only
- collectionCount
- collectionTime.

This is very interesting metrics, but they tell us nothing about long STW.
Long STW means we have huge latency during STW and we should find the
reason and solve it.

2) So, we're working on new incremental metrics
- total amount of STW longer than XXX
- total duration of STW longer than XXX

which shows us JVM/GC health situation.

Is it answer to your question?

On Tue, Nov 21, 2017 at 9:05 PM, Vladimir Ozerov <[email protected]>
wrote:

> Anton,
>
> The question is why user may need so precise measurement? I share Andrey’s
> opinion - cannot understand the value.
>
> вт, 21 нояб. 2017 г. в 19:33, Anton Vinogradov <[email protected]>:
>
> > Andrey,
> >
> > >  JVM provides sufficient means of detecting a struggling process out of
> > the box.
> >
> > Could you point to some articles describing how to detect STW exceeding
> > some duration using only JVM API?
> >
> > On Tue, Nov 21, 2017 at 7:17 PM, Andrey Kornev <[email protected]
> >
> > wrote:
> >
> > > My 2 cents. Don’t do it. JVM provides sufficient means of detecting a
> > > struggling process out of the box. SRE/Operations teams usually know
> how
> > to
> > > monitor JVMs and can handle killing of such processes themselves.
> > >
> > > The feature adds no value, just complexity (and more configuration
> > > parameters (!) — as if Ignite didn’t have enough of them already).
> > >
> > > Regards,
> > > Andrey
> > > _____________________________
> > > From: Denis Magda <[email protected]>
> > > Sent: Monday, November 20, 2017 3:10 PM
> > > Subject: Re: Facility to detect long STW pauses and other system
> response
> > > degradations
> > > To: <[email protected]>
> > >
> > >
> > > My 2 cents.
> > >
> > > 1. Totally for a separate native process that will handle the
> monitoring
> > > of an Ignite process. The watchdog process can simply start a JVM tool
> > like
> > > jstat and parse its GC logs: https://dzone.com/articles/
> > > how-monitor-java-garbage <https://dzone.com/articles/
> > > how-monitor-java-garbage>
> > >
> > > 2. As for the STW handling, I would make a possible reaction more
> > generic.
> > > Let’s define a policy (enumeration) that will define how to deal with
> an
> > > unstable node. The events might be as follows - kill a node, restart a
> > > node, trigger a custom script using Runtime.exec or other methods.
> > >
> > > What’d you think? Specifically on point 2.
> > >
> > > —
> > > Denis
> > >
> > > > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <
> > [email protected]>
> > > wrote:
> > > >
> > > > Yakov,
> > > >
> > > > Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> > > >
> > > > We split issue to
> > > > #1 STW duration metrics
> > > > #2 External monitoring allows to stop node during STW
> > > >
> > > >> Testing GC pause with java thread is
> > > >> a bit strange and can give info only after GC pause finishes.
> > > >
> > > > That's ok since it's #1
> > > >
> > > > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <
> > > [email protected]>
> > > > wrote:
> > > >
> > > >> I have tested solution with java-thread and GC logs had contain same
> > > pause
> > > >> values of thread stopping which was detected by java-thread.
> > > >>
> > > >>
> > > >> My log (contains pauses > 100ms):
> > > >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long
> STW
> > > >> pause: 507 milliseconds.
> > > >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long
> STW
> > > >> pause: 5595 milliseconds.
> > > >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long
> STW
> > > >> pause: 3262 milliseconds.
> > > >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long
> STW
> > > >> pause: 1737 milliseconds.
> > > >>
> > > >> GC log:
> > > >> gridgain@dell-5580-92zc8h2:~$ cat
> > > >> ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> > > >> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which
> application
> > > >> threads were stopped: 0,0000845 seconds, Stopping threads took:
> > > 0,0000246
> > > >> seconds
> > > >> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which
> application
> > > >> threads were stopped: 0,0001072 seconds, Stopping threads took:
> > > 0,0000252
> > > >> seconds
> > > >> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which
> application
> > > >> threads were stopped: 0,5001082 seconds, Stopping threads took:
> > > 0,0000178
> > > >> seconds // GOT!
> > > >> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which
> application
> > > >> threads were stopped: 5,5856603 seconds, Stopping threads took:
> > > 0,0000229
> > > >> seconds // GOT!
> > > >> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which
> application
> > > >> threads were stopped: 3,2595700 seconds, Stopping threads took:
> > > 0,0000223
> > > >> seconds // GOT!
> > > >> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which
> application
> > > >> threads were stopped: 1,7337123 seconds, Stopping threads took:
> > > 0,0000121
> > > >> seconds // GOT!
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
> > > >>
> > >
> > >
> > >
> > >
> >
>

Reply via email to