Don't forget that high CPU utilization can occur for reasons other than GC STW pauses, and GC log parsing will not help us in that case.
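A java-thread-based detector (the approach Dmitriy describes in the quoted thread below) catches any long stall regardless of its cause, because all it observes is the gap between wake-ups. Here is a minimal sketch in plain Java; the class name and thresholds are mine, illustrative only, not Ignite API:

import java.util.concurrent.TimeUnit;

/**
 * Minimal sketch (not Ignite API): a daemon thread that measures the gap
 * between wake-ups. A long gap means the whole JVM was stalled -- by GC STW,
 * swapping, CPU starvation, etc. -- the thread cannot tell the causes apart.
 */
public class PauseDetector extends Thread {
    private static final long SLEEP_MS = 50;      // Sampling interval.
    private static final long THRESHOLD_MS = 100; // Report pauses above this.

    public PauseDetector() {
        super("pause-detector");
        setDaemon(true); // Must not keep the JVM alive.
    }

    @Override public void run() {
        long prev = System.nanoTime();

        while (!isInterrupted()) {
            try {
                Thread.sleep(SLEEP_MS);
            }
            catch (InterruptedException ignored) {
                return;
            }

            long now = System.nanoTime();

            // Anything beyond the requested sleep is a stall of the whole JVM.
            long pauseMs = TimeUnit.NANOSECONDS.toMillis(now - prev) - SLEEP_MS;

            if (pauseMs > THRESHOLD_MS)
                System.err.println("Possible too long STW pause: " + pauseMs + " milliseconds.");

            prev = now;
        }
    }
}

Sketches of the reaction policy and of GC log parsing follow after the quoted thread.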
On Tue, Nov 21, 2017 at 13:06, Anton Vinogradov [via Apache Ignite Developers] <ml+s2346864n24497...@n4.nabble.com> wrote:

> Denis,
>
> > 1. Totally for a separate native process that will handle the monitoring
> > of an Ignite process. The watchdog process can simply start a JVM tool
> > like jstat and parse its GC logs:
> > https://dzone.com/articles/how-monitor-java-garbage
>
> Different GCs, and even the same GC on a different OS/JVM, produce
> different logs, which are not easy to parse. But since http://gceasy.io
> can do it, it looks to be possible somehow :). Do you know of any libs or
> solutions that allow doing this in real time?
>
> > 2. As for the STW handling, I would make a possible reaction more
> > generic. Let's define a policy (enumeration) that will define how to
> > deal with an unstable node. The events might be as follows: kill a node,
> > restart a node, trigger a custom script using Runtime.exec or other
> > methods.
>
> Yes, it should be similar to the segmentation policy, plus custom script
> execution.
>
> On Tue, Nov 21, 2017 at 2:10 AM, Denis Magda <[hidden email]> wrote:
>
> > My 2 cents.
> >
> > 1. Totally for a separate native process that will handle the monitoring
> > of an Ignite process. The watchdog process can simply start a JVM tool
> > like jstat and parse its GC logs:
> > https://dzone.com/articles/how-monitor-java-garbage
> >
> > 2. As for the STW handling, I would make a possible reaction more
> > generic. Let's define a policy (enumeration) that will define how to
> > deal with an unstable node. The events might be as follows: kill a node,
> > restart a node, trigger a custom script using Runtime.exec or other
> > methods.
> >
> > What do you think? Specifically on point 2.
> >
> > — Denis
> >
> > > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <[hidden email]> wrote:
> > >
> > > Yakov,
> > >
> > > The issue is https://issues.apache.org/jira/browse/IGNITE-6171
> > >
> > > We split the issue into:
> > > #1 STW duration metrics
> > > #2 External monitoring that allows stopping a node during an STW pause
> > >
> > >> Testing GC pauses with a java thread is a bit strange and can give
> > >> info only after the GC pause finishes.
> > >
> > > That's OK, since that concerns #1.
> > >
> > > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <[hidden email]> wrote:
> > >
> > >> I have tested the solution with a java thread, and the GC logs contain
> > >> the same pause values as the thread stoppages detected by the java
> > >> thread.
> > >>
> > >> My log (contains pauses > 100ms):
> > >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long STW pause: 507 milliseconds.
> > >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long STW pause: 5595 milliseconds.
> > >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long STW pause: 3262 milliseconds.
> > >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long STW pause: 1737 milliseconds.
> > >>
> > >> GC log:
> > >> gridgain@dell-5580-92zc8h2:~$ cat ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> > >> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which application threads were stopped: 0,0000845 seconds, Stopping threads took: 0,0000246 seconds
> > >> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which application threads were stopped: 0,0001072 seconds, Stopping threads took: 0,0000252 seconds
> > >> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which application threads were stopped: 0,5001082 seconds, Stopping threads took: 0,0000178 seconds // GOT!
> > >> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which application threads were stopped: 5,5856603 seconds, Stopping threads took: 0,0000229 seconds // GOT!
> > >> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which application threads were stopped: 3,2595700 seconds, Stopping threads took: 0,0000223 seconds // GOT!
> > >> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which application threads were stopped: 1,7337123 seconds, Stopping threads took: 0,0000121 seconds // GOT!
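Regarding Denis's point 2 above: the policy could be a plain enum plus a handler in the watchdog. A hypothetical sketch; none of these names exist in Ignite today:

import java.io.IOException;

/** Hypothetical sketch of the reaction policy; all names are illustrative. */
public class UnstableNodePolicy {
    /** Possible reactions to an unstable node. */
    public enum Action { KILL, RESTART, EXEC_CUSTOM_SCRIPT }

    /** How a watchdog might apply the configured action. */
    public static void apply(Action action, String script) throws IOException {
        switch (action) {
            case KILL:
                Runtime.getRuntime().halt(1); // Hard exit, skips shutdown hooks.
                break;

            case RESTART:
                // An in-process restart is unreliable during a long STW pause,
                // so the separate watchdog process would re-launch the node.
                break;

            case EXEC_CUSTOM_SCRIPT:
                new ProcessBuilder(script).start(); // The Runtime.exec-style hook.
                break;
        }
    }
}

The shape mirrors the segmentation policy: a fixed set of reactions plus a custom-script escape hatch.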
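And on parsing the "Total time for which application threads were stopped" lines above in real time: a rough sketch of my own, assuming the -XX:+PrintGCApplicationStoppedTime output format shown in Dmitriy's log (note the locale-dependent decimal commas). A production watchdog would tail the file rather than read it once:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Rough sketch (not a library): scan a GC log for long safepoint pauses. */
public class GcLogScanner {
    // The JVM prints seconds with the platform locale, so the decimal
    // separator may be ',' (as in the log above) or '.'.
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: (\\d+)[.,](\\d+) seconds");

    public static void main(String[] args) throws IOException {
        long thresholdMs = 100;

        try (BufferedReader rdr = new BufferedReader(new FileReader(args[0]))) {
            for (String line; (line = rdr.readLine()) != null; ) {
                Matcher m = STOPPED.matcher(line);

                if (m.find()) {
                    // Whole seconds plus the first three fractional digits as millis.
                    long ms = Long.parseLong(m.group(1)) * 1000
                        + Long.parseLong((m.group(2) + "000").substring(0, 3));

                    if (ms > thresholdMs)
                        System.out.println("Long STW pause (" + ms + " ms): " + line);
                }
            }
        }
    }
}

Note that this only reports a pause after it has already finished, which is exactly Yakov's objection in the thread; to stop a node during a pause, the watchdog has to be a separate process.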