Vova,

I'll refactor IEP-7 [1], most likely merging it with IEP-5 [2], and let you
know when the overall design is ready and clear :)

[1]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection#IEP-7:Igniteinternalproblemsdetection-SystemThreadRegestry
.
[2]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-5+Cluster+reaction+if+node+detects+an+extraordinary+situations

On Wed, Nov 15, 2017 at 3:21 PM, Vladimir Ozerov <voze...@gridgain.com>
wrote:

> It would be nice to see the whole design first before going into low-level
> details. Without it we are jumping from topic to topic. Was the list of
> events and the reactions to these events discussed previously? At this point
> it is not clear why nodes should be forcefully stopped without any
> alternative.
>
> For example, consider the following cases:
> 1) The exchange thread died. This is a critical situation. But as part of
> the analysis, the administrator might want to dump threads before killing
> the node. He can do that programmatically, which is difficult and requires
> knowledge of Java, or through management utilities such as jstack or
> VisualVM. Which is more user-friendly?
> 2) We start a service with multiple data regions. One data region is
> configured incorrectly, which causes IOOME (IgniteOutOfMemoryException) on
> multiple nodes. Why do you think that the whole cluster (or many nodes)
> should be restarted? This means potential data loss in all caches (not only
> the affected ones) and an interruption of service. Instead, the
> administrator might decide to gradually reconfigure and restart nodes one
> by one, rather than killing them all immediately.
>
> This is why we need the design first.
>
> On Wed, Nov 15, 2017 at 2:39 PM, Anton Vinogradov <
> avinogra...@gridgain.com>
> wrote:
>
> > According to [1]
> >
> > Reasons are:
> > - IgniteOutOfMemoryException
> > - Persistence errors
> > - ExchangeWorker exits with error
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
> >
> > On Wed, Nov 15, 2017 at 2:24 PM, Vladimir Ozerov <voze...@gridgain.com>
> > wrote:
> >
> > > I am not sure I understand how the tasks are split. How can we discuss
> > > graceful shutdown without discussing the reasons for this shutdown? What
> > > leads to it?
> > >
> > > On Wed, Nov 15, 2017 at 2:10 PM, Anton Vinogradov <
> > > avinogra...@gridgain.com>
> > > wrote:
> > >
> > > > Vova,
> > > >
> > > > > Currently we have a lot of IEPs to improve grid monitoring and
> > > > > behavior.
> > > > >
> > > > > Let's split the tasks into:
> > > > >
> > > > > 1) Graceful shutdown.
> > > > > In this case we'd like to give the user the ability to do something;
> > > > > LifecycleBean is what we are looking for, thanks for the tips!
> > > > > But we have to keep the shutdown reason somewhere.
> > > > > If you know where it is already kept, please let us know.
> > > >
> > > > > 2) OOM or any other reason that causes a node crash.
> > > > > In this case some watchdog (like [1] or [2]) should monitor whether
> > > > > the node is alive.
> > > > >
> > > > > 3) GC and deadlock (Java and tx) issues.
> > > > > These should be monitored by a special thread [3] or published via
> > > > > metrics [4].
> > > >
> > > > 4) Throughput, latency and space issues
> > > > Special metrics should be developed according to [5]
> > > >
> > > > > Andrey is asking about case #1 (graceful shutdown), so let's discuss
> > > > > only this case.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-6587
> > > > [2] https://wrapper.tanukisoftware.com/doc/english/download.jsp
> > > > [3] https://issues.apache.org/jira/browse/IGNITE-6171
> > > > [4]
> > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > > > 7%3A+Ignite+internal+problems+detection
> > > > [5]
> > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > > > 6%3A+Metrics+improvements
> > > >
> > > >
> > > > > On Wed, Nov 15, 2017 at 1:34 PM, Vladimir Ozerov <voze...@gridgain.com>
> > > > > wrote:
> > > >
> > > > > > AFAIK the idea was not only to shut down the node, but also to give
> > > > > > the user (e.g. an administrator) the ability to observe the problem
> > > > > > from the outside, e.g. through JMX. E.g. if we detect a Java-level
> > > > > > deadlock, it doesn't mean that the only possible solution is node
> > > > > > shutdown. The reaction could also be a no-op, e.g. to give the user
> > > > > > a chance to collect additional system info, or simply because this
> > > > > > particular deadlock is resolvable (e.g. Lock.lockInterruptibly()).
> > > > > > So, as we need to expose health info through JMX anyway, we could
> > > > > > also give the user programmatic access to it. Alternatively, we can
> > > > > > expose this info through JMX only and ask the user to get an
> > > > > > instance of that bean manually.
> > > > >
> > > > > On Wed, Nov 15, 2017 at 1:19 PM, Anton Vinogradov <
> > > > > avinogra...@gridgain.com>
> > > > > wrote:
> > > > >
> > > > > > Vova,
> > > > > >
> > > > > > > Could you point to the metric you're talking about?
> > > > > >
> > > > > > > On Wed, Nov 15, 2017 at 1:06 PM, Andrey Kuznetsov <stku...@gmail.com>
> > > > > > > wrote:
> > > > > >
> > > > > > > Vladimir,
> > > > > > >
> > > > > > > > Could you please clarify what local metrics are? Should I
> > > > > > > > extend the Ignite interface by adding something similar to
> > > > > > > > dataRegionMetrics(), or is there some universal mechanism to
> > > > > > > > handle metrics?
> > > > > > >
> > > > > > > 2017-11-15 8:30 GMT+03:00 Vladimir Ozerov <
> voze...@gridgain.com
> > >:
> > > > > > > >
> > > > > > > > > This information should be available through local metrics,
> > > > > > > > > so that it is accessible from the Ignite instance.
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
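
For reference, the thread above mentions two things an administrator might
want before a node is stopped: a thread dump taken programmatically (as
opposed to jstack or VisualVM) and detection of a Java-level deadlock. Both
are available through the standard java.lang.management API. The sketch below
is only an illustration under that assumption: the class name
ThreadDiagnostics and its two methods are hypothetical, they are not part of
Ignite, and nothing here reflects the design proposed in IEP-5 or IEP-7.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper (not an Ignite class) built only on JDK management beans. */
public class ThreadDiagnostics {
    /** Returns a textual dump of all live threads, similar in spirit to jstack output. */
    public static String dumpThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();

        // Lock details are requested only when the JVM reports support for them.
        ThreadInfo[] infos = mx.dumpAllThreads(
            mx.isObjectMonitorUsageSupported(),
            mx.isSynchronizerUsageSupported());

        for (ThreadInfo info : infos) {
            // ThreadInfo.toString() prints a limited number of stack frames;
            // use info.getStackTrace() if the full trace is required.
            sb.append(info);
        }

        return sb.toString();
    }

    /** Returns the number of threads currently in a Java-level deadlock (0 if none). */
    public static int deadlockedThreadCount() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        // findDeadlockedThreads() also covers java.util.concurrent locks,
        // but it is only usable when synchronizer monitoring is supported.
        long[] ids = mx.isSynchronizerUsageSupported()
            ? mx.findDeadlockedThreads()
            : mx.findMonitorDeadlockedThreads();

        return ids == null ? 0 : ids.length;
    }
}
```

The same ThreadMXBean is what JMX clients see under the
java.lang:type=Threading object name, so the information can be read either
in-process, as above, or from the outside through JMX.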
