Re: Ignite Enhancement Proposal #7 (Internal problems detection)

Дмитрий Сорокин Wed, 29 Nov 2017 03:10:11 -0800

Vladimir,

At the moment the policy looks like this:


/**
 * Policy that defines how a node will process failures. Note that the default
 * failure processing policy is defined by the
 * {@link IgniteConfiguration#DFLT_FLR_PLC} property.
 */
public enum FailureProcessingPolicy {
    /** Restart JVM. */
    RESTART_JVM,

    /** Stop. */
    STOP,

    /** Noop. */
    NOOP;
}

Can you give an example of different event (failure) types that need
different reactions?
We expect that all failures in which some Ignite system worker (or other
critical component) breaks need the same policy on a given node.
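
For illustration only, here is a minimal standalone sketch of how a single
per-node policy could be applied uniformly to every critical failure. It does
not use Ignite itself: NodeConfigSketch is a hypothetical stand-in for
IgniteConfiguration, the STOP default is an assumption (the value of
DFLT_FLR_PLC is not stated in this thread), and react() only describes the
reaction instead of actually stopping or restarting anything.

```java
import java.util.Objects;

/** Entry point of the sketch: one policy per node, same reaction for any failure type. */
class FailurePolicySketch {
    /** Returns a description of the reaction chosen by the node-wide policy. */
    static String react(NodeConfigSketch cfg, Throwable failure) {
        String type = failure.getClass().getSimpleName();

        switch (cfg.getFailureProcessingPolicy()) {
            case RESTART_JVM:
                return "restart JVM: " + type;
            case STOP:
                return "stop node: " + type;
            case NOOP:
            default:
                return "report only: " + type;
        }
    }

    public static void main(String[] args) {
        NodeConfigSketch cfg = new NodeConfigSketch()
            .setFailureProcessingPolicy(FailureProcessingPolicy.RESTART_JVM);

        // Two different failure types get the same reaction on this node.
        System.out.println(react(cfg, new IllegalStateException("worker died")));
        System.out.println(react(cfg, new OutOfMemoryError("no space")));
    }
}

/** Copy of the enum proposed in this message. */
enum FailureProcessingPolicy { RESTART_JVM, STOP, NOOP }

/** Hypothetical stand-in for IgniteConfiguration; only the policy property is modeled. */
class NodeConfigSketch {
    /** Assumed default, chosen only for this sketch. */
    static final FailureProcessingPolicy DFLT_FLR_PLC = FailureProcessingPolicy.STOP;

    private FailureProcessingPolicy flrPlc = DFLT_FLR_PLC;

    /** Sets the node-wide policy; returns this for chaining. */
    NodeConfigSketch setFailureProcessingPolicy(FailureProcessingPolicy flrPlc) {
        this.flrPlc = Objects.requireNonNull(flrPlc, "flrPlc");
        return this;
    }

    FailureProcessingPolicy getFailureProcessingPolicy() {
        return flrPlc;
    }
}
```

Per-failure-type reactions, if they turn out to be needed, would require a
different shape (for example a map from failure type to policy), which is
exactly the case I am asking for an example of.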


2017-11-29 13:56 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:

> Dmitry,
>
> Thank you, but what does FailureProcessingPolicy look like? It is not clear
> how I can configure different reactions to different event types.
>
> On Wed, Nov 29, 2017 at 1:47 PM, Дмитрий Сорокин <sbt.sorokin....@gmail.com> wrote:
>
> > Vladimir,
> >
> > These policies (policy, in fact) can be configured in IgniteConfiguration
> > by calling setFailureProcessingPolicy(FailureProcessingPolicy flrPlc)
> > method.
> >
> > 2017-11-29 10:35 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
> >
> > > Denis,
> > >
> > > Yes, but can we look at proposed API before we dig into implementation?
> > >
> > > On Tue, Nov 28, 2017 at 9:43 PM, Denis Magda <dma...@apache.org> wrote:
> > >
> > > > I think the failure processing policy should be configured via
> > > > IgniteConfiguration in a way similar to the segmentation policies.
> > > >
> > > > —
> > > > Denis
> > > >
> > > > > On Nov 27, 2017, at 11:28 PM, Vladimir Ozerov <voze...@gridgain.com> wrote:
> > > > >
> > > > > Dmitry,
> > > > >
> > > > > How will these policies be configured? Do you have any API in mind?
> > > > >
> > > > > On Thu, Nov 23, 2017 at 6:26 PM, Denis Magda <dma...@apache.org> wrote:
> > > > >
> > > > > >> No objections here. Additional policies like EXEC might be added later
> > > > > >> depending on user needs.
> > > > >>
> > > > >> —
> > > > >> Denis
> > > > >>
> > > > > >>> On Nov 23, 2017, at 2:26 AM, Дмитрий Сорокин <sbt.sorokin....@gmail.com> wrote:
> > > > >>>
> > > > >>> Denis,
> > > > > >>> I propose starting with the first three policies (they are already
> > > > > >>> implemented and just await some code cleanup, commit & review).
> > > > > >>> As for the fourth policy (EXEC), I think it is an additional
> > > > > >>> property (some script path) rather than a policy.
> > > > >>>
> > > > >>> 2017-11-23 0:43 GMT+03:00 Denis Magda <dma...@apache.org>:
> > > > >>>
> > > > >>>> Just provide FailureProcessingPolicy with possible reactions:
> > > > > >>>> - NOOP - exceptions will be reported, metrics will be triggered, but
> > > > > >>>> an affected Ignite process won’t be touched.
> > > > > >>>> - HALT (or STOP or KILL) - all the actions of NOOP + Ignite process
> > > > > >>>> termination.
> > > > > >>>> - RESTART - NOOP actions + process restart.
> > > > > >>>> - EXEC - execute a custom script provided by the user.
> > > > > >>>>
> > > > > >>>> If needed, the policy can be set per known failure, such as OOM or
> > > > > >>>> persistence errors, so that the user can act according to the context.
> > > > >>>>
> > > > >>>> —
> > > > >>>> Denis
> > > > >>>>
> > > > > >>>>> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <voze...@gridgain.com> wrote:
> > > > >>>>>
> > > > > >>>>> In the first iteration I would focus only on reporting facilities, to
> > > > > >>>>> let administrators spot dangerous situations. In the second phase,
> > > > > >>>>> when all reporting and metrics are ready, we can think about some
> > > > > >>>>> automatic actions.
> > > > >>>>>
> > > > > >>>>> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherka...@gridgain.com> wrote:
> > > > >>>>>
> > > > >>>>>> Hi Anton,
> > > > >>>>>>
> > > > > >>>>>> I don't think we should shut down a node in case of
> > > > > >>>>>> IgniteOOMException. If one node has no space, then the others
> > > > > >>>>>> probably don't have it either, so rebalancing will cause IgniteOOM
> > > > > >>>>>> on all other nodes and will kill the whole cluster. I think for some
> > > > > >>>>>> configurations the cluster should survive and allow the user to
> > > > > >>>>>> clean the cache and/or add more nodes.
> > > > >>>>>>
> > > > >>>>>> Thanks,
> > > > >>>>>> Mikhail.
> > > > >>>>>>
> > > > > >>>>>> On Nov 20, 2017, at 6:53 PM, "Anton Vinogradov" <avinogra...@gridgain.com> wrote:
> > > > >>>>>>
> > > > >>>>>>> Igniters,
> > > > >>>>>>>
> > > > > >>>>>>> Internal problems may, and unfortunately do, cause unexpected cluster
> > > > > >>>>>>> behavior.
> > > > > >>>>>>> We should define the behavior for the case when any internal problem
> > > > > >>>>>>> happens.
> > > > >>>>>>>
> > > > > >>>>>>> Well-known internal problems can be split into:
> > > > > >>>>>>> 1) OOM or any other cause of a node crash
> > > > > >>>>>>>
> > > > > >>>>>>> 2) Situations requiring graceful node shutdown with a custom
> > > > > >>>>>>> notification
> > > > > >>>>>>> - IgniteOutOfMemoryException
> > > > > >>>>>>> - Persistence errors
> > > > > >>>>>>> - ExchangeWorker exits with error
> > > > > >>>>>>>
> > > > > >>>>>>> 3) Performance issues that should be covered by metrics
> > > > > >>>>>>> - GC STW duration
> > > > > >>>>>>> - Timed out tasks and jobs
> > > > > >>>>>>> - TX deadlock
> > > > > >>>>>>> - Hung TX (waits for some service)
> > > > > >>>>>>> - Java deadlocks
> > > > >>>>>>>
> > > > > >>>>>>> I created a special issue [1] to make sure all these metrics will be
> > > > > >>>>>>> presented in WebConsole or VisorConsole (which is preferred?)
> > > > > >>>>>>>
> > > > > >>>>>>> 4) Situations requiring an external monitoring implementation
> > > > > >>>>>>> - GC STW duration exceeds the maximum allowed length (the node should
> > > > > >>>>>>> be stopped before the STW pause finishes)
> > > > >>>>>>>
> > > > > >>>>>>> All these problems were reported by different people at different
> > > > > >>>>>>> times, so we should reanalyze each of them and, possibly, find better
> > > > > >>>>>>> ways to solve them than those described in the issues.
> > > > >>>>>>>
> > > > > >>>>>>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention
> > > > > >>>>>>> something else :)
> > > > >>>>>>>
> > > > >>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
> > > > >>>>>>> [2]
> > > > >>>>>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > > > >>>>>>> 7%3A+Ignite+internal+problems+detection
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>>>
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> >
>
