Re: Ignite Enhancement Proposal #7 (Internal problems detection)

Дмитрий Сорокин Wed, 29 Nov 2017 02:48:07 -0800

Vladimir,

The policy (there is only one, in fact) can be configured in
IgniteConfiguration by calling the
setFailureProcessingPolicy(FailureProcessingPolicy flrPlc) method.
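
As a rough illustration only, the call could look like the sketch below.
Everything in it is a self-contained stand-in rather than the actual Ignite
API: SketchConfiguration is a hypothetical substitute for IgniteConfiguration,
the NOOP default and the fluent setter are assumptions, and the enum constant
HALT stands for the process-termination policy discussed in this thread.

```java
import java.util.Objects;

// Hypothetical sketch: none of these types are the real Ignite classes.
public class FailurePolicyConfigSketch {
    /** The first three reactions discussed in the thread (EXEC is left out). */
    enum FailureProcessingPolicy { NOOP, HALT, RESTART }

    /** Stand-in for IgniteConfiguration that holds only the policy property. */
    static class SketchConfiguration {
        // Assumed default: report the failure, leave the process untouched.
        private FailureProcessingPolicy flrPlc = FailureProcessingPolicy.NOOP;

        SketchConfiguration setFailureProcessingPolicy(FailureProcessingPolicy flrPlc) {
            this.flrPlc = Objects.requireNonNull(flrPlc, "flrPlc");
            return this;
        }

        FailureProcessingPolicy getFailureProcessingPolicy() {
            return flrPlc;
        }
    }

    /** Describes what a node would do on an internal failure under the given policy. */
    static String react(FailureProcessingPolicy plc) {
        switch (plc) {
            case HALT:
                return "report, then stop the node";
            case RESTART:
                return "report, then restart the node";
            default:
                return "report only";
        }
    }

    public static void main(String[] args) {
        SketchConfiguration cfg = new SketchConfiguration()
            .setFailureProcessingPolicy(FailureProcessingPolicy.RESTART);

        System.out.println(react(cfg.getFailureProcessingPolicy()));
    }
}
```

A single policy per node keeps the configuration surface small; an EXEC-style
script path would then be a separate property, not another enum constant.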


2017-11-29 10:35 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:

> Denis,
>
> Yes, but can we look at the proposed API before we dig into implementation?
>
> On Tue, Nov 28, 2017 at 9:43 PM, Denis Magda <dma...@apache.org> wrote:
>
> > I think the failure processing policy should be configured via
> > IgniteConfiguration in a way similar to the segmentation policies.
> >
> > —
> > Denis
> >
> > > On Nov 27, 2017, at 11:28 PM, Vladimir Ozerov <voze...@gridgain.com>
> > wrote:
> > >
> > > Dmitry,
> > >
> > > How will these policies be configured? Do you have any API in mind?
> > >
> > > On Thu, Nov 23, 2017 at 6:26 PM, Denis Magda <dma...@apache.org>
> wrote:
> > >
> > >> No objections here. Additional policies like EXEC might be added later
> > >> depending on user needs.
> > >>
> > >> —
> > >> Denis
> > >>
> > >>> On Nov 23, 2017, at 2:26 AM, Дмитрий Сорокин <
> > sbt.sorokin....@gmail.com>
> > >> wrote:
> > >>>
> > >>> Denis,
> > >>> I propose starting with the first three policies (they are already
> > >>> implemented and just await some code cleanup, commit & review).
> > >>> As for the fourth policy (EXEC), I think it is an additional property
> > >>> (a script path) rather than a policy.
> > >>>
> > >>> 2017-11-23 0:43 GMT+03:00 Denis Magda <dma...@apache.org>:
> > >>>
> > >>>> Just provide FailureProcessingPolicy with possible reactions:
> > >>>> - NOOP - exceptions will be reported, metrics will be triggered, but
> > >>>> an affected Ignite process won’t be touched.
> > >>>> - HALT (or STOP or KILL) - all the actions of NOOP + Ignite
> > >>>> process termination.
> > >>>> - RESTART - NOOP actions + process restart.
> > >>>> - EXEC - execute a custom script provided by the user.
> > >>>>
> > >>>> If needed, the policy can be set per known failure, such as OOM or
> > >>>> persistence errors, so that the user can act accordingly based on
> > >>>> the context.
> > >>>>
> > >>>> —
> > >>>> Denis
> > >>>>
> > >>>>> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <
> voze...@gridgain.com>
> > >>>> wrote:
> > >>>>>
> > >>>>> In the first iteration I would focus only on reporting facilities, to
> > >>>>> let the administrator spot dangerous situations. In the second phase,
> > >>>>> when all reporting and metrics are ready, we can think about some
> > >>>>> automatic actions.
> > >>>>>
> > >>>>> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <
> > >>>> mcherka...@gridgain.com
> > >>>>>> wrote:
> > >>>>>
> > >>>>>> Hi Anton,
> > >>>>>>
> > >>>>>> I don't think that we should shut down a node in case of
> > >>>>>> IgniteOOMException: if one node has no space, then the others probably
> > >>>>>> don't have it either, so rebalancing will cause IgniteOOM on all other
> > >>>>>> nodes and will kill the whole cluster. I think for some configurations
> > >>>>>> the cluster should survive and allow the user to clean the cache
> > >>>>>> and/or add more nodes.
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Mikhail.
> > >>>>>>
> > >>>>>> On Nov 20, 2017, at 6:53 PM, "Anton Vinogradov" <
> > >>>>>> avinogra...@gridgain.com> wrote:
> > >>>>>>
> > >>>>>>> Igniters,
> > >>>>>>>
> > >>>>>>> Internal problems may, and unfortunately do, cause unexpected cluster
> > >>>>>>> behavior.
> > >>>>>>> We should define the behavior in case any internal problem happens.
> > >>>>>>>
> > >>>>>>> Well-known internal problems can be split into:
> > >>>>>>> 1) OOM or any other reason that causes a node crash
> > >>>>>>>
> > >>>>>>> 2) Situations that require graceful node shutdown with custom
> > >>>>>>> notification
> > >>>>>>> - IgniteOutOfMemoryException
> > >>>>>>> - Persistence errors
> > >>>>>>> - ExchangeWorker exits with error
> > >>>>>>>
> > >>>>>>> 3) Performance issues that should be covered by metrics
> > >>>>>>> - GC STW duration
> > >>>>>>> - Timed out tasks and jobs
> > >>>>>>> - TX deadlock
> > >>>>>>> - Hung Tx (waits for some service)
> > >>>>>>> - Java Deadlocks
> > >>>>>>>
> > >>>>>>> I created a special issue [1] to make sure all these metrics will be
> > >>>>>>> presented in WebConsole or VisorConsole (which is preferred?)
> > >>>>>>>
> > >>>>>>> 4) Situations that require an external monitoring implementation
> > >>>>>>> - GC STW duration exceeds the maximum possible length (the node should
> > >>>>>>> be stopped before the STW finishes)
> > >>>>>>>
> > >>>>>>> All these problems were reported by different people at different
> > >>>>>>> times, so we should reanalyze each of them and, possibly, find better
> > >>>>>>> ways to solve them than those described in the issues.
> > >>>>>>>
> > >>>>>>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention
> > >>>>>>> something else :)
> > >>>>>>>
> > >>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
> > >>>>>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>>>
> > >>
> > >>
> >
> >
>
