Dmitry,

Thank you, but what does FailureProcessingPolicy look like? It is not clear how I can configure different reactions to different event types.
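
To make the question concrete, here is a minimal sketch of one shape a per-failure-type policy could take. Everything in it is hypothetical: the FailureType and Action enums, the on(...) and actionFor(...) methods, and the body of FailureProcessingPolicy are assumptions for illustration only, not an existing Ignite API. HALT stands in for the HAULT/STOP/KILL reaction proposed in the thread, and EXEC is left out because it was suggested to be a separate property rather than a policy.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch only; none of these types exist in Apache Ignite. */
public class FailurePolicySketch {
    /** Reactions proposed in the thread (EXEC omitted, see the note above). */
    enum Action { NOOP, HALT, RESTART }

    /** Hypothetical failure types, named after the problems listed in the thread. */
    enum FailureType { OUT_OF_MEMORY, PERSISTENCE_ERROR, EXCHANGE_WORKER_ERROR }

    /** Maps each failure type to a reaction, falling back to a default reaction. */
    static final class FailureProcessingPolicy {
        private final Action dflt;
        private final Map<FailureType, Action> overrides = new EnumMap<>(FailureType.class);

        FailureProcessingPolicy(Action dflt) {
            this.dflt = dflt;
        }

        /** Registers a reaction for one failure type; returns this for chaining. */
        FailureProcessingPolicy on(FailureType type, Action action) {
            overrides.put(type, action);
            return this;
        }

        /** Returns the configured reaction, or the default if none was registered. */
        Action actionFor(FailureType type) {
            return overrides.getOrDefault(type, dflt);
        }
    }

    public static void main(String[] args) {
        // Default is NOOP (report only); two failure types get a stronger reaction.
        FailureProcessingPolicy plc = new FailureProcessingPolicy(Action.NOOP)
            .on(FailureType.PERSISTENCE_ERROR, Action.HALT)
            .on(FailureType.EXCHANGE_WORKER_ERROR, Action.RESTART);

        for (FailureType t : FailureType.values())
            System.out.println(t + " -> " + plc.actionFor(t));
    }
}
```

With a shape like this, a single object passed to setFailureProcessingPolicy(...) could still carry different reactions per failure type, with OOM left at NOOP so that one full node does not take the rest of the cluster down.
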

On Wed, Nov 29, 2017 at 1:47 PM, Дмитрий Сорокин <sbt.sorokin....@gmail.com> wrote:

> Vladimir,
>
> These policies (policy, in fact) can be configured in IgniteConfiguration
> by calling the setFailureProcessingPolicy(FailureProcessingPolicy flrPlc)
> method.
>
> 2017-11-29 10:35 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>
> > Denis,
> >
> > Yes, but can we look at the proposed API before we dig into implementation?
> >
> > On Tue, Nov 28, 2017 at 9:43 PM, Denis Magda <dma...@apache.org> wrote:
> >
> > > I think the failure processing policy should be configured via
> > > IgniteConfiguration in a way similar to the segmentation policies.
> > >
> > > --
> > > Denis
> > >
> > > > On Nov 27, 2017, at 11:28 PM, Vladimir Ozerov <voze...@gridgain.com> wrote:
> > > >
> > > > Dmitry,
> > > >
> > > > How will these policies be configured? Do you have any API in mind?
> > > >
> > > > On Thu, Nov 23, 2017 at 6:26 PM, Denis Magda <dma...@apache.org> wrote:
> > > >
> > > >> No objections here. Additional policies like EXEC might be added later
> > > >> depending on user needs.
> > > >>
> > > >> --
> > > >> Denis
> > > >>
> > > >>> On Nov 23, 2017, at 2:26 AM, Дмитрий Сорокин <sbt.sorokin....@gmail.com> wrote:
> > > >>>
> > > >>> Denis,
> > > >>> I propose to start with the first three policies (they are already
> > > >>> implemented and just await some code cleanup, commit & review).
> > > >>> As for the fourth policy (EXEC), I think it is an additional property
> > > >>> (some script path) rather than a policy.
> > > >>>
> > > >>> 2017-11-23 0:43 GMT+03:00 Denis Magda <dma...@apache.org>:
> > > >>>
> > > >>>> Just provide FailureProcessingPolicy with possible reactions:
> > > >>>> - NOOP - exceptions will be reported, metrics will be triggered, but
> > > >>>>   the affected Ignite process won't be touched.
> > > >>>> - HALT (or STOP or KILL) - all the actions of NOOP + Ignite
> > > >>>>   process termination.
> > > >>>> - RESTART - NOOP actions + process restart.
> > > >>>> - EXEC - execute a custom script provided by the user.
> > > >>>>
> > > >>>> If needed, the policy can be set per known failure, such as OOM or
> > > >>>> persistence errors, so that the user can act accordingly based on the
> > > >>>> context.
> > > >>>>
> > > >>>> --
> > > >>>> Denis
> > > >>>>
> > > >>>>> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <voze...@gridgain.com> wrote:
> > > >>>>>
> > > >>>>> In the first iteration I would focus only on reporting facilities, to
> > > >>>>> let the administrator spot a dangerous situation. In the second phase,
> > > >>>>> when all reporting and metrics are ready, we can think about some
> > > >>>>> automatic actions.
> > > >>>>>
> > > >>>>> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherka...@gridgain.com> wrote:
> > > >>>>>
> > > >>>>>> Hi Anton,
> > > >>>>>>
> > > >>>>>> I don't think that we should shut down a node in case of
> > > >>>>>> IgniteOOMException. If one node has no space, then the others
> > > >>>>>> probably don't have it either, so rebalancing will cause IgniteOOM on
> > > >>>>>> all other nodes and will kill the whole cluster. I think that for
> > > >>>>>> some configurations the cluster should survive and allow the user to
> > > >>>>>> clean the cache and/or add more nodes.
> > > >>>>>>
> > > >>>>>> Thanks,
> > > >>>>>> Mikhail.
> > > >>>>>>
> > > >>>>>> On Nov 20, 2017, 6:53 PM, "Anton Vinogradov" <avinogra...@gridgain.com> wrote:
> > > >>>>>>
> > > >>>>>>> Igniters,
> > > >>>>>>>
> > > >>>>>>> Internal problems may, and unfortunately do, cause unexpected cluster
> > > >>>>>>> behavior.
> > > >>>>>>> We should define the behavior in case any internal problem happens.
> > > >>>>>>>
> > > >>>>>>> Well-known internal problems can be split into:
> > > >>>>>>> 1) OOM or any other reason that causes a node crash
> > > >>>>>>>
> > > >>>>>>> 2) Situations requiring graceful node shutdown with custom notification
> > > >>>>>>> - IgniteOutOfMemoryException
> > > >>>>>>> - Persistence errors
> > > >>>>>>> - ExchangeWorker exits with error
> > > >>>>>>>
> > > >>>>>>> 3) Performance issues that should be covered by metrics
> > > >>>>>>> - GC STW duration
> > > >>>>>>> - Timed out tasks and jobs
> > > >>>>>>> - TX deadlock
> > > >>>>>>> - Hung TX (waits for some service)
> > > >>>>>>> - Java deadlocks
> > > >>>>>>>
> > > >>>>>>> I created a special issue [1] to make sure all these metrics will be
> > > >>>>>>> presented in WebConsole or VisorConsole (which is preferred?)
> > > >>>>>>>
> > > >>>>>>> 4) Situations requiring external monitoring implementation
> > > >>>>>>> - GC STW duration exceeds the maximum possible length (node should be
> > > >>>>>>>   stopped before STW finishes)
> > > >>>>>>>
> > > >>>>>>> All these problems were reported by different people at different
> > > >>>>>>> times, so we should reanalyze each of them and, possibly, find better
> > > >>>>>>> ways to solve them than those described in the issues.
> > > >>>>>>>
> > > >>>>>>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention
> > > >>>>>>> something else :)
> > > >>>>>>>
> > > >>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
> > > >>>>>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection