No objections here. Additional policies like EXEC might be added later depending on user needs.
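
To make the proposal in this thread concrete, here is a minimal Java sketch of a policy enum with a default reaction and optional per-failure overrides. Only the name FailureProcessingPolicy and the reactions NOOP, STOP and RESTART come from the thread (STOP is one of the alternative names suggested there). KnownFailure, FailurePolicyRegistry and their methods are hypothetical names made up for illustration, not an existing Ignite API. EXEC is left out, since it may end up as a separate property.

```java
import java.util.EnumMap;
import java.util.Map;

/** Reactions to a detected failure, as listed in the thread. */
enum FailureProcessingPolicy {
    NOOP,     // report the exception and trigger metrics; leave the process alone
    STOP,     // NOOP actions + terminate the Ignite process
    RESTART   // NOOP actions + restart the process
}

/** Hypothetical failure categories, taken from the examples in the thread. */
enum KnownFailure {
    OUT_OF_MEMORY,
    PERSISTENCE_ERROR,
    EXCHANGE_WORKER_ERROR
}

/** Hypothetical registry: one default policy plus optional per-failure overrides. */
final class FailurePolicyRegistry {
    private final FailureProcessingPolicy defaultPolicy;
    private final Map<KnownFailure, FailureProcessingPolicy> overrides =
        new EnumMap<>(KnownFailure.class);

    FailurePolicyRegistry(FailureProcessingPolicy defaultPolicy) {
        this.defaultPolicy = defaultPolicy;
    }

    /** Sets the policy for one failure type and returns this registry for chaining. */
    FailurePolicyRegistry override(KnownFailure failure, FailureProcessingPolicy policy) {
        overrides.put(failure, policy);
        return this;
    }

    /** Returns the override for the failure type, or the default if none was set. */
    FailureProcessingPolicy policyFor(KnownFailure failure) {
        return overrides.getOrDefault(failure, defaultPolicy);
    }
}

final class FailurePolicyDemo {
    public static void main(String[] args) {
        // Stop the node by default, but keep it alive on OOM so the user
        // can clean caches or add nodes, as suggested in the thread.
        FailurePolicyRegistry registry =
            new FailurePolicyRegistry(FailureProcessingPolicy.STOP)
                .override(KnownFailure.OUT_OF_MEMORY, FailureProcessingPolicy.NOOP);

        System.out.println(registry.policyFor(KnownFailure.OUT_OF_MEMORY));
        System.out.println(registry.policyFor(KnownFailure.PERSISTENCE_ERROR));
    }
}
```

Keeping the overrides in a map keyed by failure type means a new reaction or a new failure category can be added later without changing how the lookup works.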
--
Denis

> On Nov 23, 2017, at 2:26 AM, Дмитрий Сорокин <sbt.sorokin....@gmail.com> wrote:
>
> Denis,
> I propose starting with the first three policies (they are already implemented
> and just await some code cleanup, commit, and review).
> As for the fourth policy (EXEC), I think it is more of an additional property
> (a script path) than a policy.
>
> 2017-11-23 0:43 GMT+03:00 Denis Magda <dma...@apache.org>:
>
>> Just provide FailureProcessingPolicy with possible reactions:
>> - NOOP - exceptions will be reported, metrics will be triggered, but the
>> affected Ignite process won’t be touched.
>> - HALT (or STOP or KILL) - all the actions of NOOP + Ignite
>> process termination.
>> - RESTART - NOOP actions + process restart.
>> - EXEC - execute a custom script provided by the user.
>>
>> If needed, the policy can be set per known failure, such as OOM or
>> persistence errors, so that the user can act accordingly based on the context.
>>
>> --
>> Denis
>>
>>> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <voze...@gridgain.com> wrote:
>>>
>>> In the first iteration I would focus only on reporting facilities, to let
>>> the administrator spot dangerous situations. In the second phase, when all
>>> reporting and metrics are ready, we can think about some automatic actions.
>>>
>>> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherka...@gridgain.com> wrote:
>>>
>>>> Hi Anton,
>>>>
>>>> I don't think that we should shut down a node in case of IgniteOOMException.
>>>> If one node has no space, then the others probably don't have it either, so
>>>> rebalancing will cause IgniteOOM on all other nodes and will kill the whole
>>>> cluster. I think that for some configurations the cluster should survive and
>>>> allow the user to clean the cache and/or add more nodes.
>>>>
>>>> Thanks,
>>>> Mikhail.
>>>>
>>>> On Nov 20, 2017, at 6:53 PM, "Anton Vinogradov" <avinogra...@gridgain.com> wrote:
>>>>
>>>>> Igniters,
>>>>>
>>>>> Internal problems may, and unfortunately do, cause unexpected cluster
>>>>> behavior.
>>>>> We should define the behavior for the case when any internal problem happens.
>>>>>
>>>>> Well-known internal problems can be split into:
>>>>> 1) OOM or any other reason that causes a node crash
>>>>>
>>>>> 2) Situations that require graceful node shutdown with custom notification
>>>>> - IgniteOutOfMemoryException
>>>>> - Persistence errors
>>>>> - ExchangeWorker exits with error
>>>>>
>>>>> 3) Performance issues that should be covered by metrics
>>>>> - GC STW duration
>>>>> - Timed-out tasks and jobs
>>>>> - TX deadlock
>>>>> - Hung TX (waits for some service)
>>>>> - Java deadlocks
>>>>>
>>>>> I created a special issue [1] to make sure all these metrics will be
>>>>> presented in WebConsole or VisorConsole (which is preferred?)
>>>>>
>>>>> 4) Situations that require an external monitoring implementation
>>>>> - GC STW duration exceeds the maximum possible length (the node should be
>>>>> stopped before the STW finishes)
>>>>>
>>>>> All these problems were reported by different people at different times,
>>>>> so we should reanalyze each of them and, possibly, find better ways to
>>>>> solve them than those described in the issues.
>>>>>
>>>>> P.S. IEP-7 [2] already contains 9 issues; feel free to mention something
>>>>> else :)
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
>>>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection