Just provide a FailureProcessingPolicy with the possible reactions:
- NOOP - exceptions will be reported and metrics will be triggered, but the affected Ignite process won't be touched.
- HALT (or STOP or KILL) - all the actions of NOOP + Ignite process termination.
- RESTART - NOOP actions + process restart.
- EXEC - execute a custom script provided by the user.
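
As a rough illustration of the reactions listed above, here is a minimal Java sketch. Everything apart from the policy names is hypothetical and not part of Ignite: the FailureProcessingSketch class, the FailureType enum, and the override/plan methods. It also assumes that EXEC reports the failure before running the user script, which the proposal does not spell out, and it only returns the planned actions as strings instead of performing them.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class FailureProcessingSketch {
    /** Reactions from the proposal. */
    enum FailureProcessingPolicy { NOOP, HALT, RESTART, EXEC }

    /** Hypothetical failure categories, named after examples in this thread. */
    enum FailureType { OUT_OF_MEMORY, PERSISTENCE_ERROR, EXCHANGE_WORKER_EXIT, OTHER }

    private final FailureProcessingPolicy defaultPolicy;

    /** Optional per-failure overrides of the default policy. */
    private final Map<FailureType, FailureProcessingPolicy> overrides =
        new EnumMap<>(FailureType.class);

    FailureProcessingSketch(FailureProcessingPolicy defaultPolicy) {
        this.defaultPolicy = defaultPolicy;
    }

    /** Sets the policy for one known failure type; returns this for chaining. */
    FailureProcessingSketch override(FailureType type, FailureProcessingPolicy policy) {
        overrides.put(type, policy);
        return this;
    }

    /** Resolves the policy for a failure, falling back to the default. */
    FailureProcessingPolicy policyFor(FailureType type) {
        return overrides.getOrDefault(type, defaultPolicy);
    }

    /** Lists the actions, in order, that would be taken for the given failure. */
    List<String> plan(FailureType type) {
        List<String> actions = new ArrayList<>();
        // Assumption: every policy starts with the NOOP actions.
        actions.add("report exception");
        actions.add("trigger metrics");
        switch (policyFor(type)) {
            case HALT:
                actions.add("terminate process");
                break;
            case RESTART:
                actions.add("restart process");
                break;
            case EXEC:
                actions.add("run user script");
                break;
            case NOOP:
                break; // nothing beyond reporting
        }
        return actions;
    }

    public static void main(String[] args) {
        FailureProcessingSketch handler =
            new FailureProcessingSketch(FailureProcessingPolicy.NOOP)
                .override(FailureType.PERSISTENCE_ERROR, FailureProcessingPolicy.HALT);

        System.out.println(handler.plan(FailureType.OUT_OF_MEMORY));
        System.out.println(handler.plan(FailureType.PERSISTENCE_ERROR));
    }
}
```

Keeping NOOP as the default and overriding only specific failure types means an unclassified failure is still reported without touching the process.
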
If needed, the policy can be set per known failure, such as OOM or persistence errors, so that the user can act accordingly based on the context.

Denis

> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <voze...@gridgain.com> wrote:
>
> In the first iteration I would focus only on reporting facilities, to let
> the administrator spot dangerous situations. In the second phase, when all
> reporting and metrics are ready, we can think about some automatic actions.
>
> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherka...@gridgain.com> wrote:
>
>> Hi Anton,
>>
>> I don't think that we should shut down a node in case of IgniteOOMException.
>> If one node has no space, then the others probably don't have it either, so
>> rebalancing will cause IgniteOOM on all other nodes and will kill the whole
>> cluster. I think for some configurations the cluster should survive and
>> allow the user to clean the cache and/or add more nodes.
>>
>> Thanks,
>> Mikhail.
>>
>> On Nov 20, 2017, at 6:53 PM, "Anton Vinogradov" <avinogra...@gridgain.com> wrote:
>>
>>> Igniters,
>>>
>>> Internal problems may, and unfortunately do, cause unexpected cluster
>>> behavior.
>>> We should define the behavior in case any internal problem happens.
>>>
>>> Well-known internal problems can be split into:
>>> 1) OOM or any other reason that causes a node crash
>>>
>>> 2) Situations requiring graceful node shutdown with custom notification
>>> - IgniteOutOfMemoryException
>>> - Persistence errors
>>> - ExchangeWorker exits with error
>>>
>>> 3) Performance issues that should be covered by metrics
>>> - GC STW duration
>>> - Timed-out tasks and jobs
>>> - TX deadlock
>>> - Hung TX (waits for some service)
>>> - Java deadlocks
>>>
>>> I created a special issue [1] to make sure all these metrics will be
>>> presented in WebConsole or VisorConsole (which is preferred?)
>>>
>>> 4) Situations requiring an external monitoring implementation
>>> - GC STW duration exceeds the maximum possible length (the node should be
>>> stopped before the STW pause finishes)
>>>
>>> All these problems were reported by different people at different times,
>>> so we should reanalyze each of them and, possibly, find better ways to
>>> solve them than those described in the issues.
>>>
>>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention something
>>> else :)
>>>
>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection