If an Ignite operation hangs for some reason, whether due to an internal problem or buggy application code, it needs to eventually *time out*.
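To illustrate the idea, here is a minimal, hypothetical sketch of a client-side timeout guard built on plain `java.util.concurrent` (this is not Ignite API; the class name, pool, and limits are illustrative assumptions):

```java
import java.util.concurrent.*;

// Illustrative sketch: wrap a potentially hanging operation so it fails
// with TimeoutException instead of freezing the application forever.
public class OperationTimeout {
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Run an operation, bounding its wall-clock time.
    static <T> T withTimeout(Callable<T> op, long timeoutMs) throws Exception {
        Future<T> fut = POOL.submit(op);
        try {
            return fut.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            fut.cancel(true); // interrupt the hung operation
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast operation completes normally.
        System.out.println(withTimeout(() -> 42, 1000));
        // A hung operation times out instead of waiting for human intervention.
        try {
            withTimeout(() -> { Thread.sleep(60_000); return 0; }, 200);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
        POOL.shutdownNow();
    }
}
```

The same shape would apply to compute, service, cache, and SQL operations: the caller gets a bounded wait and an explicit failure it can handle.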
Take the atomic operations case that Val recently brought to our attention: http://apache-ignite-developers.2346864.n4.nabble.com/Timeouts-in-atomic-cache-td19839.html An application must not freeze waiting for human intervention if an atomic update fails internally. Moreover, I would let every possible operation time out:
- Ignite compute computations.
- Ignite service calls.
- Atomic/transactional cache updates.
- SQL queries.

I'm not sure this is covered by any of the tickets from IEP-7. Any thoughts/suggestions before a ticket is created?

—
Denis

> On Nov 20, 2017, at 8:56 AM, Anton Vinogradov <avinogra...@gridgain.com>
> wrote:
>
> Dmitry,
>
> There are two cases:
> 1) STW duration is long -> notify monitoring via a JMX metric.
>
> 2) STW duration exceeds N seconds -> no need to wait for anything.
> We already know that the node will be segmented, or that a pause longer than N
> seconds will affect cluster performance.
> The better option is to kill the node ASAP to protect the cluster. Some customers
> have huge timeouts, and such a node can kill the whole cluster if it is not
> killed by a watchdog.
>
> On Mon, Nov 20, 2017 at 7:23 PM, Dmitry Pavlov <dpavlov....@gmail.com>
> wrote:
>
>> Hi Anton,
>>
>>> - GC STW duration exceeds the maximum possible length (the node should be stopped
>> before the STW finishes)
>>
>> Are you sure we should kill the node in case of a long STW? Could we instead produce
>> warnings in the logs and monitoring tools and wait a little longer for the node to
>> become alive when we detect an STW? In this case we could notify the coordinator or
>> another node that 'the current node is in an STW, please wait longer than 3 heartbeat
>> timeouts'.
>>
>> Is it probable that such pauses will occur only rarely?
>>
>> Sincerely,
>> Dmitriy Pavlov
>>
>> On Mon, Nov 20, 2017 at 18:53, Anton Vinogradov <avinogra...@gridgain.com>:
>>
>>> Igniters,
>>>
>>> Internal problems may, unfortunately, cause unexpected cluster
>>> behavior.
>>> We should determine the behavior in case any internal problem happens.
>>>
>>> Well-known internal problems can be split into:
>>> 1) OOM or any other reason causing a node crash
>>>
>>> 2) Situations requiring a graceful node shutdown with custom notification:
>>> - IgniteOutOfMemoryException
>>> - Persistence errors
>>> - ExchangeWorker exits with an error
>>>
>>> 3) Performance issues, which should be covered by metrics:
>>> - GC STW duration
>>> - Timed-out tasks and jobs
>>> - TX deadlocks
>>> - Hung transactions (waiting for some service)
>>> - Java deadlocks
>>>
>>> I created a dedicated issue [1] to make sure all these metrics will be
>>> presented in WebConsole or VisorConsole (which is preferred?)
>>>
>>> 4) Situations requiring an external monitoring implementation:
>>> - GC STW duration exceeds the maximum tolerable length (the node should be
>>> stopped before the STW finishes)
>>>
>>> All these problems were reported by different people at different times,
>>> so we should reanalyze each of them and possibly find better ways to
>>> solve them than those described in the issues.
>>>
>>> P.S. IEP-7 [2] already contains 9 issues; feel free to mention anything
>>> else :)
>>>
>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
>>> [2]
>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
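To make point 4 concrete, below is a hedged sketch of the usual in-JVM watchdog technique: a sibling thread sleeps for a short interval and measures its actual scheduling drift; drift far beyond the interval implies a stop-the-world pause. Note the caveat matching the thread above: an in-JVM watchdog only observes a pause after it ends, so stopping a node *during* an STW requires an external monitor process. All names and thresholds here are illustrative assumptions, not Ignite code:

```java
// Illustrative sketch of an STW watchdog (not Ignite code).
public class StwWatchdog {
    static final long INTERVAL_MS = 100;

    // Pure decision: kill the node only when the observed pause exceeds
    // the configured maximum (beyond which segmentation is certain anyway).
    static boolean shouldKillNode(long observedPauseMs, long maxPauseMs) {
        return observedPauseMs > maxPauseMs;
    }

    public static void main(String[] args) throws InterruptedException {
        long maxPauseMs = 3_000; // illustrative limit
        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();
            Thread.sleep(INTERVAL_MS);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Drift beyond the requested sleep approximates an STW pause.
            long pauseMs = elapsedMs - INTERVAL_MS;
            if (shouldKillNode(pauseMs, maxPauseMs)) {
                System.out.println("STW pause " + pauseMs + " ms exceeds limit, stopping node");
                // In a real watchdog, Runtime.getRuntime().halt(...) or an
                // external kill signal would go here.
            }
        }
        System.out.println("watchdog ok");
    }
}
```

The decision function is kept separate from the measurement loop so the kill policy (JMX notification for long pauses vs. halting for pauses past the limit) can be tested and tuned independently.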