If an Ignite operation hangs for some reason, whether due to an internal problem or buggy application code, it needs to eventually *time out*.
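To illustrate the idea, here is a minimal, hypothetical sketch of a client-side timeout guard built on plain `java.util.concurrent` (this is not Ignite API; the class name, pool, and limits are illustrative assumptions):

```java
import java.util.concurrent.*;

// Illustrative sketch: wrap a potentially hanging operation so it fails
// with TimeoutException instead of freezing the application forever.
public class OperationTimeout {
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Run an operation, bounding its wall-clock time.
    static <T> T withTimeout(Callable<T> op, long timeoutMs) throws Exception {
        Future<T> fut = POOL.submit(op);
        try {
            return fut.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            fut.cancel(true); // interrupt the hung operation
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast operation completes normally.
        System.out.println(withTimeout(() -> 42, 1000));
        // A hung operation times out instead of waiting for human intervention.
        try {
            withTimeout(() -> { Thread.sleep(60_000); return 0; }, 200);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
        POOL.shutdownNow();
    }
}
```

The same shape would apply to compute, service, cache, and SQL operations: the caller gets a bounded wait and an explicit failure it can handle.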
Take the atomic operations case that Val recently brought to our attention: http://apache-ignite-developers.2346864.n4.nabble.com/Timeouts-in-atomic-cache-td19839.html An application must not freeze waiting for human intervention if an atomic update fails internally. Moreover, I would let every possible operation time out:
- Ignite compute computations.
- Ignite service calls.
- Atomic/transactional cache updates.
- SQL queries.

I'm not sure this is covered by any of the tickets from IEP-7. Any thoughts/suggestions before a ticket is created?

—
Denis

> On Nov 20, 2017, at 8:56 AM, Anton Vinogradov <avinogra...@gridgain.com>
> wrote:
>
> Dmitry,
>
> There are two cases:
> 1) STW duration is long -> notify monitoring via a JMX metric.
>
> 2) STW duration exceeds N seconds -> no need to wait for anything.
> We already know that the node will be segmented, or that a pause longer than N
> seconds will affect cluster performance.
> The better option is to kill the node ASAP to protect the cluster. Some customers
> have huge timeouts, and such a node can kill the whole cluster if it is not
> killed by a watchdog.
>
> On Mon, Nov 20, 2017 at 7:23 PM, Dmitry Pavlov <dpavlov....@gmail.com>
> wrote:
>
>> Hi Anton,
>>
>>> - GC STW duration exceeds the maximum possible length (the node should be stopped
>> before the STW finishes)
>>
>> Are you sure we should kill the node in case of a long STW? Could we instead produce
>> warnings in the logs and monitoring tools and wait a little longer for the node to
>> become alive when we detect an STW? In this case we could notify the coordinator or
>> another node that 'the current node is in an STW, please wait longer than 3 heartbeat
>> timeouts'.
>>
>> Is it probable that such pauses will occur only rarely?
>>
>> Sincerely,
>> Dmitriy Pavlov
>>
>> On Mon, Nov 20, 2017 at 18:53, Anton Vinogradov <avinogra...@gridgain.com>:
>>
>>> Igniters,
>>>
>>> Internal problems may, unfortunately, cause unexpected cluster
>>> behavior.
>>> We should determine the behavior in case any internal problem happens.
>>>
>>> Well-known internal problems can be split into:
>>> 1) OOM or any other reason causing a node crash
>>>
>>> 2) Situations requiring a graceful node shutdown with custom notification:
>>> - IgniteOutOfMemoryException
>>> - Persistence errors
>>> - ExchangeWorker exits with an error
>>>
>>> 3) Performance issues, which should be covered by metrics:
>>> - GC STW duration
>>> - Timed-out tasks and jobs
>>> - TX deadlocks
>>> - Hung transactions (waiting for some service)
>>> - Java deadlocks
>>>
>>> I created a dedicated issue [1] to make sure all these metrics will be
>>> presented in WebConsole or VisorConsole (which is preferred?)
>>>
>>> 4) Situations requiring an external monitoring implementation:
>>> - GC STW duration exceeds the maximum tolerable length (the node should be
>>> stopped before the STW finishes)
>>>
>>> All these problems were reported by different people at different times,
>>> so we should reanalyze each of them and possibly find better ways to
>>> solve them than those described in the issues.
>>>
>>> P.S. IEP-7 [2] already contains 9 issues; feel free to mention anything
>>> else :)
>>>
>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
>>> [2]
>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
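To make point 4 concrete, below is a hedged sketch of the usual in-JVM watchdog technique: a sibling thread sleeps for a short interval and measures its actual scheduling drift; drift far beyond the interval implies a stop-the-world pause. Note the caveat matching the thread above: an in-JVM watchdog only observes a pause after it ends, so stopping a node *during* an STW requires an external monitor process. All names and thresholds here are illustrative assumptions, not Ignite code:

```java
// Illustrative sketch of an STW watchdog (not Ignite code).
public class StwWatchdog {
    static final long INTERVAL_MS = 100;

    // Pure decision: kill the node only when the observed pause exceeds
    // the configured maximum (beyond which segmentation is certain anyway).
    static boolean shouldKillNode(long observedPauseMs, long maxPauseMs) {
        return observedPauseMs > maxPauseMs;
    }

    public static void main(String[] args) throws InterruptedException {
        long maxPauseMs = 3_000; // illustrative limit
        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();
            Thread.sleep(INTERVAL_MS);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Drift beyond the requested sleep approximates an STW pause.
            long pauseMs = elapsedMs - INTERVAL_MS;
            if (shouldKillNode(pauseMs, maxPauseMs)) {
                System.out.println("STW pause " + pauseMs + " ms exceeds limit, stopping node");
                // In a real watchdog, Runtime.getRuntime().halt(...) or an
                // external kill signal would go here.
            }
        }
        System.out.println("watchdog ok");
    }
}
```

The decision function is kept separate from the measurement loop so the kill policy (JMX notification for long pauses vs. halting for pauses past the limit) can be tested and tuned independently.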