By default, the SYSTEM_WORKER_BLOCKED failure type is not handled. I don't like this behavior, but it may be useful sometimes: "frozen" threads have a chance to become active again after the load decreases. As for SYSTEM_WORKER_TERMINATION, it's unrecoverable: there is no point in waiting for a dead thread's magical resurrection. And if, under some circumstances, a node stop leads to a cascading cluster crash, then it's a bug, and it should be fixed once and for all, instead of hiding the flaw we have in the product.
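For anyone who wants different defaults today, the failure handler is configurable per node. Below is a minimal sketch against the 2.7 failure-handling API (the class name is made up for illustration, and I'm assuming setIgnoredFailureTypes is exposed on the stock handler; please double-check against your version):

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class FailureHandlerDefaultsExample {
    public static void main(String[] args) {
        // Stock handler; out of the box it ignores SYSTEM_WORKER_BLOCKED,
        // which is the default behavior described above.
        StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();

        // To treat blocked system workers as fatal too, clear the ignored set.
        hnd.setIgnoredFailureTypes(Collections.<FailureType>emptySet());

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setFailureHandler(hnd);

        try (Ignite ignite = Ignition.start(cfg)) {
            // The node now stops on SYSTEM_WORKER_BLOCKED as well as
            // SYSTEM_WORKER_TERMINATION.
        }
    }
}

Conversely, setting a NoOpFailureHandler would restore roughly the "no handling" behavior being debated below.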
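And for the GridDhtInvalidPartitionException scenario discussed in the thread below, here is a rough sketch of the kind of custom FailureHandler Pavel suggests: keep the node alive when the failure was caused by an invalid-partition error, and fall back to the stock behavior otherwise. The class is hypothetical; since GridDhtInvalidPartitionException is an internal class whose package may vary between releases, the sketch matches it by name rather than importing it:

import org.apache.ignite.Ignite;
import org.apache.ignite.failure.FailureContext;
import org.apache.ignite.failure.FailureHandler;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class IgnoreInvalidPartitionFailureHandler implements FailureHandler {
    /** Stock handler, used for every other failure. */
    private final FailureHandler delegate = new StopNodeOrHaltFailureHandler();

    /** Returning false keeps the node alive; true invalidates it. */
    @Override public boolean onFailure(Ignite ignite, FailureContext failureCtx) {
        // Walk the cause chain; GridDhtInvalidPartitionException is internal,
        // so match by simple class name instead of importing it.
        for (Throwable t = failureCtx.error(); t != null; t = t.getCause()) {
            if ("GridDhtInvalidPartitionException".equals(t.getClass().getSimpleName())) {
                ignite.log().warning("Ignoring failure caused by an invalid " +
                    "partition (likely TTL expiration during rebalancing): " + t);

                return false;
            }
        }

        return delegate.onFailure(ignite, failureCtx);
    }
}

It plugs in via cfg.setFailureHandler(new IgnoreInvalidPartitionFailureHandler()). Of course, as argued above, the real fix is to handle the exception in GridCacheTtlManager.expire rather than to paper over it in a handler.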
Tue, Mar 26, 2019 at 09:17, Roman Shtykh <rsht...@yahoo.com.invalid>:

> +1 for having the default settings revisited.
> I understand Andrey's reasoning, but sometimes taking nodes down is too
> radical (as in my case it was GridDhtInvalidPartitionException, which could
> be ignored for a while when rebalancing <- I might be wrong here).
>
> -- Roman
>
>
> On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis Magda
> <dma...@apache.org> wrote:
>
> Nikolay,
>
> Thanks for kicking off this discussion. Surprisingly, I planned to start a
> similar one today and incidentally came across this thread.
>
> I agree that the failure handler should be off by default, or the default
> settings have to be revisited. It's true that people are complaining about
> node shutdowns even under moderate workloads. For instance, here is the
> most recent feedback related to slow checkpointing:
> https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure
>
> At a minimum, let's consider the following:
> - A failure handler needs to provide hints on how to avoid the shutdown
> in the future. Take the checkpointing SO thread above. It's unclear from
> the logs how to prevent the same situation next time (suggest parameters
> for tuning, flash drives, etc.).
> - Is there any protection against a full cluster restart? We need to
> distinguish a slow cluster from a stuck one. A node removal should not
> lead to a meltdown of the whole storage.
> - Should we enable the failure handler for things like transactions or
> PME and have it off for checkpointing and something else? Let's have it
> enabled for the cases when we are 100% certain that a node shutdown is
> the right thing, and print out warnings with suggestions whenever we're
> not confident that the removal is appropriate.
>
> --Denis
>
> On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura <ag...@apache.org> wrote:
>
> Failure handlers were introduced in order to avoid cluster hangs, and
> they kill nodes instead.
>
> If a critical worker was terminated by GridDhtInvalidPartitionException,
> then your node is unable to work anymore.
>
> An unexpected cluster shutdown, with the reasons that failure handlers
> put in the logs, is better than hanging. So the answer is NO. We mustn't
> disable failure handlers.
>
> On Mon, Mar 25, 2019 at 2:47 PM Roman Shtykh <rsht...@yahoo.com.invalid>
> wrote:
>
> > If it sticks to the behavior we had before introducing the failure
> > handler, I think it's better to have it disabled instead of killing the
> > whole cluster, as in my case, and to create a parent issue for those
> > ten bugs. Pavel, thanks for the suggestion!
> >
> >
> > On Monday, March 25, 2019, 7:07:20 p.m. GMT+9, Nikolay Izhikov
> > <nizhi...@apache.org> wrote:
> >
> > Guys.
> >
> > We should fix the SYSTEM_WORKER_TERMINATION issues once and for all.
> > It seems we have ten or more "cluster shutdown" bugs in this subsystem
> > since it was introduced.
> >
> > Should we disable it by default in 2.7.5?
> >
> >
> > Mon, Mar 25, 2019 at 13:04, Pavel Kovalenko <jokse...@gmail.com>:
> >
> > > Hi Roman,
> > >
> > > I think this InvalidPartition case can be simply handled in the
> > > GridCacheTtlManager.expire method.
> > > As a workaround, a custom FailureHandler can be configured that will
> > > not stop a node in case such an exception is thrown.
> > >
> > > Mon, Mar 25, 2019 at 08:38, Roman Shtykh <rsht...@yahoo.com.invalid>:
> > >
> > > > Igniters,
> > > >
> > > > Restarting a node while injecting data and having it expire results
> > > > in GridDhtInvalidPartitionException, which terminates nodes with
> > > > SYSTEM_WORKER_TERMINATION one by one, taking the whole cluster
> > > > down. This is really bad, and I didn't find a way to save the
> > > > cluster from disappearing. I created a JIRA issue
> > > > https://issues.apache.org/jira/browse/IGNITE-11620 with a test
> > > > case. Any clues on how to fix this inconsistency when rebalancing?
> > > >
> > > > -- Roman

--
Best regards,
Andrey Kuznetsov.