What do you think about including patches [1] and [2] in Ignite 2.7.5? Both are about the default failure handler behavior in the SYSTEM_WORKER_BLOCKED and SYSTEM_CRITICAL_OPERATION_TIMEOUT cases.

Andrey Kuznetsov, could you please check whether IGNITE-10003 depends on any other issue that isn't included in the 2.7 release?

[1] https://issues.apache.org/jira/browse/IGNITE-10154
[2] https://issues.apache.org/jira/browse/IGNITE-10003
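For reference, the policy these patches touch can also be pinned down explicitly in user code, so the node-stopping behavior does not depend on a particular release's defaults. A minimal sketch, assuming the setIgnoredFailureTypes(Set<FailureType>) property that AbstractFailureHandler exposes in later releases; the handler choice and the ignored set are illustrative, not a recommendation:

import java.util.EnumSet;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class ExplicitFailureHandlerConfig {
    public static void main(String[] args) {
        // Stop (or halt) the node on critical failures...
        StopNodeOrHaltFailureHandler failureHnd = new StopNodeOrHaltFailureHandler();

        // ...but do not treat blocked system workers or timed-out critical operations
        // as fatal. Assumes setIgnoredFailureTypes is present in the build being used.
        failureHnd.setIgnoredFailureTypes(EnumSet.of(
            FailureType.SYSTEM_WORKER_BLOCKED,
            FailureType.SYSTEM_CRITICAL_OPERATION_TIMEOUT));

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setFailureHandler(failureHnd);

        Ignition.start(cfg);
    }
}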
On Wed, Mar 27, 2019 at 8:11 AM Denis Magda <dma...@apache.org> wrote:

Folks, thanks for sharing details and inputs. This is helpful. Since I spend a lot of time working with Ignite users, I'll look into this topic in a couple of days and propose some changes. In the meantime, here is a fresh report from the user list:
http://apache-ignite-users.70518.x6.nabble.com/Triggering-Rebalancing-Programmatically-get-error-while-requesting-td27651.html

- Denis

On Tue, Mar 26, 2019 at 9:04 AM Andrey Gura <ag...@apache.org> wrote:

CleanupWorker termination can lead to the following effects:

- Queries can retrieve data that should have expired, so the application will behave incorrectly.
- Memory and/or disk can overflow because entries weren't expired.
- Performance degradation is possible due to unmanageable data set growth.

On Tue, Mar 26, 2019 at 4:58 PM Roman Shtykh <rsht...@yahoo.com.invalid> wrote:

Vyacheslav, if you are talking about this particular case I described, I believe it has no influence on PME. What could happen is the CleanupWorker thread being left dead (which is not good either). But I believe we are talking in a wider scope.

-- Roman

On Tuesday, March 26, 2019, 10:23:30 p.m. GMT+9, Vyacheslav Daradur <daradu...@gmail.com> wrote:

In general I agree with Andrey, the handler is very useful in itself. It allowed us to learn that 'GridDhtInvalidPartitionException' is not processed properly by the worker during PME.

Nikolay, look at the code: if the failure handler handles an exception, it means that the while-true loop in the worker's body has been interrupted by an unexpected exception and the thread has completed its lifecycle.

Without the failure handler, in the current case, the cluster would hang because the node is unable to participate in the PME process.

So the problem is the incorrect handling of the exception in the PME task, which should be fixed.

--
Best Regards, Vyacheslav D.
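To make the mechanism Vyacheslav describes concrete: a critical worker is essentially a while-true loop, and the failure processor is only invoked once that loop exits with an unexpected exception, i.e. when the thread is already dead. A simplified, purely illustrative sketch (class and method names are hypothetical, not Ignite's internal API):

// Illustration of the critical-worker pattern discussed above; not Ignite internals.
public class CriticalWorkerSketch implements Runnable {
    private final FailureNotifier failureNotifier;

    public CriticalWorkerSketch(FailureNotifier failureNotifier) {
        this.failureNotifier = failureNotifier;
    }

    @Override public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                processNextTask(); // Expected exceptions are handled inside the loop.
            }
        }
        catch (Throwable t) {
            // The loop is broken and the thread is about to die: all that is left
            // is to report the termination to the failure handler.
            failureNotifier.onCriticalWorkerTermination(t);
        }
    }

    private void processNextTask() {
        // Poll a queue, run PME/TTL/checkpoint work, etc.
    }

    /** Hypothetical callback standing in for the failure processor. */
    public interface FailureNotifier {
        void onCriticalWorkerTermination(Throwable cause);
    }
}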
On Tue, Mar 26, 2019 at 14:24, Andrey Kuznetsov <stku...@gmail.com> wrote:

Nikolay,

Feel free to suggest better error messages to indicate internal/critical failures. User actions in response to critical failures are rather limited: mail the user list or maybe file an issue. As for repetitive warnings, it makes sense, but it requires additional machinery to deliver such signals; merely spamming the log will not have an effect.

Anyway, when experienced committers suggest disabling failure handling and hiding existing issues, I feel as if they are pulling my leg.

Best regards,
Andrey Kuznetsov.

On Tue, Mar 26, 2019 at 13:30, Nikolay Izhikov <nizhi...@apache.org> wrote:

Andrey.

> the thread can be made non-critical, and we can restart it every time it dies

Why can't we restart a critical thread? What is the root difference between critical and non-critical threads?

> It's much simpler to catch and handle all exceptions in critical threads

I don't agree with you. We don't develop Ignite because it is simple! We must spend extra time to make it robust and resilient to failures.

> Failure handling is a last-chance tool that reveals internal Ignite errors
> 100% agree with you: overcome, but not hide.

Logging a stack trace with a proper explanation is not hiding. Killing nodes and the whole cluster is not "handling".

> As far as I see from user-list messages, our users are qualified enough to provide necessary information from their cluster-wide logs.

We shouldn't develop our product only for users who are able to read the Ignite sources to decrypt the failure reason behind "starvation in striped pool".

Some of my questions remain unanswered :) :

1. How can a user know it's an Ignite bug? Where should this bug be reported?
2. Do we log it somewhere?
3. Do we warn the user several times before shutdown?
4. "Starvation in striped pool" is not a clear error message, I think. Let's make it more specific!
5. Let's write to the user log what he or she should do to prevent this error in the future.

On Tue, Mar 26, 2019 at 12:13, Andrey Kuznetsov <stku...@gmail.com> wrote:

Nikolay,

> Why we can't restart some thread?

Technically, we can. It's just a matter of design: the thread can be made non-critical, and we can restart it every time it dies. But such a design looks poor to me. It's much simpler to catch and handle all exceptions in critical threads. Failure handling is a last-chance tool that reveals internal Ignite errors. It's not pleasant for us when users see these errors, but it's better than hiding them.

> Actually, distributed systems are designed to overcome some bugs, thread failure, node failure, for example, isn't it?

100% agree with you: overcome, but not hide.

> How user can know it's a bug? Where this bug should be reported?

As far as I see from user-list messages, our users are qualified enough to provide the necessary information from their cluster-wide logs.

--
Best regards,
Andrey Kuznetsov.

On Tue, Mar 26, 2019 at 11:19, Nikolay Izhikov <nizhi...@apache.org> wrote:

Andrey.

> As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to wait for dead thread's magical resurrection.

Why is it unrecoverable? Why can't we restart such a thread? Is there some natural limitation that prevents restarting a system thread?

Actually, distributed systems are designed to overcome bugs, thread failures, node failures, for example, aren't they?

> if under some circumstances node stop leads to cascade cluster crash, then it's a bug

How can a user know it's a bug? Where should this bug be reported? Do we log it somewhere? Do we warn the user one or several times before shutdown?
This feature is literally killing the user experience right now.

If I were a user of a product that just shut down with a poor log, I would throw this product away. Do we want that for Ignite?

From the SO discussion I see the following error message: ">>> Possible starvation in striped pool." Are you sure this message is clear to an Ignite user (not an Ignite hacker)? What should the user do to prevent this error in the future?

On Tue, 26/03/2019 at 10:10 +0300, Andrey Kuznetsov wrote:

By default, the SYSTEM_WORKER_BLOCKED failure type is not handled. I don't like this behavior, but it may be useful sometimes: "frozen" threads have a chance to become active again after the load decreases. As for SYSTEM_WORKER_TERMINATION, it's unrecoverable; there is no use waiting for a dead thread's magical resurrection. And if under some circumstances a node stop leads to a cascading cluster crash, then it's a bug, and it should be fixed. Once and for all. Instead of hiding the flaw we have in the product.

On Tue, Mar 26, 2019 at 09:17, Roman Shtykh <rsht...@yahoo.com.invalid> wrote:

+1 for having the default settings revisited. I understand Andrey's reasoning, but sometimes taking nodes down is too radical (as in my case: it was GridDhtInvalidPartitionException, which could be ignored for a while when rebalancing <- I might be wrong here).

-- Roman

On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis Magda <dma...@apache.org> wrote:

Nikolay,

Thanks for kicking off this discussion. Surprisingly, I planned to start a similar one today and incidentally came across this thread.

Agree that the failure handler should be off by default, or the default settings have to be revisited. It's true that people are complaining of node shutdowns even on moderate workloads. For instance, here's the most recent feedback related to slow checkpointing:

https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure

At a minimum, let's consider the following:

- A failure handler needs to provide hints on how to avoid the shutdown in the future. Take the checkpointing SO thread above. It's unclear from the logs how to prevent the same situation next time (suggest parameters for tuning, flash drives, etc.).
- Is there any protection against a full cluster restart? We need to distinguish a slow cluster from a stuck one. A node removal should not lead to a meltdown of the whole storage.
- Should we enable the failure handler for things like transactions or PME and have it off for checkpointing and something else? Let's have it enabled for cases where we are 100% certain that a node shutdown is the right thing, and print out warnings with suggestions whenever we're not confident that the removal is appropriate.

--Denis
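On the tuning hints Denis asks for: false-positive SYSTEM_WORKER_BLOCKED reports under heavy checkpointing load can at least be made less aggressive through existing configuration knobs. A hedged sketch of the relevant IgniteConfiguration properties; the timeout values are illustrative only, not recommendations:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class WorkerTimeoutTuning {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration()
            // Base timeout used for failure detection.
            .setFailureDetectionTimeout(30_000L)
            // How long a system worker may stay unresponsive (e.g. while a slow
            // checkpoint holds it up) before SYSTEM_WORKER_BLOCKED is reported.
            // 60 seconds is an illustrative value.
            .setSystemWorkerBlockedTimeout(60_000L)
            // Explicit handler so the policy doesn't depend on release defaults.
            .setFailureHandler(new StopNodeOrHaltFailureHandler());

        Ignition.start(cfg);
    }
}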
On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura <ag...@apache.org> wrote:

Failure handlers were introduced in order to avoid cluster hangs, and they kill nodes instead.

If a critical worker was terminated by GridDhtInvalidPartitionException, then your node is unable to work anymore.

An unexpected cluster shutdown, with the reasons that failure handlers provide in the logs, is better than hanging. So the answer is NO. We mustn't disable failure handlers.

On Mon, Mar 25, 2019 at 2:47 PM Roman Shtykh <rsht...@yahoo.com.invalid> wrote:

If it sticks to the behavior we had before introducing the failure handler, I think it's better to have it disabled instead of killing the whole cluster, as in my case, and create a parent issue for those ten bugs. Pavel, thanks for the suggestion!

On Monday, March 25, 2019, 7:07:20 p.m. GMT+9, Nikolay Izhikov <nizhi...@apache.org> wrote:

Guys.

We should fix the SYSTEM_WORKER_TERMINATION once and for all. It seems we have had ten or more "cluster shutdown" bugs in this subsystem since it was introduced.

Should we disable it by default in 2.7.5?

On Mon, Mar 25, 2019 at 13:04, Pavel Kovalenko <jokse...@gmail.com> wrote:

Hi Roman,

I think this InvalidPartition case can simply be handled in the GridCacheTtlManager.expire method. As a workaround, a custom FailureHandler can be configured that will not stop a node in case such an exception is thrown.
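A rough sketch of the workaround Pavel mentions: a custom handler that refuses to invalidate the node when the failure is caused by GridDhtInvalidPartitionException and delegates everything else to the usual stop-node behavior. Matching the cause by simple class name is an assumption made here to avoid importing the internal exception class; treat the whole thing as illustrative:

import org.apache.ignite.Ignite;
import org.apache.ignite.failure.FailureContext;
import org.apache.ignite.failure.FailureHandler;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

/** Illustrative handler: ignores failures caused by GridDhtInvalidPartitionException. */
public class IgnoreInvalidPartitionFailureHandler implements FailureHandler {
    private final FailureHandler delegate = new StopNodeOrHaltFailureHandler();

    @Override public boolean onFailure(Ignite ignite, FailureContext failureCtx) {
        for (Throwable t = failureCtx.error(); t != null; t = t.getCause()) {
            // Match by simple class name so the sketch doesn't depend on internal API.
            if ("GridDhtInvalidPartitionException".equals(t.getClass().getSimpleName())) {
                ignite.log().warning("Ignoring critical failure caused by invalid partition.",
                    failureCtx.error());

                return false; // Node is not invalidated; keep it running.
            }
        }

        // Everything else: default behavior (stop or halt the node).
        return delegate.onFailure(ignite, failureCtx);
    }
}

It would be plugged in via IgniteConfiguration.setFailureHandler(new IgnoreInvalidPartitionFailureHandler()).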
On Mon, Mar 25, 2019 at 08:38, Roman Shtykh <rsht...@yahoo.com.invalid> wrote:

Igniters,

Restarting a node while injecting data and having it expire results in GridDhtInvalidPartitionException, which terminates nodes with SYSTEM_WORKER_TERMINATION one by one, taking the whole cluster down. This is really bad, and I didn't find a way to save the cluster from disappearing. I created a JIRA issue https://issues.apache.org/jira/browse/IGNITE-11620 with a test case. Any clues how to fix this inconsistency when rebalancing?

-- Roman