Hi,

Sorry for being too formal here, but IGNITE-10003 <https://issues.apache.org/jira/browse/IGNITE-10003> is in progress.
Also, I've tried to find anything related to it in the list. According to the list, no one asked to include it.

Sincerely,
Dmitriy Pavlov

Wed, 19 Dec 2018 at 13:24, Nikolay Izhikov <nizhi...@apache.org>:

Hello, Alexey.

No, we don't include this ticket to 2.7.
Should we?

Wed, 19 Dec 2018 at 12:55, Alexey Goncharuk <alexey.goncha...@gmail.com>:

Folks, why didn't we include IGNITE-10003 in the ignite-2.7 release scope? This causes an Ignite node to be stopped by default when checkpoint read lock acquisition times out. I expect a lot of Ignite 2.7 users will be affected by this mistake.

We should at least update the documentation and make users aware of a workaround.

Thu, 25 Oct 2018 at 16:35, Alexey Goncharuk <alexey.goncha...@gmail.com>:

Andrey,

I still see that checkpoint read lock acquisition raises a CRITICAL_ERROR, which by default will shut down the local node. As far as I remember, we decided that by default a thread timeout should not trigger node failure. Now, however, it does, because we ignore SYSTEM_WORKER_BLOCKED events in default configuration.

Should we introduce another critical failure type CHECKPOINT_READ_LOCK_BLOCKED or use SYSTEM_WORKER_BLOCKED for checkpoint read lock acquire failure?

--AG

Fri, 12 Oct 2018 at 8:29, Andrey Kuznetsov <stku...@gmail.com>:

Igniters,

Now I spot blocking / long-running code arising from {{GridDhtPartitionsExchangeFuture#init}} calls in the partition-exchanger thread, see [1]. Ideally, all blocking operations along all possible code paths should be guarded implicitly from the critical failure detector to avoid the thread being considered blocked. There is a pull request [2] that provides a shallow solution.
I didn't change code outside {{GridDhtPartitionsExchangeFuture}}, otherwise it could be broken by any upcoming change. Also, I didn't touch the code runnable by threads other than partition-exchanger. So I have a number of guarded sections that are wider than they could be, and this potentially hides issues from the failure detector. Does this PR make sense? Or maybe it's better to exclude partition-exchanger from the critical threads registry altogether?

[1] https://issues.apache.org/jira/browse/IGNITE-9710
[2] https://github.com/apache/ignite/pull/4962

Fri, 28 Sep 2018 at 18:56, Maxim Muzafarov <maxmu...@gmail.com>:

Andrey, Andrey

> Thanks for being attentive! It's definitely a typo. Could you please create an issue?

I've created an issue [1] and prepared PR [2].
Please, review this change.

[1] https://issues.apache.org/jira/browse/IGNITE-9723
[2] https://github.com/apache/ignite/pull/4862

On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov <yzhda...@apache.org> wrote:

Config option + mbean access. Does that make sense?

Yakov

On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov <voze...@gridgain.com> wrote:

Then it should be a config option.

Fri, 28 Sep 2018 at 13:15, Andrey Gura <ag...@apache.org>:

Guys,

why do we need both a config option and a system property? I believe one way is enough.
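To make the config-option versus system-property question above concrete, here is a minimal sketch of how the two sources could be combined. It is an illustration only: the class name, the property name, and the precedence (explicit config value first, then system property, then default) are assumptions of this sketch and are not taken from the Ignite code base or from the tickets mentioned in the thread.

```java
import java.util.OptionalLong;

/** Hypothetical helper for this sketch; not part of the Ignite API. */
class BlockedTimeoutResolver {
    /** Hypothetical system property name, chosen for illustration only. */
    static final String TIMEOUT_PROP = "EXAMPLE_WORKER_BLOCKED_TIMEOUT";

    /**
     * Resolves the effective timeout in milliseconds.
     * Assumed precedence: explicit config value, then system property, then default.
     * A non-positive value means the liveness check is disabled (empty result).
     */
    static OptionalLong resolve(Long configured, String sysPropVal, long dfltMs) {
        long val = dfltMs;

        if (configured != null)
            val = configured;
        else if (sysPropVal != null && !sysPropVal.trim().isEmpty())
            val = Long.parseLong(sysPropVal.trim());

        return val > 0 ? OptionalLong.of(val) : OptionalLong.empty();
    }

    /** Convenience overload that reads the JVM system property. */
    static OptionalLong resolve(Long configured, long dfltMs) {
        return resolve(configured, System.getProperty(TIMEOUT_PROP), dfltMs);
    }
}
```

With this shape a user can switch the check off from either place by supplying a non-positive value, which is the ability requested in the messages quoted below.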
On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <nizhi...@apache.org> wrote:

Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737

Fixed version is 2.7.

On Fri, 28/09/2018 at 11:41 +0300, Alexey Goncharuk wrote:

Nikolay, I agree, a user should be able to disable both the thread liveness check and the checkpoint read lock timeout check from config and a system property.

Fri, 28 Sep 2018 at 11:30, Nikolay Izhikov <nizhi...@apache.org>:

Hello, Igniters.

I found that this feature can't be disabled from config.
The only way to disable it is from a JMX bean.

I think it is very dangerous: if we have some corner case or a bug in this watchdog, it can make Ignite unusable.
I propose to implement the possibility to disable this feature both from config and from JVM options.

What do you think?

On Thu, 27/09/2018 at 16:14 +0300, Andrey Kuznetsov wrote:

Maxim,

Thanks for being attentive! It's definitely a typo. Could you please create an issue?

Thu, 27 Sep 2018 at 16:00, Maxim Muzafarov <maxmu...@gmail.com>:

Folks,

I've found in `GridCachePartitionExchangeManager:2684` [1] (master branch) the exchange future wrapped with a double `blockingSectionEnd` method. Is it correct? I just want to understand this change and how I should use this in the future.

Should I file a new issue to fix this? I think the `blockingSectionBegin` method should be used here.

-------------
blockingSectionEnd();

try {
    resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
} finally {
    blockingSectionEnd();
}

[1] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684

On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <daradu...@gmail.com> wrote:

Andrey Gura, thank you for the answer!

I agree that wrapping of the 'init' method reduces the profit of the watchdog service in case of the PME worker, but in other cases, we should wrap all possible long sections in GridDhtPartitionsExchangeFuture. For example, the 'onCacheChangeRequest' method or 'cctx.affinity().onCacheChangeRequest' inside, because it may take significant time (reproducer attached).

I only want to point out a possible issue which may allow an end-user to halt the Ignite cluster accidentally.

I'm sure that PME experts know how to fix this issue properly.

On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <ag...@apache.org> wrote:

Vyacheslav,

Exchange worker is strongly tied with GridDhtPartitionsExchangeFuture#init and it is ok. Exchange worker also shouldn't be blocked for a long time, but in reality it happens. It also means that your change doesn't make sense.

What actually makes sense is identification of places which are intentionally blocking.
Maybe some places/actions should be braced by blocking guards.

If you have failing tests please make sure that your failureHandler is NoOpFailureHandler or any other handler with ignoreFailureTypes = [CRITICAL_WORKER_BLOCKED].

On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <daradu...@gmail.com> wrote:

Hi Igniters!

Thank you for this important improvement!

I've looked through the implementation and noticed that GridDhtPartitionsExchangeFuture#init has not been wrapped in a blocking section. This means it is easy to halt the node in case of long-running actions during PME, for example when we create a cache with a StoreFactory which connects to a 3rd party DB.

I'm not sure that it is the right behavior.

I filed the issue [1] and prepared the PR [2] with a reproducer and a possible fix.

Andrey, could you please look at it and confirm that it makes sense?

[1] https://issues.apache.org/jira/browse/IGNITE-9710
[2] https://github.com/apache/ignite/pull/4845

On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <stku...@gmail.com> wrote:

Denis,

I've created the ticket [1] with a short description of the functionality.

[1] https://issues.apache.org/jira/browse/IGNITE-9679

Mon, 24 Sep 2018 at 17:46, Denis Magda <dma...@apache.org>:

Andrey K. and G.,

Thanks, do we have a documentation ticket created? Prachi (copied) can help with the documentation.

--
Denis

On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <ag...@apache.org> wrote:

Andrey,

finally your change is merged to the master branch. Congratulations and thank you very much! :)

I think that the next step is a feature that will allow signaling about blocked threads to the monitoring tools via MXBean.

I hope you will continue development of this feature and provide your vision in a new JIRA issue.

On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <stku...@gmail.com> wrote:

David, Maxim!

Thanks a lot for your ideas.
Unfortunately, I can't adopt all of them right now: the scope is much broader than the scope of the change I implement. I have had a talk with a group of Ignite committers, and we agreed to complete the change as follows.
- Blocking instructions in system-critical threads which may reasonably last long should be explicitly excluded from the monitoring.
- Failure handlers should have a setting to suppress some failures on a per-failure-type basis.
According to this I have updated the implementation: [1]

[1] https://github.com/apache/ignite/pull/4089

Mon, 10 Sep 2018 at 22:35, David Harvey <syssoft...@gmail.com>:

When I've done this before, I've needed to find the oldest thread, and kill the node running that. From a language standpoint, Maxim's "without progress" is better than "heartbeat". For example, what I'm most interested in on a distributed system is which thread started the work it has not completed the earliest, and when that thread last made forward progress. You don't want to kill a node because a thread is waiting on a lock held by a thread that went off-node and has not gotten a response.
If you don't understand the dependency relationships, you will make incorrect recovery decisions.

On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <maxmu...@gmail.com> wrote:

I think we should find exact answers to these questions:
 1. What exactly is a `critical` issue?
 2. How can we find critical issues?
 3. How can we handle critical issues?

First,
 - Ignore uninterruptible actions (e.g. worker\service shutdown)
 - Long I/O operations (there should be a configurable timeout for each type of usage)
 - Infinite loops
 - Stalled\deadlocked threads (and\or too many parked threads, exclude I/O)

Second,
 - The working queue is without progress (e.g. disco, exchange queues)
 - Work hasn't been completed since the last heartbeat (checking milestones)
 - Too many system resources used by a thread for a long period of time (allocated memory, CPU)
 - Timing fields associated with each thread status exceeded a maximum time limit.

Third (not too many options here),
 - `log everything` should be the default behaviour in all these cases, since it may be difficult to find the cause after the restart.
 - Wait some interval of time and kill the hanging node (cluster should be configured stable enough)

Questions,
 - Not sure, but can workers miss their heartbeat deadlines if CPU loads up to 80%-90%? Bursts of momentary overloads can be expected behaviour as a normal part of system operations.
 - Why do we decide that critical threads should monitor each other?
   For instance, if all the tasks were blocked and unable to run, node reset would never occur. As for me, a better solution is to use a separate monitor thread or pool (maybe both with software and hardware checks) that not only checks heartbeats but monitors the other system as well.

On Mon, 10 Sep 2018 at 00:07 David Harvey <syssoft...@gmail.com> wrote:

It would be safer to restart the entire cluster than to remove the last node for a cache that should be redundant.
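The two ideas discussed above (a monitor that is separate from the monitored workers, and blocking sections that are excluded from monitoring) can be sketched roughly as follows. This is a simplified illustration, not the Ignite WorkersRegistry: the class and method names are hypothetical, and timestamps are passed in explicitly so that a separate monitor thread (or a test) decides when the check runs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical liveness registry for this sketch; not the Ignite WorkersRegistry API. */
class LivenessSketch {
    /** Per-worker state: last heartbeat time and whether a blocking section is open. */
    static final class State {
        volatile long lastBeat;
        volatile boolean inBlockingSection;

        State(long now) { lastBeat = now; }
    }

    private final Map<String, State> workers = new ConcurrentHashMap<>();

    void register(String name, long now) { workers.put(name, new State(now)); }

    /** Called by a worker whenever it makes progress. */
    void heartbeat(String name, long now) { workers.get(name).lastBeat = now; }

    /** Marks the start of an operation that may legitimately take long. */
    void blockingSectionBegin(String name) { workers.get(name).inBlockingSection = true; }

    /** Marks the end of the blocking operation and counts it as progress. */
    void blockingSectionEnd(String name, long now) {
        State s = workers.get(name);
        s.lastBeat = now;
        s.inBlockingSection = false;
    }

    /** Intended to be called from a dedicated monitor thread, not from the workers. */
    List<String> blockedWorkers(long now, long timeoutMs) {
        List<String> res = new ArrayList<>();

        for (Map.Entry<String, State> e : workers.entrySet()) {
            State s = e.getValue();

            // Workers inside a declared blocking section are not reported.
            if (!s.inBlockingSection && now - s.lastBeat > timeoutMs)
                res.add(e.getKey());
        }

        Collections.sort(res);
        return res;
    }
}
```

A worker would wrap a long call as `blockingSectionBegin(name); try { ... } finally { blockingSectionEnd(name, now); }`. What the monitor does with the returned list (log, alert, or invoke a failure handler) is deliberately left out of the sketch.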

On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <ag...@apache.org> wrote:

Hi,

I agree with Yakov that we can provide some option that manages the worker liveness checker behavior in case of observing that some worker is blocked too long.
At least it will be some workaround for cases when node failure is too annoying.

Backups count threshold sounds good but I don't understand how it will help in case of cluster hanging.

The simplest solution here is an alert in cases of blocking of some critical worker (we can improve WorkersRegistry for this purpose and expose the list of blocked workers) and optionally a call to the system configured failure processor. BTW, the failure processor can be extended in order to perform any checks (e.g. backup count) and decide whether it should stop the node or not.

On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <stku...@gmail.com> wrote:

David, Yakov, I understand your fears.
But liveness checks deal with _critical_ conditions, i.e. when such a condition is met we consider the node totally broken, and there is no sense in keeping it alive regardless of the data it contains. If we want to give it a chance, then the condition (long fsync etc.) should not be considered critical at all.

Sat, 8 Sep 2018 at 15:18, Yakov Zhdanov <yzhda...@apache.org>:

Agree with David.
We need to have an opportunity to set a backups count threshold (at runtime also!) that will not allow any automatic stop if there will be a data loss. Andrey, what do you think?

--Yakov

--
Best regards,
Andrey Kuznetsov.

--
--
Maxim Muzafarov

--
Best regards,
Andrey Kuznetsov.

--
Best regards,
Andrey Kuznetsov.

--
Best Regards, Vyacheslav D.

--
Best Regards, Vyacheslav D.

--
--
Maxim Muzafarov

--
--
Maxim Muzafarov

--
Best regards,
Andrey Kuznetsov.