Folks, why didn't we include IGNITE-10003 in the ignite-2.7 release scope? As a result, an Ignite node is stopped by default when checkpoint read lock acquisition times out. I expect a lot of Ignite 2.7 users will be affected by this mistake.
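For users who hit this, the practical workaround is on the failure-handler side: configure the handler so that the failure type raised by the watchdog does not stop the node. Below is a minimal sketch, assuming the AbstractFailureHandler.setIgnoredFailureTypes API available in 2.7; the concrete FailureType actually raised on checkpoint read lock timeout in the released build should be verified, SYSTEM_WORKER_BLOCKED is used here only for illustration.

import java.util.EnumSet;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class WatchdogWorkaround {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Keep the default stop-or-halt behaviour for genuine crashes, but do not
        // stop the node when a system worker is merely reported as blocked.
        // NOTE: which FailureType the checkpoint read lock timeout maps to in the
        // released 2.7 build must be checked; SYSTEM_WORKER_BLOCKED is illustrative.
        StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();
        hnd.setIgnoredFailureTypes(EnumSet.of(FailureType.SYSTEM_WORKER_BLOCKED));

        cfg.setFailureHandler(hnd);

        Ignition.start(cfg);
    }
}

(The class name and the choice of ignored type above are illustrative only.)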
We should at least update the documentation and make users aware of a workaround. чт, 25 окт. 2018 г. в 16:35, Alexey Goncharuk <alexey.goncha...@gmail.com>: > Andrey, > > I still see that checkpoint read lock acquisition raises a CRITICAL_ERROR, > which by default will shut down local node. As far as I remember, we > decided that by default thread timeout should not trigger node failure. > Now, however, it does, because we ignore SYSTEM_WORKER_BLOCKED events in > default configuration. > > Should we introduce another critical failure type > CHECKPOINT_READ_LOCK_BLOCKED or use SYSTEM_WORKER_BLOCKED for checkpoint > read lock acquire failure? > > --AG > > пт, 12 окт. 2018 г. в 8:29, Andrey Kuznetsov <stku...@gmail.com>: > >> Igniters, >> >> Now I spot blocking / long-running code arising from >> {{GridDhtPartitionsExchangeFuture#init}} calls in partition-exchanger >> thread, see [1]. Ideally, all blocking operations along all possible code >> paths should be guarded implicitly from critical failure detector to avoid >> the thread from being considered blocked. There is a pull request [2] that >> provides shallow solution. I didn't change code outside >> {{GridDhtPartitionsExchangeFuture}}, otherwise it could be broken by any >> upcoming change. Also, I didn't touch the code runnable by threads other >> than partition-exchanger. So I have a number of guarded sections that are >> wider than they could be, and this potentially hides issues from failure >> detector. Does this PR make sense? Or maybe it's better to exclude >> partition-exchanger from critical threads registry at all? >> >> [1] https://issues.apache.org/jira/browse/IGNITE-9710 >> [2] https://github.com/apache/ignite/pull/4962 >> >> >> пт, 28 сент. 2018 г. в 18:56, Maxim Muzafarov <maxmu...@gmail.com>: >> >> > Andrey, Andrey >> > >> > > Thanks for being attentive! It's definitely a typo. Could you please >> > create >> > > an issue? >> > >> > I've created an issue [1] and prepared PR [2]. >> > Please, review this change. >> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-9723 >> > [2] https://github.com/apache/ignite/pull/4862 >> > >> > On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov <yzhda...@apache.org> wrote: >> > >> > > Config option + mbean access. Does that make sense? >> > > >> > > Yakov >> > > >> > > On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov <voze...@gridgain.com> >> > wrote: >> > > >> > > > Then it should be config option. >> > > > >> > > > пт, 28 сент. 2018 г. в 13:15, Andrey Gura <ag...@apache.org>: >> > > > >> > > > > Guys, >> > > > > >> > > > > why we need both config option and system property? I believe one >> way >> > > is >> > > > > enough. >> > > > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov < >> > nizhi...@apache.org> >> > > > > wrote: >> > > > > > >> > > > > > Ticket created - >> https://issues.apache.org/jira/browse/IGNITE-9737 >> > > > > > >> > > > > > Fixed version is 2.7. >> > > > > > >> > > > > > В Пт, 28/09/2018 в 11:41 +0300, Alexey Goncharuk пишет: >> > > > > > > Nikolay, I agree, a user should be able to disable both thread >> > > > liveness >> > > > > > > check and checkpoint read lock timeout check from config and a >> > > system >> > > > > > > property. >> > > > > > > >> > > > > > > пт, 28 сент. 2018 г. в 11:30, Nikolay Izhikov < >> > nizhi...@apache.org >> > > >: >> > > > > > > >> > > > > > > > Hello, Igniters. >> > > > > > > > >> > > > > > > > I found that this feature can't be disabled from config. >> > > > > > > > The only way to disable it is from JMX bean. 
>> > > > > > > > >> > > > > > > > I think it very dangerous: If we have some corner case or a >> bug >> > > in >> > > > > this >> > > > > > > > Watch Dog it can make Ignite unusable. >> > > > > > > > I propose to implement possibility to disable this feature >> > both - >> > > > > from >> > > > > > > > config and from JVM options. >> > > > > > > > >> > > > > > > > What do you think? >> > > > > > > > >> > > > > > > > В Чт, 27/09/2018 в 16:14 +0300, Andrey Kuznetsov пишет: >> > > > > > > > > Maxim, >> > > > > > > > > >> > > > > > > > > Thanks for being attentive! It's definitely a typo. Could >> you >> > > > > please >> > > > > > > > >> > > > > > > > create >> > > > > > > > > an issue? >> > > > > > > > > >> > > > > > > > > чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov < >> > > > maxmu...@gmail.com >> > > > > >: >> > > > > > > > > >> > > > > > > > > > Folks, >> > > > > > > > > > >> > > > > > > > > > I've found in `GridCachePartitionExchangeManager:2684` >> [1] >> > > > > (master >> > > > > > > > >> > > > > > > > branch) >> > > > > > > > > > exchange future wrapped >> > > > > > > > > > with double `blockingSectionEnd` method. Is it correct? >> I >> > > just >> > > > > want to >> > > > > > > > > > understand this change and >> > > > > > > > > > how should I use this in the future. >> > > > > > > > > > >> > > > > > > > > > Should I file a new issue to fix this? I think here >> > > > > > > > >> > > > > > > > `blockingSectionBegin` >> > > > > > > > > > method should be used. >> > > > > > > > > > >> > > > > > > > > > ------------- >> > > > > > > > > > blockingSectionEnd(); >> > > > > > > > > > >> > > > > > > > > > try { >> > > > > > > > > > resVer = exchFut.get(exchTimeout, >> > TimeUnit.MILLISECONDS); >> > > > > > > > > > } finally { >> > > > > > > > > > blockingSectionEnd(); >> > > > > > > > > > } >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > [1] >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > >> > > > >> > > >> > >> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684 >> > > > > > > > > > >> > > > > > > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur < >> > > > > daradu...@gmail.com> >> > > > > > > > > > wrote: >> > > > > > > > > > >> > > > > > > > > > > Andrey Gura, thank you for the answer! >> > > > > > > > > > > >> > > > > > > > > > > I agree that wrapping of 'init' method reduces the >> profit >> > > of >> > > > > watchdog >> > > > > > > > > > > service in case of PME worker, but in other cases, we >> > > should >> > > > > wrap all >> > > > > > > > > > > possible long sections on >> GridDhtPartitionExchangeFuture. >> > > For >> > > > > example >> > > > > > > > > > > 'onCacheChangeRequest' method or >> > > > > > > > > > > 'cctx.affinity().onCacheChangeRequest' inside because >> it >> > > may >> > > > > take >> > > > > > > > > > > significant time (reproducer attached). >> > > > > > > > > > > >> > > > > > > > > > > I only want to point out a possible issue which may >> allow >> > > to >> > > > > end-user >> > > > > > > > > > > halt the Ignite cluster accidentally. >> > > > > > > > > > > >> > > > > > > > > > > I'm sure that PME experts know how to fix this issue >> > > > properly. 
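For readers of the snippet quoted above: the intended pairing opens the guarded section with blockingSectionBegin() before the call that may legitimately block and closes it with blockingSectionEnd() in the finally block, so the critical-workers watchdog does not treat the wait as a hang. A sketch of the corrected fragment, reusing the same variables as the quoted code (not runnable on its own, since it lives inside the exchange worker):

-------------
// Open the blocking section: waiting on the exchange future is expected to
// block, so it is excluded from the liveness check for the duration of the call.
blockingSectionBegin();

try {
    resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
}
finally {
    // Close the section so subsequent work is monitored again.
    blockingSectionEnd();
}
-------------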
>> > > > > > > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura < >> > > > ag...@apache.org >> > > > > > >> > > > > > > > >> > > > > > > > wrote: >> > > > > > > > > > > > >> > > > > > > > > > > > Vyacheslav, >> > > > > > > > > > > > >> > > > > > > > > > > > Exchange worker is strongly tied with >> > > > > > > > > > > > GridDhtPartitionExchangeFuture#init and it is ok. >> > > Exchange >> > > > > worker >> > > > > > > > >> > > > > > > > also >> > > > > > > > > > > > shouldn't be blocked for long time but in reality it >> > > > > happens.It >> > > > > > > > >> > > > > > > > also >> > > > > > > > > > > > means that your change doesn't make sense. >> > > > > > > > > > > > >> > > > > > > > > > > > What actually make sense it is identification of >> places >> > > > which >> > > > > > > > > > > > intentionally blocking. May be some places/actions >> > should >> > > > be >> > > > > > > > >> > > > > > > > braced by >> > > > > > > > > > > > blocking guards. >> > > > > > > > > > > > >> > > > > > > > > > > > If you have failing tests please make sure that your >> > > > > > > > >> > > > > > > > failureHandler is >> > > > > > > > > > > > NoOpFailureHandler or any other handler with >> > > > > ignoreFailureTypes = >> > > > > > > > > > > > [CRITICAL_WORKER_BLOCKED]. >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur < >> > > > > > > > > > >> > > > > > > > > > daradu...@gmail.com> >> > > > > > > > > > > wrote: >> > > > > > > > > > > > > >> > > > > > > > > > > > > Hi Igniters! >> > > > > > > > > > > > > >> > > > > > > > > > > > > Thank you for this important improvement! >> > > > > > > > > > > > > >> > > > > > > > > > > > > I've looked through implementation and noticed >> that >> > > > > > > > > > > > > GridDhtPartitionsExchangeFuture#init has not been >> > > wrapped >> > > > > in >> > > > > > > > >> > > > > > > > blocked >> > > > > > > > > > > > > section. This means it easy to halt the node in >> case >> > of >> > > > > > > > >> > > > > > > > longrunning >> > > > > > > > > > > > > actions during PME, for example when we create a >> > cache >> > > > with >> > > > > > > > > > > > > StoreFactrory which connect to 3rd party DB. >> > > > > > > > > > > > > >> > > > > > > > > > > > > I'm not sure that it is the right behavior. >> > > > > > > > > > > > > >> > > > > > > > > > > > > I filled the issue [1] and prepared the PR [2] >> with >> > > > > reproducer >> > > > > > > > >> > > > > > > > and >> > > > > > > > > > > >> > > > > > > > > > > possible fix. >> > > > > > > > > > > > > >> > > > > > > > > > > > > Andrey, could you please look at and confirm that >> it >> > > > makes >> > > > > sense? >> > > > > > > > > > > > > >> > > > > > > > > > > > > [1] >> > https://issues.apache.org/jira/browse/IGNITE-9710 >> > > > > > > > > > > > > [2] https://github.com/apache/ignite/pull/4845 >> > > > > > > > > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov < >> > > > > > > > >> > > > > > > > stku...@gmail.com> >> > > > > > > > > > > >> > > > > > > > > > > wrote: >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > Denis, >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > I've created the ticket [1] with short >> description >> > of >> > > > the >> > > > > > > > > > > >> > > > > > > > > > > functionality. >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > [1] >> > > https://issues.apache.org/jira/browse/IGNITE-9679 >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > пн, 24 сент. 2018 г. 
в 17:46, Denis Magda < >> > > > > dma...@apache.org>: >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Andrey K. and G., >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Thanks, do we have a documentation ticket >> > created? >> > > > > Prachi >> > > > > > > > > > >> > > > > > > > > > (copied) >> > > > > > > > > > > can help >> > > > > > > > > > > > > > > with the documentation. >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > -- >> > > > > > > > > > > > > > > Denis >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura < >> > > > > > > > >> > > > > > > > ag...@apache.org> >> > > > > > > > > > > >> > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > Andrey, >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > finally your change is merged to master >> branch. >> > > > > > > > >> > > > > > > > Congratulations >> > > > > > > > > > > >> > > > > > > > > > > and >> > > > > > > > > > > > > > > > thank you very much! :) >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > I think that the next step is feature that >> will >> > > > allow >> > > > > > > > >> > > > > > > > signal >> > > > > > > > > > > >> > > > > > > > > > > about >> > > > > > > > > > > > > > > > blocked threads to the monitoring tools via >> > > MXBean. >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > I hope you will continue development of this >> > > > feature >> > > > > and >> > > > > > > > > > >> > > > > > > > > > provide >> > > > > > > > > > > your >> > > > > > > > > > > > > > > > vision in new JIRA issue. >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey >> > Kuznetsov >> > > < >> > > > > > > > > > > >> > > > > > > > > > > stku...@gmail.com> >> > > > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > David, Maxim! >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > Thanks a lot for you ideas. >> Unfortunately, I >> > > > can't >> > > > > adopt >> > > > > > > > >> > > > > > > > all >> > > > > > > > > > > >> > > > > > > > > > > of them >> > > > > > > > > > > > > > > > right >> > > > > > > > > > > > > > > > > now: the scope is much broader than the >> scope >> > > of >> > > > > the >> > > > > > > > >> > > > > > > > change I >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > implement. >> > > > > > > > > > > > > > > > I >> > > > > > > > > > > > > > > > > have had a talk to a group of Ignite >> > commiters, >> > > > > and we >> > > > > > > > >> > > > > > > > agreed >> > > > > > > > > > > >> > > > > > > > > > > to >> > > > > > > > > > > > > > > complete >> > > > > > > > > > > > > > > > > the change as follows. >> > > > > > > > > > > > > > > > > - Blocking instructions in system-critical >> > > which >> > > > > may >> > > > > > > > > > >> > > > > > > > > > resonably >> > > > > > > > > > > last >> > > > > > > > > > > > > > > long >> > > > > > > > > > > > > > > > > should be explicitly excluded from the >> > > > monitoring. >> > > > > > > > > > > > > > > > > - Failure handlers should have a setting >> to >> > > > > suppress some >> > > > > > > > > > > >> > > > > > > > > > > failures on >> > > > > > > > > > > > > > > > > per-failure-type basis. 
>> > > > > > > > > > > > > > > > > According to this I have updated the >> > > > > implementation: [1] >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > [1] >> > https://github.com/apache/ignite/pull/4089 >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > пн, 10 сент. 2018 г. в 22:35, David >> Harvey < >> > > > > > > > > > > >> > > > > > > > > > > syssoft...@gmail.com>: >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > When I've done this before,I've needed >> to >> > > find >> > > > > the >> > > > > > > > >> > > > > > > > oldest >> > > > > > > > > > > >> > > > > > > > > > > thread, >> > > > > > > > > > > > > > > and >> > > > > > > > > > > > > > > > kill >> > > > > > > > > > > > > > > > > > the node running that. From a language >> > > > > standpoint, >> > > > > > > > > > >> > > > > > > > > > Maxim's >> > > > > > > > > > > "without >> > > > > > > > > > > > > > > > > > progress" better than "heartbeat". For >> > > > > example, what >> > > > > > > > >> > > > > > > > I'm >> > > > > > > > > > > >> > > > > > > > > > > most >> > > > > > > > > > > > > > > > interested >> > > > > > > > > > > > > > > > > > in on a distributed system is which >> thread >> > > > > started the >> > > > > > > > >> > > > > > > > work >> > > > > > > > > > > >> > > > > > > > > > > it has >> > > > > > > > > > > > > > > not >> > > > > > > > > > > > > > > > > > completed the earliest, and when did >> that >> > > > thread >> > > > > last >> > > > > > > > >> > > > > > > > make >> > > > > > > > > > > >> > > > > > > > > > > forward >> > > > > > > > > > > > > > > > > > process. You don't want to kill a >> node >> > > > > because a >> > > > > > > > >> > > > > > > > thread >> > > > > > > > > > > >> > > > > > > > > > > is >> > > > > > > > > > > > > > > waiting >> > > > > > > > > > > > > > > > on a >> > > > > > > > > > > > > > > > > > lock held by a thread that went off-node >> > and >> > > > has >> > > > > not >> > > > > > > > > > >> > > > > > > > > > gotten a >> > > > > > > > > > > > > > > response. >> > > > > > > > > > > > > > > > > > If you don't understand the dependency >> > > > > relationships, >> > > > > > > > >> > > > > > > > you >> > > > > > > > > > > >> > > > > > > > > > > will make >> > > > > > > > > > > > > > > > > > incorrect recovery decisions. >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim >> > > > Muzafarov < >> > > > > > > > > > > >> > > > > > > > > > > maxmu...@gmail.com> >> > > > > > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > I think we should find exact answers >> to >> > > these >> > > > > > > > >> > > > > > > > questions: >> > > > > > > > > > > > > > > > > > > 1. What `critical` issue exactly is? >> > > > > > > > > > > > > > > > > > > 2. How can we find critical issues? >> > > > > > > > > > > > > > > > > > > 3. How can we handle critical issues? >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > First, >> > > > > > > > > > > > > > > > > > > - Ignore uninterruptable actions >> (e.g. 
>> > > > > > > > >> > > > > > > > worker\service >> > > > > > > > > > > >> > > > > > > > > > > shutdown) >> > > > > > > > > > > > > > > > > > > - Long I/O operations (should be a >> > > > > configurable >> > > > > > > > >> > > > > > > > timeout >> > > > > > > > > > > >> > > > > > > > > > > for each >> > > > > > > > > > > > > > > > type of >> > > > > > > > > > > > > > > > > > > usage) >> > > > > > > > > > > > > > > > > > > - Infinite loops >> > > > > > > > > > > > > > > > > > > - Stalled\deadlocked threads (and\or >> too >> > > > many >> > > > > parked >> > > > > > > > > > > >> > > > > > > > > > > threads, >> > > > > > > > > > > > > > > > exclude >> > > > > > > > > > > > > > > > > > I/O) >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > Second, >> > > > > > > > > > > > > > > > > > > - The working queue is without >> progress >> > > > (e.g. >> > > > > disco, >> > > > > > > > > > > >> > > > > > > > > > > exchange >> > > > > > > > > > > > > > > > queues) >> > > > > > > > > > > > > > > > > > > - Work hasn't been completed since >> the >> > > last >> > > > > > > > >> > > > > > > > heartbeat >> > > > > > > > > > > >> > > > > > > > > > > (checking >> > > > > > > > > > > > > > > > > > > milestones) >> > > > > > > > > > > > > > > > > > > - Too many system resources used by a >> > > thread >> > > > > for the >> > > > > > > > > > >> > > > > > > > > > long >> > > > > > > > > > > period >> > > > > > > > > > > > > > > of >> > > > > > > > > > > > > > > > time >> > > > > > > > > > > > > > > > > > > (allocated memory, CPU) >> > > > > > > > > > > > > > > > > > > - Timing fields associated with each >> > > thread >> > > > > status >> > > > > > > > > > > >> > > > > > > > > > > exceeded a >> > > > > > > > > > > > > > > > maximum >> > > > > > > > > > > > > > > > > > time >> > > > > > > > > > > > > > > > > > > limit. >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > Third (not too many options here), >> > > > > > > > > > > > > > > > > > > - `log everything` should be the >> default >> > > > > behaviour >> > > > > > > > >> > > > > > > > in >> > > > > > > > > > >> > > > > > > > > > all >> > > > > > > > > > > these >> > > > > > > > > > > > > > > > cases, >> > > > > > > > > > > > > > > > > > > since it may be difficult to find the >> > cause >> > > > > after the >> > > > > > > > > > > >> > > > > > > > > > > restart. >> > > > > > > > > > > > > > > > > > > - Wait some interval of time and kill >> > the >> > > > > hanging >> > > > > > > > >> > > > > > > > node >> > > > > > > > > > > >> > > > > > > > > > > (cluster >> > > > > > > > > > > > > > > > should >> > > > > > > > > > > > > > > > > > be >> > > > > > > > > > > > > > > > > > > configured stable enough) >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > Questions, >> > > > > > > > > > > > > > > > > > > - Not sure, but can workers miss >> their >> > > > > heartbeat >> > > > > > > > > > > >> > > > > > > > > > > deadlines if CPU >> > > > > > > > > > > > > > > > loads >> > > > > > > > > > > > > > > > > > up >> > > > > > > > > > > > > > > > > > > to 80%-90%? Bursts of momentary >> overloads >> > > can >> > > > > be >> > > > > > > > > > > > > > > > > > > expected behaviour as a normal >> part >> > of >> > > > > system >> > > > > > > > > > > >> > > > > > > > > > > operations. >> > > > > > > > > > > > > > > > > > > - Why do we decide that critical >> thread >> > > > should >> > > > > > > > >> > > > > > > > monitor >> > > > > > > > > > > >> > > > > > > > > > > each other? 
>> > > > > > > > > > > > > > > > For >> > > > > > > > > > > > > > > > > > > instance, if all the tasks were >> blocked >> > and >> > > > > unable to >> > > > > > > > > > >> > > > > > > > > > run, >> > > > > > > > > > > > > > > > > > > node reset would never occur. As >> for >> > > me, >> > > > a >> > > > > better >> > > > > > > > > > > >> > > > > > > > > > > solution is >> > > > > > > > > > > > > > > to >> > > > > > > > > > > > > > > > use >> > > > > > > > > > > > > > > > > > a >> > > > > > > > > > > > > > > > > > > separate monitor thread or pool (maybe >> > both >> > > > > with >> > > > > > > > >> > > > > > > > software >> > > > > > > > > > > > > > > > > > > and hardware checks) that not only >> > > checks >> > > > > > > > >> > > > > > > > heartbeats >> > > > > > > > > > > >> > > > > > > > > > > but >> > > > > > > > > > > > > > > > monitors the >> > > > > > > > > > > > > > > > > > > other system as well. >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David >> > Harvey < >> > > > > > > > > > > >> > > > > > > > > > > syssoft...@gmail.com> >> > > > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > It would be safer to restart the >> entire >> > > > > cluster >> > > > > > > > >> > > > > > > > than to >> > > > > > > > > > > >> > > > > > > > > > > remove >> > > > > > > > > > > > > > > the >> > > > > > > > > > > > > > > > last >> > > > > > > > > > > > > > > > > > > > node for a cache that should be >> > > redundant. >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey >> > Gura >> > > < >> > > > > > > > > > > >> > > > > > > > > > > ag...@apache.org> >> > > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Hi, >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > I agree with Yakov that we can >> > provide >> > > > some >> > > > > > > > >> > > > > > > > option >> > > > > > > > > > > >> > > > > > > > > > > that manage >> > > > > > > > > > > > > > > > worker >> > > > > > > > > > > > > > > > > > > > > liveness checker behavior in case >> of >> > > > > observing >> > > > > > > > >> > > > > > > > that >> > > > > > > > > > > >> > > > > > > > > > > some worker >> > > > > > > > > > > > > > > > is >> > > > > > > > > > > > > > > > > > > > > blocked too long. >> > > > > > > > > > > > > > > > > > > > > At least it will some workaround >> for >> > > > > cases when >> > > > > > > > >> > > > > > > > node >> > > > > > > > > > > >> > > > > > > > > > > fails is >> > > > > > > > > > > > > > > > too >> > > > > > > > > > > > > > > > > > > > > annoying. >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Backups count threshold sounds >> good >> > > but I >> > > > > don't >> > > > > > > > > > > >> > > > > > > > > > > understand how >> > > > > > > > > > > > > > > it >> > > > > > > > > > > > > > > > > > will >> > > > > > > > > > > > > > > > > > > > > help in case of cluster hanging. 
>> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > The simplest solution here is >> alert >> > in >> > > > > cases of >> > > > > > > > > > > >> > > > > > > > > > > blocking of >> > > > > > > > > > > > > > > some >> > > > > > > > > > > > > > > > > > > > > critical worker (we can improve >> > > > > WorkersRegistry >> > > > > > > > >> > > > > > > > for >> > > > > > > > > > > >> > > > > > > > > > > this >> > > > > > > > > > > > > > > purpose >> > > > > > > > > > > > > > > > and >> > > > > > > > > > > > > > > > > > > > > expose list of blocked workers) >> and >> > > > > optionally >> > > > > > > > >> > > > > > > > call >> > > > > > > > > > > >> > > > > > > > > > > system >> > > > > > > > > > > > > > > > configured >> > > > > > > > > > > > > > > > > > > > > failure processor. BTW, failure >> > > processor >> > > > > can be >> > > > > > > > > > > >> > > > > > > > > > > extended in >> > > > > > > > > > > > > > > > order to >> > > > > > > > > > > > > > > > > > > > > perform any checks (e.g. backup >> > count) >> > > > and >> > > > > decide >> > > > > > > > > > > >> > > > > > > > > > > whether it >> > > > > > > > > > > > > > > > should >> > > > > > > > > > > > > > > > > > > > > stop node or not. >> > > > > > > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM >> Andrey >> > > > > Kuznetsov < >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > stku...@gmail.com> >> > > > > > > > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > David, Yakov, I understand your >> > > fears. >> > > > > But >> > > > > > > > >> > > > > > > > liveness >> > > > > > > > > > > >> > > > > > > > > > > checks >> > > > > > > > > > > > > > > deal >> > > > > > > > > > > > > > > > > > with >> > > > > > > > > > > > > > > > > > > > > > _critical_ conditions, i.e. when >> > > such a >> > > > > > > > >> > > > > > > > condition >> > > > > > > > > > >> > > > > > > > > > is >> > > > > > > > > > > met we >> > > > > > > > > > > > > > > > > > conclude >> > > > > > > > > > > > > > > > > > > > the >> > > > > > > > > > > > > > > > > > > > > > node as totally broken, and >> there >> > is >> > > no >> > > > > sense >> > > > > > > > >> > > > > > > > to >> > > > > > > > > > > >> > > > > > > > > > > keep it >> > > > > > > > > > > > > > > alive >> > > > > > > > > > > > > > > > > > > > regardless >> > > > > > > > > > > > > > > > > > > > > > the data it contains. If we >> want to >> > > > give >> > > > > it a >> > > > > > > > > > > >> > > > > > > > > > > chance, then >> > > > > > > > > > > > > > > the >> > > > > > > > > > > > > > > > > > > > condition >> > > > > > > > > > > > > > > > > > > > > > (long fsync etc.) should not >> > > considered >> > > > > as >> > > > > > > > >> > > > > > > > critical >> > > > > > > > > > > >> > > > > > > > > > > at all. >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > сб, 8 сент. 2018 г. в 15:18, >> Yakov >> > > > > Zhdanov < >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > yzhda...@apache.org>: >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > Agree with David. We need to >> have >> > > an >> > > > > > > > >> > > > > > > > opporunity >> > > > > > > > > > > >> > > > > > > > > > > set backups >> > > > > > > > > > > > > > > > count >> > > > > > > > > > > > > > > > > > > > > threshold >> > > > > > > > > > > > > > > > > > > > > > > (at runtime also!) 
that will >> not >> > > > allow >> > > > > any >> > > > > > > > > > > >> > > > > > > > > > > automatic stop >> > > > > > > > > > > > > > > if >> > > > > > > > > > > > > > > > > > there >> > > > > > > > > > > > > > > > > > > > > will be >> > > > > > > > > > > > > > > > > > > > > > > a data loss. Andrey, what do >> you >> > > > think? >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > --Yakov >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > -- >> > > > > > > > > > > > > > > > > > > > > > Best regards, >> > > > > > > > > > > > > > > > > > > > > > Andrey Kuznetsov. >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > -- >> > > > > > > > > > > > > > > > > > > -- >> > > > > > > > > > > > > > > > > > > Maxim Muzafarov >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > -- >> > > > > > > > > > > > > > > > > Best regards, >> > > > > > > > > > > > > > > > > Andrey Kuznetsov. >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > -- >> > > > > > > > > > > > > > Best regards, >> > > > > > > > > > > > > > Andrey Kuznetsov. >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > -- >> > > > > > > > > > > > > Best Regards, Vyacheslav D. >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > -- >> > > > > > > > > > > Best Regards, Vyacheslav D. >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > -- >> > > > > > > > > > -- >> > > > > > > > > > Maxim Muzafarov >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > >> > > > >> > > >> > -- >> > -- >> > Maxim Muzafarov >> > >> >> >> -- >> Best regards, >> Andrey Kuznetsov. >> >
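To illustrate the idea raised in the thread of letting the failure processor perform extra checks (e.g. a backups count threshold) before deciding whether a node should be stopped, here is a minimal sketch of a custom handler. It assumes only the public FailureHandler / FailureContext API; the safety check itself is a placeholder, not Ignite's actual logic.

import org.apache.ignite.Ignite;
import org.apache.ignite.failure.FailureContext;
import org.apache.ignite.failure.FailureHandler;
import org.apache.ignite.failure.FailureType;

public class BackupAwareFailureHandler implements FailureHandler {
    /**
     * Returning true marks the local node as invalidated (failed); built-in
     * handlers such as StopNodeFailureHandler additionally stop it.
     */
    @Override public boolean onFailure(Ignite ignite, FailureContext failureCtx) {
        // Do not fail the node on a blocked-worker report alone; just log it.
        if (failureCtx.type() == FailureType.SYSTEM_WORKER_BLOCKED) {
            ignite.log().warning("Blocked system worker reported: " + failureCtx.error());

            return false;
        }

        // Placeholder for the "backups count threshold" style check discussed in
        // the thread: a real implementation would verify that failing this node
        // cannot cause data loss before returning true.
        return true;
    }
}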