Folks, why didn't we include IGNITE-10003 in the ignite-2.7 release scope? As a result, an Ignite node is stopped by default when checkpoint read lock acquisition times out. I expect a lot of Ignite 2.7 users will be affected by this mistake.
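For users who hit this, the practical workaround is on the failure-handler side: configure the handler so that the failure type raised by the watchdog does not stop the node. Below is a minimal sketch, assuming the AbstractFailureHandler.setIgnoredFailureTypes API available in 2.7; the concrete FailureType actually raised on checkpoint read lock timeout in the released build should be verified, SYSTEM_WORKER_BLOCKED is used here only for illustration.

import java.util.EnumSet;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class WatchdogWorkaround {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Keep the default stop-or-halt behaviour for genuine crashes, but do not
        // stop the node when a system worker is merely reported as blocked.
        // NOTE: which FailureType the checkpoint read lock timeout maps to in the
        // released 2.7 build must be checked; SYSTEM_WORKER_BLOCKED is illustrative.
        StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();
        hnd.setIgnoredFailureTypes(EnumSet.of(FailureType.SYSTEM_WORKER_BLOCKED));

        cfg.setFailureHandler(hnd);

        Ignition.start(cfg);
    }
}

(The class name and the choice of ignored type above are illustrative only.)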
We should at least update the documentation and make users aware of a workaround. чт, 25 окт. 2018 г. в 16:35, Alexey Goncharuk <alexey.goncha...@gmail.com>: > Andrey, > > I still see that checkpoint read lock acquisition raises a CRITICAL_ERROR, > which by default will shut down local node. As far as I remember, we > decided that by default thread timeout should not trigger node failure. > Now, however, it does, because we ignore SYSTEM_WORKER_BLOCKED events in > default configuration. > > Should we introduce another critical failure type > CHECKPOINT_READ_LOCK_BLOCKED or use SYSTEM_WORKER_BLOCKED for checkpoint > read lock acquire failure? > > --AG > > пт, 12 окт. 2018 г. в 8:29, Andrey Kuznetsov <stku...@gmail.com>: > >> Igniters, >> >> Now I spot blocking / long-running code arising from >> {{GridDhtPartitionsExchangeFuture#init}} calls in partition-exchanger >> thread, see [1]. Ideally, all blocking operations along all possible code >> paths should be guarded implicitly from critical failure detector to avoid >> the thread from being considered blocked. There is a pull request [2] that >> provides shallow solution. I didn't change code outside >> {{GridDhtPartitionsExchangeFuture}}, otherwise it could be broken by any >> upcoming change. Also, I didn't touch the code runnable by threads other >> than partition-exchanger. So I have a number of guarded sections that are >> wider than they could be, and this potentially hides issues from failure >> detector. Does this PR make sense? Or maybe it's better to exclude >> partition-exchanger from critical threads registry at all? >> >> [1] https://issues.apache.org/jira/browse/IGNITE-9710 >> [2] https://github.com/apache/ignite/pull/4962 >> >> >> пт, 28 сент. 2018 г. в 18:56, Maxim Muzafarov <maxmu...@gmail.com>: >> >> > Andrey, Andrey >> > >> > > Thanks for being attentive! It's definitely a typo. Could you please >> > create >> > > an issue? >> > >> > I've created an issue [1] and prepared PR [2]. >> > Please, review this change. >> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-9723 >> > [2] https://github.com/apache/ignite/pull/4862 >> > >> > On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov <yzhda...@apache.org> wrote: >> > >> > > Config option + mbean access. Does that make sense? >> > > >> > > Yakov >> > > >> > > On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov <voze...@gridgain.com> >> > wrote: >> > > >> > > > Then it should be config option. >> > > > >> > > > пт, 28 сент. 2018 г. в 13:15, Andrey Gura <ag...@apache.org>: >> > > > >> > > > > Guys, >> > > > > >> > > > > why we need both config option and system property? I believe one >> way >> > > is >> > > > > enough. >> > > > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov < >> > nizhi...@apache.org> >> > > > > wrote: >> > > > > > >> > > > > > Ticket created - >> https://issues.apache.org/jira/browse/IGNITE-9737 >> > > > > > >> > > > > > Fixed version is 2.7. >> > > > > > >> > > > > > В Пт, 28/09/2018 в 11:41 +0300, Alexey Goncharuk пишет: >> > > > > > > Nikolay, I agree, a user should be able to disable both thread >> > > > liveness >> > > > > > > check and checkpoint read lock timeout check from config and a >> > > system >> > > > > > > property. >> > > > > > > >> > > > > > > пт, 28 сент. 2018 г. в 11:30, Nikolay Izhikov < >> > nizhi...@apache.org >> > > >: >> > > > > > > >> > > > > > > > Hello, Igniters. >> > > > > > > > >> > > > > > > > I found that this feature can't be disabled from config. >> > > > > > > > The only way to disable it is from JMX bean. 
>> > > > > > > > >> > > > > > > > I think it very dangerous: If we have some corner case or a >> bug >> > > in >> > > > > this >> > > > > > > > Watch Dog it can make Ignite unusable. >> > > > > > > > I propose to implement possibility to disable this feature >> > both - >> > > > > from >> > > > > > > > config and from JVM options. >> > > > > > > > >> > > > > > > > What do you think? >> > > > > > > > >> > > > > > > > В Чт, 27/09/2018 в 16:14 +0300, Andrey Kuznetsov пишет: >> > > > > > > > > Maxim, >> > > > > > > > > >> > > > > > > > > Thanks for being attentive! It's definitely a typo. Could >> you >> > > > > please >> > > > > > > > >> > > > > > > > create >> > > > > > > > > an issue? >> > > > > > > > > >> > > > > > > > > чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov < >> > > > maxmu...@gmail.com >> > > > > >: >> > > > > > > > > >> > > > > > > > > > Folks, >> > > > > > > > > > >> > > > > > > > > > I've found in `GridCachePartitionExchangeManager:2684` >> [1] >> > > > > (master >> > > > > > > > >> > > > > > > > branch) >> > > > > > > > > > exchange future wrapped >> > > > > > > > > > with double `blockingSectionEnd` method. Is it correct? >> I >> > > just >> > > > > want to >> > > > > > > > > > understand this change and >> > > > > > > > > > how should I use this in the future. >> > > > > > > > > > >> > > > > > > > > > Should I file a new issue to fix this? I think here >> > > > > > > > >> > > > > > > > `blockingSectionBegin` >> > > > > > > > > > method should be used. >> > > > > > > > > > >> > > > > > > > > > ------------- >> > > > > > > > > > blockingSectionEnd(); >> > > > > > > > > > >> > > > > > > > > > try { >> > > > > > > > > > resVer = exchFut.get(exchTimeout, >> > TimeUnit.MILLISECONDS); >> > > > > > > > > > } finally { >> > > > > > > > > > blockingSectionEnd(); >> > > > > > > > > > } >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > [1] >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > >> > > > >> > > >> > >> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684 >> > > > > > > > > > >> > > > > > > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur < >> > > > > daradu...@gmail.com> >> > > > > > > > > > wrote: >> > > > > > > > > > >> > > > > > > > > > > Andrey Gura, thank you for the answer! >> > > > > > > > > > > >> > > > > > > > > > > I agree that wrapping of 'init' method reduces the >> profit >> > > of >> > > > > watchdog >> > > > > > > > > > > service in case of PME worker, but in other cases, we >> > > should >> > > > > wrap all >> > > > > > > > > > > possible long sections on >> GridDhtPartitionExchangeFuture. >> > > For >> > > > > example >> > > > > > > > > > > 'onCacheChangeRequest' method or >> > > > > > > > > > > 'cctx.affinity().onCacheChangeRequest' inside because >> it >> > > may >> > > > > take >> > > > > > > > > > > significant time (reproducer attached). >> > > > > > > > > > > >> > > > > > > > > > > I only want to point out a possible issue which may >> allow >> > > to >> > > > > end-user >> > > > > > > > > > > halt the Ignite cluster accidentally. >> > > > > > > > > > > >> > > > > > > > > > > I'm sure that PME experts know how to fix this issue >> > > > properly. 
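For readers of the snippet quoted above: the intended pairing opens the guarded section with blockingSectionBegin() before the call that may legitimately block and closes it with blockingSectionEnd() in the finally block, so the critical-workers watchdog does not treat the wait as a hang. A sketch of the corrected fragment, reusing the same variables as the quoted code (not runnable on its own, since it lives inside the exchange worker):

-------------
// Open the blocking section: waiting on the exchange future is expected to
// block, so it is excluded from the liveness check for the duration of the call.
blockingSectionBegin();

try {
    resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
}
finally {
    // Close the section so subsequent work is monitored again.
    blockingSectionEnd();
}
-------------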
>> > > > > > > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura < >> > > > ag...@apache.org >> > > > > > >> > > > > > > > >> > > > > > > > wrote: >> > > > > > > > > > > > >> > > > > > > > > > > > Vyacheslav, >> > > > > > > > > > > > >> > > > > > > > > > > > Exchange worker is strongly tied with >> > > > > > > > > > > > GridDhtPartitionExchangeFuture#init and it is ok. >> > > Exchange >> > > > > worker >> > > > > > > > >> > > > > > > > also >> > > > > > > > > > > > shouldn't be blocked for long time but in reality it >> > > > > happens.It >> > > > > > > > >> > > > > > > > also >> > > > > > > > > > > > means that your change doesn't make sense. >> > > > > > > > > > > > >> > > > > > > > > > > > What actually make sense it is identification of >> places >> > > > which >> > > > > > > > > > > > intentionally blocking. May be some places/actions >> > should >> > > > be >> > > > > > > > >> > > > > > > > braced by >> > > > > > > > > > > > blocking guards. >> > > > > > > > > > > > >> > > > > > > > > > > > If you have failing tests please make sure that your >> > > > > > > > >> > > > > > > > failureHandler is >> > > > > > > > > > > > NoOpFailureHandler or any other handler with >> > > > > ignoreFailureTypes = >> > > > > > > > > > > > [CRITICAL_WORKER_BLOCKED]. >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur < >> > > > > > > > > > >> > > > > > > > > > daradu...@gmail.com> >> > > > > > > > > > > wrote: >> > > > > > > > > > > > > >> > > > > > > > > > > > > Hi Igniters! >> > > > > > > > > > > > > >> > > > > > > > > > > > > Thank you for this important improvement! >> > > > > > > > > > > > > >> > > > > > > > > > > > > I've looked through implementation and noticed >> that >> > > > > > > > > > > > > GridDhtPartitionsExchangeFuture#init has not been >> > > wrapped >> > > > > in >> > > > > > > > >> > > > > > > > blocked >> > > > > > > > > > > > > section. This means it easy to halt the node in >> case >> > of >> > > > > > > > >> > > > > > > > longrunning >> > > > > > > > > > > > > actions during PME, for example when we create a >> > cache >> > > > with >> > > > > > > > > > > > > StoreFactrory which connect to 3rd party DB. >> > > > > > > > > > > > > >> > > > > > > > > > > > > I'm not sure that it is the right behavior. >> > > > > > > > > > > > > >> > > > > > > > > > > > > I filled the issue [1] and prepared the PR [2] >> with >> > > > > reproducer >> > > > > > > > >> > > > > > > > and >> > > > > > > > > > > >> > > > > > > > > > > possible fix. >> > > > > > > > > > > > > >> > > > > > > > > > > > > Andrey, could you please look at and confirm that >> it >> > > > makes >> > > > > sense? >> > > > > > > > > > > > > >> > > > > > > > > > > > > [1] >> > https://issues.apache.org/jira/browse/IGNITE-9710 >> > > > > > > > > > > > > [2] https://github.com/apache/ignite/pull/4845 >> > > > > > > > > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov < >> > > > > > > > >> > > > > > > > stku...@gmail.com> >> > > > > > > > > > > >> > > > > > > > > > > wrote: >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > Denis, >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > I've created the ticket [1] with short >> description >> > of >> > > > the >> > > > > > > > > > > >> > > > > > > > > > > functionality. >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > [1] >> > > https://issues.apache.org/jira/browse/IGNITE-9679 >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > пн, 24 сент. 2018 г. 
в 17:46, Denis Magda < >> > > > > dma...@apache.org>: >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Andrey K. and G., >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Thanks, do we have a documentation ticket >> > created? >> > > > > Prachi >> > > > > > > > > > >> > > > > > > > > > (copied) >> > > > > > > > > > > can help >> > > > > > > > > > > > > > > with the documentation. >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > -- >> > > > > > > > > > > > > > > Denis >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura < >> > > > > > > > >> > > > > > > > ag...@apache.org> >> > > > > > > > > > > >> > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > Andrey, >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > finally your change is merged to master >> branch. >> > > > > > > > >> > > > > > > > Congratulations >> > > > > > > > > > > >> > > > > > > > > > > and >> > > > > > > > > > > > > > > > thank you very much! :) >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > I think that the next step is feature that >> will >> > > > allow >> > > > > > > > >> > > > > > > > signal >> > > > > > > > > > > >> > > > > > > > > > > about >> > > > > > > > > > > > > > > > blocked threads to the monitoring tools via >> > > MXBean. >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > I hope you will continue development of this >> > > > feature >> > > > > and >> > > > > > > > > > >> > > > > > > > > > provide >> > > > > > > > > > > your >> > > > > > > > > > > > > > > > vision in new JIRA issue. >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey >> > Kuznetsov >> > > < >> > > > > > > > > > > >> > > > > > > > > > > stku...@gmail.com> >> > > > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > David, Maxim! >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > Thanks a lot for you ideas. >> Unfortunately, I >> > > > can't >> > > > > adopt >> > > > > > > > >> > > > > > > > all >> > > > > > > > > > > >> > > > > > > > > > > of them >> > > > > > > > > > > > > > > > right >> > > > > > > > > > > > > > > > > now: the scope is much broader than the >> scope >> > > of >> > > > > the >> > > > > > > > >> > > > > > > > change I >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > implement. >> > > > > > > > > > > > > > > > I >> > > > > > > > > > > > > > > > > have had a talk to a group of Ignite >> > commiters, >> > > > > and we >> > > > > > > > >> > > > > > > > agreed >> > > > > > > > > > > >> > > > > > > > > > > to >> > > > > > > > > > > > > > > complete >> > > > > > > > > > > > > > > > > the change as follows. >> > > > > > > > > > > > > > > > > - Blocking instructions in system-critical >> > > which >> > > > > may >> > > > > > > > > > >> > > > > > > > > > resonably >> > > > > > > > > > > last >> > > > > > > > > > > > > > > long >> > > > > > > > > > > > > > > > > should be explicitly excluded from the >> > > > monitoring. >> > > > > > > > > > > > > > > > > - Failure handlers should have a setting >> to >> > > > > suppress some >> > > > > > > > > > > >> > > > > > > > > > > failures on >> > > > > > > > > > > > > > > > > per-failure-type basis. 
>> > > > > > > > > > > > > > > > > According to this I have updated the >> > > > > implementation: [1] >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > [1] >> > https://github.com/apache/ignite/pull/4089 >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > пн, 10 сент. 2018 г. в 22:35, David >> Harvey < >> > > > > > > > > > > >> > > > > > > > > > > syssoft...@gmail.com>: >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > When I've done this before,I've needed >> to >> > > find >> > > > > the >> > > > > > > > >> > > > > > > > oldest >> > > > > > > > > > > >> > > > > > > > > > > thread, >> > > > > > > > > > > > > > > and >> > > > > > > > > > > > > > > > kill >> > > > > > > > > > > > > > > > > > the node running that. From a language >> > > > > standpoint, >> > > > > > > > > > >> > > > > > > > > > Maxim's >> > > > > > > > > > > "without >> > > > > > > > > > > > > > > > > > progress" better than "heartbeat". For >> > > > > example, what >> > > > > > > > >> > > > > > > > I'm >> > > > > > > > > > > >> > > > > > > > > > > most >> > > > > > > > > > > > > > > > interested >> > > > > > > > > > > > > > > > > > in on a distributed system is which >> thread >> > > > > started the >> > > > > > > > >> > > > > > > > work >> > > > > > > > > > > >> > > > > > > > > > > it has >> > > > > > > > > > > > > > > not >> > > > > > > > > > > > > > > > > > completed the earliest, and when did >> that >> > > > thread >> > > > > last >> > > > > > > > >> > > > > > > > make >> > > > > > > > > > > >> > > > > > > > > > > forward >> > > > > > > > > > > > > > > > > > process. You don't want to kill a >> node >> > > > > because a >> > > > > > > > >> > > > > > > > thread >> > > > > > > > > > > >> > > > > > > > > > > is >> > > > > > > > > > > > > > > waiting >> > > > > > > > > > > > > > > > on a >> > > > > > > > > > > > > > > > > > lock held by a thread that went off-node >> > and >> > > > has >> > > > > not >> > > > > > > > > > >> > > > > > > > > > gotten a >> > > > > > > > > > > > > > > response. >> > > > > > > > > > > > > > > > > > If you don't understand the dependency >> > > > > relationships, >> > > > > > > > >> > > > > > > > you >> > > > > > > > > > > >> > > > > > > > > > > will make >> > > > > > > > > > > > > > > > > > incorrect recovery decisions. >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim >> > > > Muzafarov < >> > > > > > > > > > > >> > > > > > > > > > > maxmu...@gmail.com> >> > > > > > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > I think we should find exact answers >> to >> > > these >> > > > > > > > >> > > > > > > > questions: >> > > > > > > > > > > > > > > > > > > 1. What `critical` issue exactly is? >> > > > > > > > > > > > > > > > > > > 2. How can we find critical issues? >> > > > > > > > > > > > > > > > > > > 3. How can we handle critical issues? >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > First, >> > > > > > > > > > > > > > > > > > > - Ignore uninterruptable actions >> (e.g. 
>> > > > > > > > >> > > > > > > > worker\service >> > > > > > > > > > > >> > > > > > > > > > > shutdown) >> > > > > > > > > > > > > > > > > > > - Long I/O operations (should be a >> > > > > configurable >> > > > > > > > >> > > > > > > > timeout >> > > > > > > > > > > >> > > > > > > > > > > for each >> > > > > > > > > > > > > > > > type of >> > > > > > > > > > > > > > > > > > > usage) >> > > > > > > > > > > > > > > > > > > - Infinite loops >> > > > > > > > > > > > > > > > > > > - Stalled\deadlocked threads (and\or >> too >> > > > many >> > > > > parked >> > > > > > > > > > > >> > > > > > > > > > > threads, >> > > > > > > > > > > > > > > > exclude >> > > > > > > > > > > > > > > > > > I/O) >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > Second, >> > > > > > > > > > > > > > > > > > > - The working queue is without >> progress >> > > > (e.g. >> > > > > disco, >> > > > > > > > > > > >> > > > > > > > > > > exchange >> > > > > > > > > > > > > > > > queues) >> > > > > > > > > > > > > > > > > > > - Work hasn't been completed since >> the >> > > last >> > > > > > > > >> > > > > > > > heartbeat >> > > > > > > > > > > >> > > > > > > > > > > (checking >> > > > > > > > > > > > > > > > > > > milestones) >> > > > > > > > > > > > > > > > > > > - Too many system resources used by a >> > > thread >> > > > > for the >> > > > > > > > > > >> > > > > > > > > > long >> > > > > > > > > > > period >> > > > > > > > > > > > > > > of >> > > > > > > > > > > > > > > > time >> > > > > > > > > > > > > > > > > > > (allocated memory, CPU) >> > > > > > > > > > > > > > > > > > > - Timing fields associated with each >> > > thread >> > > > > status >> > > > > > > > > > > >> > > > > > > > > > > exceeded a >> > > > > > > > > > > > > > > > maximum >> > > > > > > > > > > > > > > > > > time >> > > > > > > > > > > > > > > > > > > limit. >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > Third (not too many options here), >> > > > > > > > > > > > > > > > > > > - `log everything` should be the >> default >> > > > > behaviour >> > > > > > > > >> > > > > > > > in >> > > > > > > > > > >> > > > > > > > > > all >> > > > > > > > > > > these >> > > > > > > > > > > > > > > > cases, >> > > > > > > > > > > > > > > > > > > since it may be difficult to find the >> > cause >> > > > > after the >> > > > > > > > > > > >> > > > > > > > > > > restart. >> > > > > > > > > > > > > > > > > > > - Wait some interval of time and kill >> > the >> > > > > hanging >> > > > > > > > >> > > > > > > > node >> > > > > > > > > > > >> > > > > > > > > > > (cluster >> > > > > > > > > > > > > > > > should >> > > > > > > > > > > > > > > > > > be >> > > > > > > > > > > > > > > > > > > configured stable enough) >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > Questions, >> > > > > > > > > > > > > > > > > > > - Not sure, but can workers miss >> their >> > > > > heartbeat >> > > > > > > > > > > >> > > > > > > > > > > deadlines if CPU >> > > > > > > > > > > > > > > > loads >> > > > > > > > > > > > > > > > > > up >> > > > > > > > > > > > > > > > > > > to 80%-90%? Bursts of momentary >> overloads >> > > can >> > > > > be >> > > > > > > > > > > > > > > > > > > expected behaviour as a normal >> part >> > of >> > > > > system >> > > > > > > > > > > >> > > > > > > > > > > operations. >> > > > > > > > > > > > > > > > > > > - Why do we decide that critical >> thread >> > > > should >> > > > > > > > >> > > > > > > > monitor >> > > > > > > > > > > >> > > > > > > > > > > each other? 
>> > > > > > > > > > > > > > > > For >> > > > > > > > > > > > > > > > > > > instance, if all the tasks were >> blocked >> > and >> > > > > unable to >> > > > > > > > > > >> > > > > > > > > > run, >> > > > > > > > > > > > > > > > > > > node reset would never occur. As >> for >> > > me, >> > > > a >> > > > > better >> > > > > > > > > > > >> > > > > > > > > > > solution is >> > > > > > > > > > > > > > > to >> > > > > > > > > > > > > > > > use >> > > > > > > > > > > > > > > > > > a >> > > > > > > > > > > > > > > > > > > separate monitor thread or pool (maybe >> > both >> > > > > with >> > > > > > > > >> > > > > > > > software >> > > > > > > > > > > > > > > > > > > and hardware checks) that not only >> > > checks >> > > > > > > > >> > > > > > > > heartbeats >> > > > > > > > > > > >> > > > > > > > > > > but >> > > > > > > > > > > > > > > > monitors the >> > > > > > > > > > > > > > > > > > > other system as well. >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David >> > Harvey < >> > > > > > > > > > > >> > > > > > > > > > > syssoft...@gmail.com> >> > > > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > It would be safer to restart the >> entire >> > > > > cluster >> > > > > > > > >> > > > > > > > than to >> > > > > > > > > > > >> > > > > > > > > > > remove >> > > > > > > > > > > > > > > the >> > > > > > > > > > > > > > > > last >> > > > > > > > > > > > > > > > > > > > node for a cache that should be >> > > redundant. >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey >> > Gura >> > > < >> > > > > > > > > > > >> > > > > > > > > > > ag...@apache.org> >> > > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Hi, >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > I agree with Yakov that we can >> > provide >> > > > some >> > > > > > > > >> > > > > > > > option >> > > > > > > > > > > >> > > > > > > > > > > that manage >> > > > > > > > > > > > > > > > worker >> > > > > > > > > > > > > > > > > > > > > liveness checker behavior in case >> of >> > > > > observing >> > > > > > > > >> > > > > > > > that >> > > > > > > > > > > >> > > > > > > > > > > some worker >> > > > > > > > > > > > > > > > is >> > > > > > > > > > > > > > > > > > > > > blocked too long. >> > > > > > > > > > > > > > > > > > > > > At least it will some workaround >> for >> > > > > cases when >> > > > > > > > >> > > > > > > > node >> > > > > > > > > > > >> > > > > > > > > > > fails is >> > > > > > > > > > > > > > > > too >> > > > > > > > > > > > > > > > > > > > > annoying. >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Backups count threshold sounds >> good >> > > but I >> > > > > don't >> > > > > > > > > > > >> > > > > > > > > > > understand how >> > > > > > > > > > > > > > > it >> > > > > > > > > > > > > > > > > > will >> > > > > > > > > > > > > > > > > > > > > help in case of cluster hanging. 
>> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > The simplest solution here is >> alert >> > in >> > > > > cases of >> > > > > > > > > > > >> > > > > > > > > > > blocking of >> > > > > > > > > > > > > > > some >> > > > > > > > > > > > > > > > > > > > > critical worker (we can improve >> > > > > WorkersRegistry >> > > > > > > > >> > > > > > > > for >> > > > > > > > > > > >> > > > > > > > > > > this >> > > > > > > > > > > > > > > purpose >> > > > > > > > > > > > > > > > and >> > > > > > > > > > > > > > > > > > > > > expose list of blocked workers) >> and >> > > > > optionally >> > > > > > > > >> > > > > > > > call >> > > > > > > > > > > >> > > > > > > > > > > system >> > > > > > > > > > > > > > > > configured >> > > > > > > > > > > > > > > > > > > > > failure processor. BTW, failure >> > > processor >> > > > > can be >> > > > > > > > > > > >> > > > > > > > > > > extended in >> > > > > > > > > > > > > > > > order to >> > > > > > > > > > > > > > > > > > > > > perform any checks (e.g. backup >> > count) >> > > > and >> > > > > decide >> > > > > > > > > > > >> > > > > > > > > > > whether it >> > > > > > > > > > > > > > > > should >> > > > > > > > > > > > > > > > > > > > > stop node or not. >> > > > > > > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM >> Andrey >> > > > > Kuznetsov < >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > stku...@gmail.com> >> > > > > > > > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > David, Yakov, I understand your >> > > fears. >> > > > > But >> > > > > > > > >> > > > > > > > liveness >> > > > > > > > > > > >> > > > > > > > > > > checks >> > > > > > > > > > > > > > > deal >> > > > > > > > > > > > > > > > > > with >> > > > > > > > > > > > > > > > > > > > > > _critical_ conditions, i.e. when >> > > such a >> > > > > > > > >> > > > > > > > condition >> > > > > > > > > > >> > > > > > > > > > is >> > > > > > > > > > > met we >> > > > > > > > > > > > > > > > > > conclude >> > > > > > > > > > > > > > > > > > > > the >> > > > > > > > > > > > > > > > > > > > > > node as totally broken, and >> there >> > is >> > > no >> > > > > sense >> > > > > > > > >> > > > > > > > to >> > > > > > > > > > > >> > > > > > > > > > > keep it >> > > > > > > > > > > > > > > alive >> > > > > > > > > > > > > > > > > > > > regardless >> > > > > > > > > > > > > > > > > > > > > > the data it contains. If we >> want to >> > > > give >> > > > > it a >> > > > > > > > > > > >> > > > > > > > > > > chance, then >> > > > > > > > > > > > > > > the >> > > > > > > > > > > > > > > > > > > > condition >> > > > > > > > > > > > > > > > > > > > > > (long fsync etc.) should not >> > > considered >> > > > > as >> > > > > > > > >> > > > > > > > critical >> > > > > > > > > > > >> > > > > > > > > > > at all. >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > сб, 8 сент. 2018 г. в 15:18, >> Yakov >> > > > > Zhdanov < >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > yzhda...@apache.org>: >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > Agree with David. We need to >> have >> > > an >> > > > > > > > >> > > > > > > > opporunity >> > > > > > > > > > > >> > > > > > > > > > > set backups >> > > > > > > > > > > > > > > > count >> > > > > > > > > > > > > > > > > > > > > threshold >> > > > > > > > > > > > > > > > > > > > > > > (at runtime also!) 
that will >> not >> > > > allow >> > > > > any >> > > > > > > > > > > >> > > > > > > > > > > automatic stop >> > > > > > > > > > > > > > > if >> > > > > > > > > > > > > > > > > > there >> > > > > > > > > > > > > > > > > > > > > will be >> > > > > > > > > > > > > > > > > > > > > > > a data loss. Andrey, what do >> you >> > > > think? >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > --Yakov >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > -- >> > > > > > > > > > > > > > > > > > > > > > Best regards, >> > > > > > > > > > > > > > > > > > > > > > Andrey Kuznetsov. >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > -- >> > > > > > > > > > > > > > > > > > > -- >> > > > > > > > > > > > > > > > > > > Maxim Muzafarov >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > -- >> > > > > > > > > > > > > > > > > Best regards, >> > > > > > > > > > > > > > > > > Andrey Kuznetsov. >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > -- >> > > > > > > > > > > > > > Best regards, >> > > > > > > > > > > > > > Andrey Kuznetsov. >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > -- >> > > > > > > > > > > > > Best Regards, Vyacheslav D. >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > -- >> > > > > > > > > > > Best Regards, Vyacheslav D. >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > -- >> > > > > > > > > > -- >> > > > > > > > > > Maxim Muzafarov >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > >> > > > >> > > >> > -- >> > -- >> > Maxim Muzafarov >> > >> >> >> -- >> Best regards, >> Andrey Kuznetsov. >> >
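To illustrate the idea raised in the thread of letting the failure processor perform extra checks (e.g. a backups count threshold) before deciding whether a node should be stopped, here is a minimal sketch of a custom handler. It assumes only the public FailureHandler / FailureContext API; the safety check itself is a placeholder, not Ignite's actual logic.

import org.apache.ignite.Ignite;
import org.apache.ignite.failure.FailureContext;
import org.apache.ignite.failure.FailureHandler;
import org.apache.ignite.failure.FailureType;

public class BackupAwareFailureHandler implements FailureHandler {
    /**
     * Returning true marks the local node as invalidated (failed); built-in
     * handlers such as StopNodeFailureHandler additionally stop it.
     */
    @Override public boolean onFailure(Ignite ignite, FailureContext failureCtx) {
        // Do not fail the node on a blocked-worker report alone; just log it.
        if (failureCtx.type() == FailureType.SYSTEM_WORKER_BLOCKED) {
            ignite.log().warning("Blocked system worker reported: " + failureCtx.error());

            return false;
        }

        // Placeholder for the "backups count threshold" style check discussed in
        // the thread: a real implementation would verify that failing this node
        // cannot cause data loss before returning true.
        return true;
    }
}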