Re: Critical worker threads liveness checking drawbacks

Dmitriy Pavlov Wed, 19 Dec 2018 02:31:55 -0800

Hi,

Sorry for being too formal here, but IGNITE-10003
<https://issues.apache.org/jira/browse/IGNITE-10003> is in progress.


Also, I searched the list for anything related to it. According to the
list, no one asked to include it.

Sincerely,
Dmitriy Pavlov

On Wed, 19 Dec 2018 at 13:24, Nikolay Izhikov <[email protected]> wrote:

> Hello, Alexey.
>
> No, we didn't include this ticket in 2.7.
> Should we?
>
> On Wed, 19 Dec 2018 at 12:55, Alexey Goncharuk <[email protected]> wrote:
>
> > Folks, why didn't we include IGNITE-10003 in the ignite-2.7 release scope?
> > This causes an Ignite node to be stopped by default when a checkpoint read
> > lock acquisition times out. I expect a lot of Ignite 2.7 users will be
> > affected by this mistake.
> >
> > We should at least update the documentation and make users aware of a
> > workaround.
> >
> > On Thu, 25 Oct 2018 at 16:35, Alexey Goncharuk <[email protected]> wrote:
> >
> > > Andrey,
> > >
> > > I still see that a checkpoint read lock acquisition timeout raises a
> > > CRITICAL_ERROR, which by default will shut down the local node. As far as
> > > I remember, we decided that by default a thread timeout should not
> > > trigger node failure. Now, however, it does, because only
> > > SYSTEM_WORKER_BLOCKED events are ignored in the default configuration.
> > >
> > > Should we introduce another critical failure type,
> > > CHECKPOINT_READ_LOCK_BLOCKED, or use SYSTEM_WORKER_BLOCKED for checkpoint
> > > read lock acquire failure?
> > >
> > > --AG
> > >
> > > On Fri, 12 Oct 2018 at 8:29, Andrey Kuznetsov <[email protected]> wrote:
> > >
> > >> Igniters,
> > >>
> > >> Now I spot blocking / long-running code arising from
> > >> {{GridDhtPartitionsExchangeFuture#init}} calls in the partition-exchanger
> > >> thread, see [1]. Ideally, all blocking operations along all possible code
> > >> paths should be guarded implicitly from the critical failure detector to
> > >> keep the thread from being considered blocked. There is a pull request [2]
> > >> that provides a shallow solution. I didn't change code outside
> > >> {{GridDhtPartitionsExchangeFuture}}, otherwise it could be broken by any
> > >> upcoming change. Also, I didn't touch the code runnable by threads other
> > >> than partition-exchanger. So I have a number of guarded sections that are
> > >> wider than they could be, and this potentially hides issues from the
> > >> failure detector. Does this PR make sense? Or maybe it's better to exclude
> > >> partition-exchanger from the critical threads registry altogether?
> > >>
> > >> [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > >> [2] https://github.com/apache/ignite/pull/4962
> > >>
> > >>
> > >> On Fri, 28 Sep 2018 at 18:56, Maxim Muzafarov <[email protected]> wrote:
> > >>
> > >> > Andrey, Andrey
> > >> >
> > >> > > Thanks for being attentive! It's definitely a typo. Could you please
> > >> > > create an issue?
> > >> >
> > >> > I've created an issue [1] and prepared a PR [2].
> > >> > Please review this change.
> > >> >
> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-9723
> > >> > [2] https://github.com/apache/ignite/pull/4862
> > >> >
> > >> > On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov <[email protected]> wrote:
> > >> >
> > >> > > Config option + mbean access. Does that make sense?
> > >> > >
> > >> > > Yakov
> > >> > >
> > >> > > On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov <[email protected]> wrote:
> > >> > >
> > >> > > > Then it should be a config option.
> > >> > > >
> > >> > > > On Fri, 28 Sep 2018 at 13:15, Andrey Gura <[email protected]> wrote:
> > >> > > >
> > >> > > > > Guys,
> > >> > > > >
> > >> > > > > why do we need both a config option and a system property? I
> > >> > > > > believe one way is enough.
> > >> > > > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <[email protected]> wrote:
> > >> > > > > >
> > >> > > > > > Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737
> > >> > > > > >
> > >> > > > > > Fix version is 2.7.
> > >> > > > > >
> > >> > > > > > On Fri, 28/09/2018 at 11:41 +0300, Alexey Goncharuk wrote:
> > >> > > > > > > Nikolay, I agree, a user should be able to disable both the
> > >> > > > > > > thread liveness check and the checkpoint read lock timeout
> > >> > > > > > > check from config and a system property.
> > >> > > > > > >
> > >> > > > > > > On Fri, 28 Sep 2018 at 11:30, Nikolay Izhikov <[email protected]> wrote:
> > >> > > > > > >
> > >> > > > > > > > Hello, Igniters.
> > >> > > > > > > >
> > >> > > > > > > > I found that this feature can't be disabled from config.
> > >> > > > > > > > The only way to disable it is from a JMX bean.
> > >> > > > > > > >
> > >> > > > > > > > I think it is very dangerous: if we have some corner case or
> > >> > > > > > > > a bug in this watchdog, it can make Ignite unusable.
> > >> > > > > > > > I propose to implement the ability to disable this feature
> > >> > > > > > > > both from config and from JVM options.
> > >> > > > > > > >
> > >> > > > > > > > What do you think?
> > >> > > > > > > >
> > >> > > > > > > > On Thu, 27/09/2018 at 16:14 +0300, Andrey Kuznetsov wrote:
> > >> > > > > > > > > Maxim,
> > >> > > > > > > > >
> > >> > > > > > > > > Thanks for being attentive! It's definitely a typo. Could
> > >> > > > > > > > > you please create an issue?
> > >> > > > > > > > >
> > >> > > > > > > > > On Thu, 27 Sep 2018 at 16:00, Maxim Muzafarov <[email protected]> wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > Folks,
> > >> > > > > > > > > >
> > >> > > > > > > > > > I've found in `GridCachePartitionExchangeManager:2684` [1]
> > >> > > > > > > > > > (master branch) an exchange future wrapped with a double
> > >> > > > > > > > > > `blockingSectionEnd` call. Is it correct? I just want to
> > >> > > > > > > > > > understand this change and how I should use it in the
> > >> > > > > > > > > > future.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Should I file a new issue to fix this? I think the
> > >> > > > > > > > > > `blockingSectionBegin` method should be used here.
> > >> > > > > > > > > >
> > >> > > > > > > > > > -------------
> > >> > > > > > > > > > blockingSectionEnd();
> > >> > > > > > > > > >
> > >> > > > > > > > > > try {
> > >> > > > > > > > > >     resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > >> > > > > > > > > > } finally {
> > >> > > > > > > > > >     blockingSectionEnd();
> > >> > > > > > > > > > }
> > >> > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > > > [1] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > >> > > > > > > > > >
> > >> > > > > > > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <[email protected]> wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > Andrey Gura, thank you for the answer!
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > I agree that wrapping the 'init' method reduces the
> > >> > > > > > > > > > > benefit of the watchdog service in the case of the PME
> > >> > > > > > > > > > > worker, but in other cases we should wrap all possible
> > >> > > > > > > > > > > long sections in GridDhtPartitionsExchangeFuture, for
> > >> > > > > > > > > > > example the 'onCacheChangeRequest' method or
> > >> > > > > > > > > > > 'cctx.affinity().onCacheChangeRequest' inside it,
> > >> > > > > > > > > > > because it may take significant time (reproducer
> > >> > > > > > > > > > > attached).
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > I only want to point out a possible issue which may
> > >> > > > > > > > > > > allow an end user to halt the Ignite cluster
> > >> > > > > > > > > > > accidentally.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > I'm sure that PME experts know how to fix this issue
> > >> > > > > > > > > > > properly.
> > >> > > > > > > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <[email protected]> wrote:
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > Vyacheslav,
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > The exchange worker is strongly tied to
> > >> > > > > > > > > > > > GridDhtPartitionsExchangeFuture#init and it is ok. The
> > >> > > > > > > > > > > > exchange worker also shouldn't be blocked for a long
> > >> > > > > > > > > > > > time, but in reality it happens. It also means that
> > >> > > > > > > > > > > > your change doesn't make sense.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > What actually makes sense is identifying the places
> > >> > > > > > > > > > > > that are intentionally blocking. Maybe some
> > >> > > > > > > > > > > > places/actions should be braced by blocking guards.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > If you have failing tests, please make sure that your
> > >> > > > > > > > > > > > failureHandler is NoOpFailureHandler or any other
> > >> > > > > > > > > > > > handler with ignoreFailureTypes =
> > >> > > > > > > > > > > > [CRITICAL_WORKER_BLOCKED].
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <[email protected]> wrote:
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Hi Igniters!
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Thank you for this important improvement!
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > I've looked through the implementation and noticed
> > >> > > > > > > > > > > > > that GridDhtPartitionsExchangeFuture#init has not
> > >> > > > > > > > > > > > > been wrapped in a blocking section. This means it is
> > >> > > > > > > > > > > > > easy to halt the node in case of long-running
> > >> > > > > > > > > > > > > actions during PME, for example when we create a
> > >> > > > > > > > > > > > > cache with a StoreFactory which connects to a
> > >> > > > > > > > > > > > > 3rd-party DB.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > I'm not sure that it is the right behavior.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > I filed the issue [1] and prepared the PR [2] with a
> > >> > > > > > > > > > > > > reproducer and a possible fix.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Andrey, could you please look at it and confirm that
> > >> > > > > > > > > > > > > it makes sense?
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > >> > > > > > > > > > > > > [2] https://github.com/apache/ignite/pull/4845
> > >> > > > > > > > > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <[email protected]> wrote:
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Denis,
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > I've created the ticket [1] with a short
> > >> > > > > > > > > > > > > > description of the functionality.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > On Mon, 24 Sep 2018 at 17:46, Denis Magda <[email protected]> wrote:
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Andrey K. and G.,
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Thanks, do we have a documentation ticket
> > >> > > > > > > > > > > > > > > created? Prachi (copied) can help with the
> > >> > > > > > > > > > > > > > > documentation.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > Denis
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <[email protected]> wrote:
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > Andrey,
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > finally your change is merged to the master
> > >> > > > > > > > > > > > > > > > branch. Congratulations and thank you very
> > >> > > > > > > > > > > > > > > > much! :)
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > I think that the next step is a feature that
> > >> > > > > > > > > > > > > > > > will allow signaling about blocked threads to
> > >> > > > > > > > > > > > > > > > the monitoring tools via MXBean.
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > I hope you will continue development of this
> > >> > > > > > > > > > > > > > > > feature and provide your vision in a new JIRA
> > >> > > > > > > > > > > > > > > > issue.
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <[email protected]> wrote:
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > David, Maxim!
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > Thanks a lot for your ideas. Unfortunately, I
> > >> > > > > > > > > > > > > > > > > can't adopt all of them right now: the scope
> > >> > > > > > > > > > > > > > > > > is much broader than the scope of the change
> > >> > > > > > > > > > > > > > > > > I implement. I have had a talk with a group
> > >> > > > > > > > > > > > > > > > > of Ignite committers, and we agreed to
> > >> > > > > > > > > > > > > > > > > complete the change as follows.
> > >> > > > > > > > > > > > > > > > > - Blocking instructions in system-critical
> > >> > > > > > > > > > > > > > > > > threads which may reasonably last long should
> > >> > > > > > > > > > > > > > > > > be explicitly excluded from the monitoring.
> > >> > > > > > > > > > > > > > > > > - Failure handlers should have a setting to
> > >> > > > > > > > > > > > > > > > > suppress some failures on a per-failure-type
> > >> > > > > > > > > > > > > > > > > basis.
> > >> > > > > > > > > > > > > > > > > According to this I have updated the
> > >> > > > > > > > > > > > > > > > > implementation: [1]
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > [1] https://github.com/apache/ignite/pull/4089
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > On Mon, 10 Sep 2018 at 22:35, David Harvey <[email protected]> wrote:
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > When I've done this before, I've needed to
> > >> > > > > > > > > > > > > > > > > > find the oldest thread, and kill the node
> > >> > > > > > > > > > > > > > > > > > running that. From a language standpoint,
> > >> > > > > > > > > > > > > > > > > > Maxim's "without progress" is better than
> > >> > > > > > > > > > > > > > > > > > "heartbeat". For example, what I'm most
> > >> > > > > > > > > > > > > > > > > > interested in on a distributed system is
> > >> > > > > > > > > > > > > > > > > > which thread started the work it has not
> > >> > > > > > > > > > > > > > > > > > completed the earliest, and when did that
> > >> > > > > > > > > > > > > > > > > > thread last make forward progress. You
> > >> > > > > > > > > > > > > > > > > > don't want to kill a node because a thread
> > >> > > > > > > > > > > > > > > > > > is waiting on a lock held by a thread that
> > >> > > > > > > > > > > > > > > > > > went off-node and has not gotten a
> > >> > > > > > > > > > > > > > > > > > response. If you don't understand the
> > >> > > > > > > > > > > > > > > > > > dependency relationships, you will make
> > >> > > > > > > > > > > > > > > > > > incorrect recovery decisions.
> > >> > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <[email protected]> wrote:
> > >> > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > I think we should find exact answers to these questions:
> > >> > > > > > > > > > > > > > > > > > >  1. What exactly is a `critical` issue?
> > >> > > > > > > > > > > > > > > > > > >  2. How can we find critical issues?
> > >> > > > > > > > > > > > > > > > > > >  3. How can we handle critical issues?
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > First,
> > >> > > > > > > > > > > > > > > > > > >  - Ignore uninterruptible actions (e.g. worker\service shutdown)
> > >> > > > > > > > > > > > > > > > > > >  - Long I/O operations (there should be a configurable timeout for each type of usage)
> > >> > > > > > > > > > > > > > > > > > >  - Infinite loops
> > >> > > > > > > > > > > > > > > > > > >  - Stalled\deadlocked threads (and\or too many parked threads, excluding I/O)
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > Second,
> > >> > > > > > > > > > > > > > > > > > >  - The working queue is without progress (e.g. disco, exchange queues)
> > >> > > > > > > > > > > > > > > > > > >  - Work hasn't been completed since the last heartbeat (checking milestones)
> > >> > > > > > > > > > > > > > > > > > >  - Too many system resources used by a thread for a long period of time (allocated memory, CPU)
> > >> > > > > > > > > > > > > > > > > > >  - Timing fields associated with each thread status exceeded a maximum time limit.
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > Third (not too many options here),
> > >> > > > > > > > > > > > > > > > > > >  - `log everything` should be the default behaviour in all these cases, since it may be difficult to find the cause after the restart.
> > >> > > > > > > > > > > > > > > > > > >  - Wait some interval of time and kill the hanging node (the cluster should be configured to be stable enough)
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > Questions,
> > >> > > > > > > > > > > > > > > > > > >  - Not sure, but can workers miss their heartbeat deadlines if CPU load goes up to 80%-90%? Bursts of momentary overloads can be
> > >> > > > > > > > > > > > > > > > > > >     expected behaviour as a normal part of system operations.
> > >> > > > > > > > > > > > > > > > > > >  - Why did we decide that critical threads should monitor each other? For instance, if all the tasks were blocked and unable to run,
> > >> > > > > > > > > > > > > > > > > > >     a node reset would never occur. As for me, a better solution is to use a separate monitor thread or pool (maybe both with software
> > >> > > > > > > > > > > > > > > > > > >     and hardware checks) that not only checks heartbeats but monitors the rest of the system as well.
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey <[email protected]> wrote:
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > It would be safer to restart the entire cluster than to
> > >> > > > > > > > > > > > > > > > > > > > remove the last node for a cache that should be redundant.
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <[email protected]> wrote:
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > Hi,
> > >> > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > I agree with Yakov that we can provide some option that
> > >> > > > > > > > > > > > > > > > > > > > > manages the worker liveness checker behavior in case of
> > >> > > > > > > > > > > > > > > > > > > > > observing that some worker is blocked too long.
> > >> > > > > > > > > > > > > > > > > > > > > At least it will be some workaround for cases when node
> > >> > > > > > > > > > > > > > > > > > > > > failure is too annoying.
> > >> > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > A backups count threshold sounds good, but I don't
> > >> > > > > > > > > > > > > > > > > > > > > understand how it will help in case of cluster hanging.
> > >> > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > The simplest solution here is to alert in cases of
> > >> > > > > > > > > > > > > > > > > > > > > blocking of some critical worker (we can improve
> > >> > > > > > > > > > > > > > > > > > > > > WorkersRegistry for this purpose and expose a list of
> > >> > > > > > > > > > > > > > > > > > > > > blocked workers) and optionally call the system-configured
> > >> > > > > > > > > > > > > > > > > > > > > failure processor. BTW, the failure processor can be
> > >> > > > > > > > > > > > > > > > > > > > > extended in order to perform any checks (e.g. backup
> > >> > > > > > > > > > > > > > > > > > > > > count) and decide whether it should stop the node or not.
> > >> > > > > > > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <[email protected]> wrote:
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > > David, Yakov, I understand your fears. But liveness checks
> > >> > > > > > > > > > > > > > > > > > > > > > deal with _critical_ conditions, i.e. when such a condition
> > >> > > > > > > > > > > > > > > > > > > > > > is met we conclude the node is totally broken, and there is
> > >> > > > > > > > > > > > > > > > > > > > > > no sense in keeping it alive regardless of the data it
> > >> > > > > > > > > > > > > > > > > > > > > > contains. If we want to give it a chance, then the
> > >> > > > > > > > > > > > > > > > > > > > > > condition (long fsync etc.) should not be considered
> > >> > > > > > > > > > > > > > > > > > > > > > critical at all.
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > > On Sat, 8 Sep 2018 at 15:18, Yakov Zhdanov <[email protected]> wrote:
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > > > Agree with David. We need to have an opportunity to set a
> > >> > > > > > > > > > > > > > > > > > > > > > > backups count threshold (at runtime also!) that will not
> > >> > > > > > > > > > > > > > > > > > > > > > > allow any automatic stop if there will be data loss.
> > >> > > > > > > > > > > > > > > > > > > > > > > Andrey, what do you think?
> > >> > > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > > > --Yakov
> > >> > > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > > > > > > > > Best regards,
> > >> > > > > > > > > > > > > > > > > > > > > >   Andrey Kuznetsov.
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > > > > > Maxim Muzafarov
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > > > Best regards,
> > >> > > > > > > > > > > > > > > > >   Andrey Kuznetsov.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > Best regards,
> > >> > > > > > > > > > > > > >   Andrey Kuznetsov.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > Best Regards, Vyacheslav D.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > --
> > >> > > > > > > > > > > Best Regards, Vyacheslav D.
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > > > --
> > >> > > > > > > > > > --
> > >> > > > > > > > > > Maxim Muzafarov
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> > --
> > >> > --
> > >> > Maxim Muzafarov
> > >> >
> > >>
> > >>
> > >> --
> > >> Best regards,
> > >>   Andrey Kuznetsov.
> > >>
> > >
> >
>
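The per-failure-type suppression discussed in the thread (a failure handler that ignores some failure types, so a blocked worker does not stop the node while a genuine critical error still does) can be sketched roughly as follows. This is a minimal, self-contained illustration and not Ignite's implementation: `SketchFailureType` and `SketchFailureHandler` are hypothetical names, and the enum constants are simply the type names mentioned in the messages above.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical failure types, named after the ones mentioned in the thread. */
enum SketchFailureType {
    CRITICAL_ERROR,
    SYSTEM_WORKER_BLOCKED,
    CHECKPOINT_READ_LOCK_BLOCKED
}

/** Hypothetical handler: stops the node unless the failure type is ignored. */
class SketchFailureHandler {
    /** Failure types that are reported but never stop the node. */
    private final Set<SketchFailureType> ignored = EnumSet.noneOf(SketchFailureType.class);

    SketchFailureHandler(Set<SketchFailureType> ignoredTypes) {
        ignored.addAll(ignoredTypes);
    }

    /** Returns true if the node should be stopped for a failure of this type. */
    boolean shouldStopNode(SketchFailureType type) {
        return !ignored.contains(type);
    }
}
```

Under these assumptions, a handler that ignores only `SYSTEM_WORKER_BLOCKED` still stops the node when a checkpoint read lock timeout is reported as `CRITICAL_ERROR`, which is the behaviour the thread objects to. Reporting the timeout under an ignored type avoids the shutdown.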

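The guarded blocking section from the quoted snippet is meant to be a begin/end pair: `blockingSectionBegin` before the potentially long call and `blockingSectionEnd` in `finally`. The sketch below shows how such a pair could keep a liveness check from flagging a worker that is blocked on purpose. It is an assumption-laden illustration, not Ignite's code: `MonitoredWorkerSketch`, `looksBlocked`, `guarded`, and the sentinel-heartbeat approach are all hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical monitored worker; a sketch, not Ignite's worker class. */
class MonitoredWorkerSketch {
    /** Sentinel heartbeat meaning "inside a blocking section, do not flag". */
    private static final long IN_BLOCKING_SECTION = Long.MAX_VALUE;

    /** Time of the last sign of progress, in milliseconds. */
    private volatile long heartbeatMs;

    MonitoredWorkerSketch(long nowMs) {
        heartbeatMs = nowMs;
    }

    /** Called by the worker whenever it makes progress. */
    void updateHeartbeat(long nowMs) {
        heartbeatMs = nowMs;
    }

    /** Marks the start of an intentionally blocking call. */
    void blockingSectionBegin() {
        heartbeatMs = IN_BLOCKING_SECTION;
    }

    /** Marks the end of the blocking call and resumes monitoring. */
    void blockingSectionEnd(long nowMs) {
        heartbeatMs = nowMs;
    }

    /** Watchdog check: no progress for longer than the timeout, outside any guard. */
    boolean looksBlocked(long nowMs, long timeoutMs) {
        long hb = heartbeatMs;
        return hb != IN_BLOCKING_SECTION && nowMs - hb > timeoutMs;
    }

    /** Runs a possibly long call inside a correctly paired guarded section. */
    <T> T guarded(Supplier<T> call, LongSupplier clock) {
        blockingSectionBegin();
        try {
            return call.get();
        }
        finally {
            blockingSectionEnd(clock.getAsLong());
        }
    }
}
```

With two `blockingSectionEnd` calls and no `blockingSectionBegin`, as in the quoted snippet, the heartbeat is never replaced by the sentinel, so a long wait on the exchange future would still be reported as a blocked worker in this model.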