Vyacheslav,

The exchange worker is strongly tied to
GridDhtPartitionsExchangeFuture#init, and that is OK. The exchange worker
also shouldn't be blocked for a long time, but in reality it happens. This
also means that your change doesn't make sense.

What actually makes sense is identifying the places which block
intentionally. Maybe some of these places/actions should be wrapped in
blocking guards.
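
For illustration, here is a minimal sketch of what such a guard could look
like inside a GridWorker subclass. It assumes the blockingSectionBegin()/
blockingSectionEnd() and updateHeartbeat() guards from the merged change;
the worker class and the store call below are made up for the example:

import org.apache.ignite.IgniteLogger;
import org.apache.ignite.internal.util.worker.GridWorker;

/** Hypothetical worker: the guarded section is excluded from blocked-worker detection. */
class ThirdPartyStoreWorker extends GridWorker {
    ThirdPartyStoreWorker(String igniteInstanceName, IgniteLogger log) {
        super(igniteInstanceName, "third-party-store-worker", log);
    }

    @Override protected void body() throws InterruptedException {
        while (!isCancelled()) {
            updateHeartbeat();

            // Everything between the guards is intentionally blocking and is
            // excluded from the liveness check.
            blockingSectionBegin();

            try {
                connectToThirdPartyDb(); // May legitimately take a long time.
            }
            finally {
                blockingSectionEnd();
            }
        }
    }

    /** Placeholder for a long I/O call, e.g. a cache store connecting to an external DB. */
    private void connectToThirdPartyDb() throws InterruptedException {
        Thread.sleep(10_000);
    }
}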

If you have failing tests, please make sure that your failure handler is
NoOpFailureHandler or any other handler with ignoredFailureTypes =
[SYSTEM_WORKER_BLOCKED].
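
For example, a test configuration could look like the following sketch. It
assumes the API from the merged change (IgniteConfiguration#setFailureHandler,
NoOpFailureHandler, AbstractFailureHandler#setIgnoredFailureTypes and
FailureType.SYSTEM_WORKER_BLOCKED); please double-check the exact names
against master:

import java.util.Collections;

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.NoOpFailureHandler;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class FailureHandlerTestConfig {
    /** Option 1: ignore all failures, so a blocked worker never stops the test node. */
    public static IgniteConfiguration noOpConfig() {
        return new IgniteConfiguration().setFailureHandler(new NoOpFailureHandler());
    }

    /** Option 2: keep the default stop/halt behaviour, but ignore blocked-worker events only. */
    public static IgniteConfiguration ignoreBlockedWorkersConfig() {
        StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();

        hnd.setIgnoredFailureTypes(Collections.singleton(FailureType.SYSTEM_WORKER_BLOCKED));

        return new IgniteConfiguration().setFailureHandler(hnd);
    }
}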


On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <daradu...@gmail.com> wrote:
>
> Hi Igniters!
>
> Thank you for this important improvement!
>
> I've looked through the implementation and noticed that
> GridDhtPartitionsExchangeFuture#init has not been wrapped in a blocking
> section. This means it is easy to halt the node in case of long-running
> actions during PME, for example when we create a cache with a
> StoreFactory which connects to a 3rd-party DB.
>
> I'm not sure that it is the right behavior.
>
> I filed the issue [1] and prepared the PR [2] with a reproducer and a
> possible fix.
>
> Andrey, could you please take a look and confirm that it makes sense?
>
> [1] https://issues.apache.org/jira/browse/IGNITE-9710
> [2] https://github.com/apache/ignite/pull/4845
> On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <stku...@gmail.com> wrote:
> >
> > Denis,
> >
> > I've created the ticket [1] with a short description of the functionality.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> >
> >
> > пн, 24 сент. 2018 г. в 17:46, Denis Magda <dma...@apache.org>:
> >
> > > Andrey K. and G.,
> > >
> > > Thanks, do we have a documentation ticket created? Prachi (copied) can 
> > > help
> > > with the documentation.
> > >
> > > --
> > > Denis
> > >
> > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <ag...@apache.org> wrote:
> > >
> > > > Andrey,
> > > >
> > > > your change is finally merged to the master branch. Congratulations and
> > > > thank you very much! :)
> > > >
> > > > I think the next step is a feature that will allow signaling about
> > > > blocked threads to monitoring tools via an MXBean.
> > > >
> > > > I hope you will continue development of this feature and provide your
> > > > vision in a new JIRA issue.
> > > >
> > > >
> > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <stku...@gmail.com>
> > > > wrote:
> > > > >
> > > > > David, Maxim!
> > > > >
> > > > > Thanks a lot for your ideas. Unfortunately, I can't adopt all of them
> > > > > right now: the scope is much broader than the scope of the change I
> > > > > implement. I have talked to a group of Ignite committers, and we agreed
> > > > > to complete the change as follows.
> > > > > - Blocking instructions in system-critical workers which may reasonably
> > > > > last long should be explicitly excluded from the monitoring.
> > > > > - Failure handlers should have a setting to suppress some failures on a
> > > > > per-failure-type basis.
> > > > > According to this, I have updated the implementation: [1]
> > > > >
> > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > >
> > > > > пн, 10 сент. 2018 г. в 22:35, David Harvey <syssoft...@gmail.com>:
> > > > >
> > > > > > When I've done this before, I've needed to find the oldest thread
> > > > > > and kill the node running that. From a language standpoint, Maxim's
> > > > > > "without progress" is better than "heartbeat". For example, what I'm
> > > > > > most interested in on a distributed system is which thread started,
> > > > > > earliest, the work it has not yet completed, and when that thread
> > > > > > last made forward progress. You don't want to kill a node because a
> > > > > > thread is waiting on a lock held by a thread that went off-node and
> > > > > > has not gotten a response. If you don't understand the dependency
> > > > > > relationships, you will make incorrect recovery decisions.
> > > > > >
> > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <maxmu...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > I think we should find exact answers to these questions:
> > > > > > >  1. What exactly is a `critical` issue?
> > > > > > >  2. How can we find critical issues?
> > > > > > >  3. How can we handle critical issues?
> > > > > > >
> > > > > > > First,
> > > > > > >  - Ignore uninterruptible actions (e.g. worker\service shutdown)
> > > > > > >  - Long I/O operations (there should be a configurable timeout for
> > > > > > > each type of usage)
> > > > > > >  - Infinite loops
> > > > > > >  - Stalled\deadlocked threads (and\or too many parked threads,
> > > > > > > excluding I/O)
> > > > > > >
> > > > > > > Second,
> > > > > > >  - The working queue is without progress (e.g. disco, exchange
> > > > > > > queues)
> > > > > > >  - Work hasn't been completed since the last heartbeat (checking
> > > > > > > milestones)
> > > > > > >  - Too many system resources used by a thread for a long period of
> > > > > > > time (allocated memory, CPU)
> > > > > > >  - Timing fields associated with each thread status exceeded a
> > > > > > > maximum time limit.
> > > > > > >
> > > > > > > Third (not too many options here),
> > > > > > >  - `log everything` should be the default behaviour in all these
> > > > > > > cases, since it may be difficult to find the cause after a restart.
> > > > > > >  - Wait some interval of time and kill the hanging node (the
> > > > > > > cluster should be configured to be stable enough)
> > > > > > >
> > > > > > > Questions,
> > > > > > >  - Not sure, but can workers miss their heartbeat deadlines if the
> > > > > > > CPU loads up to 80%-90%? Bursts of momentary overload can be
> > > > > > > expected behaviour as a normal part of system operation.
> > > > > > >  - Why do we decide that critical threads should monitor each
> > > > > > > other? For instance, if all the tasks were blocked and unable to
> > > > > > > run, a node reset would never occur. As for me, a better solution
> > > > > > > is to use a separate monitor thread or pool (maybe both with
> > > > > > > software and hardware checks) that not only checks heartbeats but
> > > > > > > monitors the rest of the system as well.
> > > > > > >
> > > > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey <syssoft...@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > > It would be safer to restart the entire cluster than to remove
> > > > > > > > the last node for a cache that should be redundant.
> > > > > > > >
> > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <ag...@apache.org>
> > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I agree with Yakov that we can provide some option that manages
> > > > > > > > > the worker liveness checker behavior in case of observing that
> > > > > > > > > some worker is blocked too long.
> > > > > > > > > At least it will be some workaround for cases when a node
> > > > > > > > > failing is too annoying.
> > > > > > > > >
> > > > > > > > > A backups count threshold sounds good, but I don't understand
> > > > > > > > > how it will help in case of cluster hanging.
> > > > > > > > >
> > > > > > > > > The simplest solution here is an alert in case of blocking of
> > > > > > > > > some critical worker (we can improve WorkersRegistry for this
> > > > > > > > > purpose and expose a list of blocked workers) and optionally
> > > > > > > > > call the system-configured failure processor. BTW, the failure
> > > > > > > > > processor can be extended in order to perform any checks (e.g.
> > > > > > > > > backup count) and decide whether it should stop the node or not.
> > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <
> > > > stku...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > David, Yakov, I understand your fears. But liveness checks
> > > > > > > > > > deal with _critical_ conditions, i.e. when such a condition
> > > > > > > > > > is met we consider the node totally broken, and there is no
> > > > > > > > > > sense in keeping it alive regardless of the data it contains.
> > > > > > > > > > If we want to give it a chance, then the condition (long
> > > > > > > > > > fsync etc.) should not be considered critical at all.
> > > > > > > > > >
> > > > > > > > > > сб, 8 сент. 2018 г. в 15:18, Yakov Zhdanov <
> > > > yzhda...@apache.org>:
> > > > > > > > > >
> > > > > > > > > > > Agree with David. We need to have an opportunity to set a
> > > > > > > > > > > backups count threshold (at runtime also!) that will not
> > > > > > > > > > > allow any automatic stop if there will be a data loss.
> > > > > > > > > > > Andrey, what do you think?
> > > > > > > > > > >
> > > > > > > > > > > --Yakov
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best regards,
> > > > > > > > > >   Andrey Kuznetsov.
> > > > > > > > >
> > > > > > > >
> > > > > > > --
> > > > > > > --
> > > > > > > Maxim Muzafarov
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >   Andrey Kuznetsov.
> > > >
> > >
> >
> >
> > --
> > Best regards,
> >   Andrey Kuznetsov.
>
>
>
> --
> Best Regards, Vyacheslav D.
