Re: GridDhtInvalidPartitionException takes the cluster down

Andrey Kuznetsov Tue, 26 Mar 2019 02:14:04 -0700

Nikolay,

>  Why can't we restart some thread?
Technically, we can. It's just a matter of design: the thread can be made
non-critical, and we can restart it every time it dies. But such a design
looks poor to me. It's much simpler to catch and handle all exceptions in
critical threads. Failure handling is a last-chance tool that reveals
internal Ignite errors. It's not pleasant for us when users see these
errors, but it's better than hiding them.
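
To illustrate "catch and handle all exceptions in critical threads", here is
a minimal plain-Java sketch. It is not Ignite code: CriticalWorkerSketch and
its methods are hypothetical names, and a real worker would log the error and
decide whether it is safe to continue rather than just record it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a critical worker; not an Ignite class. */
final class CriticalWorkerSketch {
    /** Errors handled inside the loop instead of killing the thread. */
    private final List<RuntimeException> handled = new ArrayList<>();

    /** Number of tasks that completed normally. */
    private int processed;

    /**
     * Runs every task in order. A failure in one task is recorded and the
     * loop continues, so the worker itself never terminates unexpectedly.
     */
    void runAll(List<Runnable> tasks) {
        for (Runnable task : tasks) {
            try {
                task.run();
                processed++;
            }
            catch (RuntimeException e) {
                // Handle (here: only record) instead of letting the thread die.
                handled.add(e);
            }
        }
    }

    int processed() {
        return processed;
    }

    List<RuntimeException> handled() {
        return handled;
    }
}
```

With this shape, one failing task does not stop the tasks queued after it,
and the recorded errors stay visible instead of being hidden.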


>  Actually, distributed systems are designed to overcome some bugs, thread
>  failure, node failure, for example, aren't they?
100% agree with you: overcome, but not hide.

>  How can a user know it's a bug? Where should this bug be reported?
As far as I can see from user-list messages, our users are qualified enough
to provide the necessary information from their cluster-wide logs.


On Tue, 26 Mar 2019 at 11:19, Nikolay Izhikov <[email protected]> wrote:

> Andrey.
>
> > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to
> > wait for dead thread's magical resurrection.
>
> Why is it unrecoverable?
> Why can't we restart some thread?
> Is there some kind of natural limitation that prevents restarting a
> system thread?
>
> Actually, distributed systems are designed to overcome some bugs, thread
> failure, node failure, for example, aren't they?
>
> > if under some circumstances node stop leads to cascade cluster crash,
> > then it's a bug
>
> How can a user know it's a bug? Where should this bug be reported?
> Do we log it somewhere?
> Do we warn the user before shutdown, one or several times?
>
> This feature is literally killing the user experience right now.
>
> If I were a user of a product that just shut down with a poor log, I
> would throw this product away.
> Do we want that for Ignite?
>
> From the SO discussion I see the following error message: ": >>> Possible
> starvation in striped pool."
> Are you sure this message is clear to an Ignite user (not an Ignite hacker)?
> What should a user do to prevent this error in the future?
>
> On Tue, 26/03/2019 at 10:10 +0300, Andrey Kuznetsov wrote:
> > By default, SYSTEM_WORKER_BLOCKED failure type is not handled. I don't
> > like this behavior, but it may be useful sometimes: "frozen" threads
> > have a chance to become active again after load decreases. As for
> > SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to wait
> > for dead thread's magical resurrection. Then, if under some
> > circumstances node stop leads to cascade cluster crash, then it's a bug,
> > and it should be fixed. Once and for all. Instead of hiding the flaw we
> > have in the product.
> >
> > On Tue, 26 Mar 2019 at 09:17, Roman Shtykh <[email protected]> wrote:
> >
> > > +1 for having the default settings revisited.
> > > I understand Andrey's reasoning, but sometimes taking nodes down is
> > > too radical (as in my case it was GridDhtInvalidPartitionException,
> > > which could be ignored for a while when rebalancing <- I might be
> > > wrong here).
> > >
> > > -- Roman
> > >
> > >
> > >     On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis Magda <
> > > [email protected]> wrote:
> > >
> > > Nikolay,
> > > Thanks for kicking off this discussion. Surprisingly, I planned to
> > > start a similar one today and incidentally came across this thread.
> > > Agree that the failure handler should be off by default or the default
> > > settings have to be revisited. It's true that people are complaining
> > > about node shutdowns even on moderate workloads. For instance, here is
> > > the most recent feedback, related to slow checkpointing:
> > >
> > > https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure
> > >
> > > At a minimum, let's consider the following:
> > >    - A failure handler needs to provide hints on how to avoid the
> > > shutdown in the future. Take the checkpointing SO thread above. It's
> > > unclear from the logs how to prevent the same situation next time
> > > (suggest parameters for tuning, flash drives, etc).
> > >    - Is there any protection against a full cluster restart? We need to
> > > distinguish a slow cluster from a stuck one. A node removal should
> > > not lead to a meltdown of the whole storage.
> > >    - Should we enable the failure handler for things like transactions
> > > or PME and have it off for checkpointing and something else? Let's have
> > > it enabled for cases when we are 100% certain that a node shutdown is
> > > the right thing and print out warnings with suggestions whenever we're
> > > not confident that the removal is appropriate.
> > > --Denis
> > >
> > > On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura <[email protected]> wrote:
> > >
> > > Failure handlers were introduced in order to avoid cluster hanging and
> > > they kill nodes instead.
> > >
> > > If a critical worker was terminated by GridDhtInvalidPartitionException
> > > then your node is unable to work anymore.
> > >
> > > An unexpected cluster shutdown with the reasons in logs that failure
> > > handlers provide is better than hanging. So the answer is NO. We
> > > mustn't disable failure handlers.
> > >
> > > On Mon, Mar 25, 2019 at 2:47 PM Roman Shtykh <[email protected]> wrote:
> > > >
> > > > If it sticks to the behavior we had before introducing the failure
> > > > handler, I think it's better to have it disabled instead of killing
> > > > the whole cluster, as in my case, and create a parent issue for those
> > > > ten bugs. Pavel, thanks for the suggestion!
> > > >
> > > >
> > > >
> > > >     On Monday, March 25, 2019, 7:07:20 p.m. GMT+9, Nikolay Izhikov <[email protected]> wrote:
> > > >
> > > >  Guys.
> > > >
> > > > We should fix SYSTEM_WORKER_TERMINATION once and for all.
> > > > It seems we have had ten or more "cluster shutdown" bugs with this
> > > > subsystem since it was introduced.
> > > >
> > > > Should we disable it by default in 2.7.5?
> > > >
> > > >
> > > > On Mon, 25 Mar 2019 at 13:04, Pavel Kovalenko <[email protected]> wrote:
> > > >
> > > > > Hi Roman,
> > > > >
> > > > > I think this InvalidPartition case can simply be handled in the
> > > > > GridCacheTtlManager.expire method.
> > > > > As a workaround, a custom FailureHandler can be configured that
> > > > > will not stop a node when such an exception is thrown.
> > > > >
> > > > > On Mon, 25 Mar 2019 at 08:38, Roman Shtykh <[email protected]> wrote:
> > > > >
> > > > > > Igniters,
> > > > > >
> > > > > > Restarting a node while injecting data and having it expire
> > > > > > results in GridDhtInvalidPartitionException, which terminates
> > > > > > nodes with SYSTEM_WORKER_TERMINATION one by one, taking the whole
> > > > > > cluster down. This is really bad and I didn't find a way to save
> > > > > > the cluster from disappearing.
> > > > > > I created a JIRA issue
> > > > > > https://issues.apache.org/jira/browse/IGNITE-11620
> > > > > > with a test case. Any clues how to fix this inconsistency when
> > > > > > rebalancing?
> > > > > >
> > > > > > -- Roman
> > > > > >
> > >
> > >
> >
> >
> >
>


-- 
Best regards,
  Andrey Kuznetsov.
