By default, the SYSTEM_WORKER_BLOCKED failure type is not handled. I don't like this behavior, but it may be useful sometimes: "frozen" threads have a chance to become active again after the load decreases. As for SYSTEM_WORKER_TERMINATION, it's unrecoverable: there is no point in waiting for a dead thread's magical resurrection. And if, under some circumstances, a node stop leads to a cascading cluster crash, then it's a bug, and it should be fixed once and for all, instead of hiding the flaw we have in the product.
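For anyone who wants different defaults today, the failure handler is configurable per node. Below is a minimal sketch against the 2.7 failure-handling API (the class name is made up for illustration, and I'm assuming setIgnoredFailureTypes is exposed on the stock handler; please double-check against your version):

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class FailureHandlerDefaultsExample {
    public static void main(String[] args) {
        // Stock handler; out of the box it ignores SYSTEM_WORKER_BLOCKED,
        // which is the default behavior described above.
        StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();

        // To treat blocked system workers as fatal too, clear the ignored set.
        hnd.setIgnoredFailureTypes(Collections.<FailureType>emptySet());

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setFailureHandler(hnd);

        try (Ignite ignite = Ignition.start(cfg)) {
            // The node now stops on SYSTEM_WORKER_BLOCKED as well as
            // SYSTEM_WORKER_TERMINATION.
        }
    }
}

Conversely, setting a NoOpFailureHandler would restore roughly the "no handling" behavior being debated below.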
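And for the GridDhtInvalidPartitionException scenario discussed in the thread below, here is a rough sketch of the kind of custom FailureHandler Pavel suggests: keep the node alive when the failure was caused by an invalid-partition error, and fall back to the stock behavior otherwise. The class is hypothetical; since GridDhtInvalidPartitionException is an internal class whose package may vary between releases, the sketch matches it by name rather than importing it:

import org.apache.ignite.Ignite;
import org.apache.ignite.failure.FailureContext;
import org.apache.ignite.failure.FailureHandler;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class IgnoreInvalidPartitionFailureHandler implements FailureHandler {
    /** Stock handler, used for every other failure. */
    private final FailureHandler delegate = new StopNodeOrHaltFailureHandler();

    /** Returning false keeps the node alive; true invalidates it. */
    @Override public boolean onFailure(Ignite ignite, FailureContext failureCtx) {
        // Walk the cause chain; GridDhtInvalidPartitionException is internal,
        // so match by simple class name instead of importing it.
        for (Throwable t = failureCtx.error(); t != null; t = t.getCause()) {
            if ("GridDhtInvalidPartitionException".equals(t.getClass().getSimpleName())) {
                ignite.log().warning("Ignoring failure caused by an invalid " +
                    "partition (likely TTL expiration during rebalancing): " + t);

                return false;
            }
        }

        return delegate.onFailure(ignite, failureCtx);
    }
}

It plugs in via cfg.setFailureHandler(new IgnoreInvalidPartitionFailureHandler()). Of course, as argued above, the real fix is to handle the exception in GridCacheTtlManager.expire rather than to paper over it in a handler.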
Tue, Mar 26, 2019 at 09:17, Roman Shtykh <rsht...@yahoo.com.invalid>:

> +1 for having the default settings revisited.
> I understand Andrey's reasoning, but sometimes taking nodes down is too
> radical (as in my case it was GridDhtInvalidPartitionException, which could
> be ignored for a while when rebalancing <- I might be wrong here).
>
> -- Roman
>
>
> On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis Magda
> <dma...@apache.org> wrote:
>
> Nikolay,
>
> Thanks for kicking off this discussion. Surprisingly, I planned to start a
> similar one today and incidentally came across this thread.
>
> I agree that the failure handler should be off by default, or the default
> settings have to be revisited. It's true that people are complaining about
> node shutdowns even under moderate workloads. For instance, here is the
> most recent feedback related to slow checkpointing:
> https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure
>
> At a minimum, let's consider the following:
> - A failure handler needs to provide hints on how to avoid the shutdown
> in the future. Take the checkpointing SO thread above. It's unclear from
> the logs how to prevent the same situation next time (suggest parameters
> for tuning, flash drives, etc.).
> - Is there any protection against a full cluster restart? We need to
> distinguish a slow cluster from a stuck one. A node removal should not
> lead to a meltdown of the whole storage.
> - Should we enable the failure handler for things like transactions or
> PME and have it off for checkpointing and something else? Let's have it
> enabled for the cases when we are 100% certain that a node shutdown is
> the right thing, and print out warnings with suggestions whenever we're
> not confident that the removal is appropriate.
>
> --Denis
>
> On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura <ag...@apache.org> wrote:
>
> Failure handlers were introduced in order to avoid cluster hangs, and
> they kill nodes instead.
>
> If a critical worker was terminated by GridDhtInvalidPartitionException,
> then your node is unable to work anymore.
>
> An unexpected cluster shutdown, with the reasons that failure handlers
> put in the logs, is better than hanging. So the answer is NO. We mustn't
> disable failure handlers.
>
> On Mon, Mar 25, 2019 at 2:47 PM Roman Shtykh <rsht...@yahoo.com.invalid>
> wrote:
>
> > If it sticks to the behavior we had before introducing the failure
> > handler, I think it's better to have it disabled instead of killing the
> > whole cluster, as in my case, and to create a parent issue for those
> > ten bugs. Pavel, thanks for the suggestion!
> >
> >
> > On Monday, March 25, 2019, 7:07:20 p.m. GMT+9, Nikolay Izhikov
> > <nizhi...@apache.org> wrote:
> >
> > Guys.
> >
> > We should fix the SYSTEM_WORKER_TERMINATION issues once and for all.
> > It seems we have ten or more "cluster shutdown" bugs in this subsystem
> > since it was introduced.
> >
> > Should we disable it by default in 2.7.5?
> >
> >
> > Mon, Mar 25, 2019 at 13:04, Pavel Kovalenko <jokse...@gmail.com>:
> >
> > > Hi Roman,
> > >
> > > I think this InvalidPartition case can be simply handled in the
> > > GridCacheTtlManager.expire method.
> > > As a workaround, a custom FailureHandler can be configured that will
> > > not stop a node in case such an exception is thrown.
> > >
> > > Mon, Mar 25, 2019 at 08:38, Roman Shtykh <rsht...@yahoo.com.invalid>:
> > >
> > > > Igniters,
> > > >
> > > > Restarting a node while injecting data and having it expire results
> > > > in GridDhtInvalidPartitionException, which terminates nodes with
> > > > SYSTEM_WORKER_TERMINATION one by one, taking the whole cluster
> > > > down. This is really bad, and I didn't find a way to save the
> > > > cluster from disappearing. I created a JIRA issue
> > > > https://issues.apache.org/jira/browse/IGNITE-11620 with a test
> > > > case. Any clues on how to fix this inconsistency when rebalancing?
> > > >
> > > > -- Roman

--
Best regards,
Andrey Kuznetsov.