Re: GridDhtInvalidPartitionException takes the cluster down

Roman Shtykh Mon, 25 Mar 2019 23:17:38 -0700

+1 for having the default settings revisited.
I understand Andrey's reasoning, but sometimes taking nodes down is too 
radical (in my case it was a GridDhtInvalidPartitionException, which could 
probably be ignored for a while during rebalancing <- I might be wrong here).
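
To make the "could be ignored" idea concrete: Pavel's workaround, quoted 
further down, is a custom FailureHandler that does not stop the node for this 
exception. Below is a minimal, Ignite-free sketch of just the decision logic. 
The class TolerantFailurePolicy and its methods are hypothetical (nothing like 
this ships with Ignite), and matching by simple class name is an assumption 
made so the sketch compiles without ignite-core on the classpath.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not an Ignite class: decides whether a failure should
// stop the node or only be logged.
final class TolerantFailurePolicy {
    // Guard against cyclic or pathologically deep cause chains.
    private static final int MAX_CAUSE_DEPTH = 32;

    // Simple class names of exceptions treated as tolerable (an assumption of this sketch).
    private final Set<String> toleratedNames;

    TolerantFailurePolicy(Set<String> toleratedNames) {
        this.toleratedNames = Collections.unmodifiableSet(new HashSet<>(toleratedNames));
    }

    // True if the error, or any exception in its cause chain, has a tolerated name.
    boolean isTolerated(Throwable err) {
        int depth = 0;
        for (Throwable t = err; t != null && depth < MAX_CAUSE_DEPTH; t = t.getCause()) {
            if (toleratedNames.contains(t.getClass().getSimpleName()))
                return true;
            depth++;
        }
        return false;
    }

    // True if the node should be stopped. Unknown failures (and a null error) still stop it.
    boolean shouldStopNode(Throwable err) {
        return !isTolerated(err);
    }
}
```

Presumably this would be called from an implementation of 
org.apache.ignite.failure.FailureHandler, passing failureCtx.error() and 
returning the result from onFailure, with the handler registered through 
IgniteConfiguration.setFailureHandler. That wiring is not verified against 
2.7.x here, and whether a node stays consistent after the exception is 
swallowed is exactly the open question.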


-- Roman
 

    On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis Magda 
<[email protected]> wrote:  
 
 Nikolay,
Thanks for kicking off this discussion. Coincidentally, I had planned to start 
a similar one today and came across this thread.
I agree that the failure handler should be off by default, or the default 
settings have to be revisited. It's true that people are complaining about node 
shutdowns even on moderate workloads. For instance, here is the most recent 
feedback, related to slow checkpointing: 
https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure

At a minimum, let's consider the following:   
   - A failure handler needs to provide hints on how to avoid the same 
shutdown in the future. Take the checkpointing SO thread above. It's unclear 
from the logs how to prevent the same situation next time (the logs could 
suggest parameters to tune, flash drives, etc.).
   - Is there any protection against a full cluster restart? We need to 
distinguish a slow cluster from a stuck one. A node removal should not lead to 
a meltdown of the whole storage.
   - Should we enable the failure handler for things like transactions or PME 
and have it off for checkpointing and other cases? Let's have it enabled for 
cases where we are 100% certain that a node shutdown is the right thing, and 
print out warnings with suggestions whenever we're not confident that the 
removal is appropriate.
--Denis

On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura <[email protected]> wrote:

Failure handlers were introduced to prevent the cluster from hanging;
they kill nodes instead.

If a critical worker was terminated by a GridDhtInvalidPartitionException,
then your node is unable to work anymore.

An unexpected cluster shutdown, with the reasons that failure handlers
provide in the logs, is better than hanging. So the answer is NO. We mustn't
disable failure handlers.

On Mon, Mar 25, 2019 at 2:47 PM Roman Shtykh <[email protected]> wrote:
>
> If disabling it restores the behavior we had before the failure handler was 
> introduced, I think it's better to have it disabled instead of killing the 
> whole cluster, as in my case, and to create a parent issue for those ten bugs.
> Pavel, thanks for the suggestion!
>
>
>
>     On Monday, March 25, 2019, 7:07:20 p.m. GMT+9, Nikolay Izhikov 
><[email protected]> wrote:
>
>  Guys.
>
> We should fix SYSTEM_WORKER_TERMINATION once and for all.
> It seems we have had ten or more "cluster shutdown" bugs in this subsystem
> since it was introduced.
>
> Should we disable it by default in 2.7.5?
>
>
> Mon, Mar 25, 2019 at 13:04, Pavel Kovalenko <[email protected]>:
>
> > Hi Roman,
> >
> > I think this InvalidPartition case can simply be handled
> > in the GridCacheTtlManager.expire method.
> > As a workaround, a custom FailureHandler can be configured that will not
> > stop a node when such an exception is thrown.
> >
> > Mon, Mar 25, 2019 at 08:38, Roman Shtykh <[email protected]>:
> >
> > > Igniters,
> > >
> > > Restarting a node while injecting data and having it expire results in a
> > > GridDhtInvalidPartitionException, which terminates nodes with
> > > SYSTEM_WORKER_TERMINATION one by one, taking the whole cluster down. This
> > > is really bad, and I didn't find a way to save the cluster from
> > > disappearing.
> > > I created a JIRA issue, https://issues.apache.org/jira/browse/IGNITE-11620,
> > > with a test case. Any clues on how to fix this inconsistency when
> > > rebalancing?
> > >
> > > -- Roman
> > >
> >
