Vyacheslav, if you are talking about the particular case I described, I 
believe it has no influence on PME. What could happen is the CleanupWorker 
thread dying (which is not good either). But I believe we are talking in a 
wider scope.

-- Roman
 

    On Tuesday, March 26, 2019, 10:23:30 p.m. GMT+9, Vyacheslav Daradur 
<daradu...@gmail.com> wrote:  
 
 In general I agree with Andrey, the handler is very useful in itself. It
allows us to learn that ‘GridDhtInvalidPartitionException’ is not
processed properly in the PME process by the worker.

Nikolay, look at the code: if the Failure Handler handles an exception, this
means that the while-true loop in the worker’s body has been interrupted by an
unexpected exception and the thread has completed its lifecycle.

Without the Failure Handler, in the current case, the cluster will hang,
because the node is unable to participate in the PME process.

So the problem is the incorrect handling of the exception in the PME task,
which should be fixed.
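
To make the mechanics concrete, here is a minimal sketch (hypothetical
classes, not Ignite's actual PME worker; the exception type is a stand-in
for the internal GridDhtInvalidPartitionException):

    // The while-true loop only survives exceptions it catches explicitly.
    final class InvalidPartitionException extends RuntimeException {
        // Stand-in for the internal GridDhtInvalidPartitionException.
    }

    final class CriticalWorkerSketch implements Runnable {
        private volatile boolean cancelled;

        void cancel() { cancelled = true; }

        @Override public void run() {
            while (!cancelled) {
                try {
                    doIteration(); // One step of the worker's real job.
                }
                catch (InvalidPartitionException e) {
                    // The fix being discussed: handle the expected exception
                    // inside the loop (the partition simply moved during
                    // rebalancing) and keep the worker alive. Anything that
                    // escapes this block ends the loop, the thread dies, and
                    // the failure processor reports SYSTEM_WORKER_TERMINATION.
                }
            }
        }

        private void doIteration() {
            // E.g. expire cache entries; may throw on a moved partition.
        }
    }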


Tue, 26 Mar 2019 at 14:24, Andrey Kuznetsov <stku...@gmail.com>:

> Nikolay,
>
> Feel free to suggest better error messages to indicate internal/critical
> failures. User actions in response to critical failures are rather limited:
> mail the user list or maybe file an issue. As for repetitive warnings, it
> makes sense, but it requires additional machinery to deliver such signals;
> merely spamming the log will have no effect.
>
> Anyway, when experienced committers suggest disabling failure handling and
> hiding existing issues, I feel as if they are pulling my leg.
>
> Best regards,
> Andrey Kuznetsov.
>
> Tue, 26 Mar 2019, 13:30 Nikolay Izhikov nizhi...@apache.org:
>
> > Andrey.
> >
> > > the thread can be made non-critical, and we can restart it every time
> > > it dies
> >
> > Why can't we restart a critical thread?
> > What is the root difference between critical and non-critical threads?
> >
> > > It's much simpler to catch and handle all exceptions in critical threads
> >
> > I don't agree with you.
> > We don't develop Ignite because it is simple!
> > We must spend extra time to make it robust and resilient to failures.
> >
> > > Failure handling is a last-chance tool that reveals internal Ignite errors
> > > 100% agree with you: overcome, but not hide.
> >
> > Logging a stack trace with a proper explanation is not hiding.
> > Killing nodes and the whole cluster is not "handling".
> >
> > > As far as I see from user-list messages, our users are qualified enough
> > > to provide necessary information from their cluster-wide logs.
> >
> > We shouldn't develop our product only for users who are able to read Ignite
> > sources to decipher the failure reason behind "starvation in striped pool".
> >
> > Some of my questions remain unanswered :) :
> >
> > 1. How can a user know it's an Ignite bug? Where should this bug be reported?
> > 2. Do we log it somewhere?
> > 3. Do we warn the user several times before shutdown?
> > 4. "Starvation in striped pool" is not a clear error message, I think.
> > Let's make it more specific!
> > 5. Let's write in the user's log what he or she should do to prevent this
> > error in the future.
> >
> >
> > Tue, 26 Mar 2019 at 12:13, Andrey Kuznetsov <stku...@gmail.com>:
> >
> > > Nikolay,
> > >
> > > > Why can't we restart some thread?
> > > Technically, we can. It's just a matter of design: the thread can be made
> > > non-critical, and we can restart it every time it dies. But such a design
> > > looks poor to me. It's much simpler to catch and handle all exceptions in
> > > critical threads. Failure handling is a last-chance tool that reveals
> > > internal Ignite errors. It's not pleasant for us when users see these
> > > errors, but it's better than hiding them.
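> > >
> > > For illustration only (a hypothetical class, not an Ignite API), the
> > > "non-critical + restart" design would boil down to a supervisor like
> > > this, which keeps the node alive but retries the underlying bug forever:
> > >
> > >     // Sketch of the "restart on death" design being argued against.
> > >     final class RestartingSupervisor {
> > >         Thread supervise(Runnable worker) {
> > >             Thread t = new Thread(() -> {
> > >                 while (true) {
> > >                     try {
> > >                         worker.run(); // Returns only on graceful stop.
> > >                         break;
> > >                     }
> > >                     catch (Throwable e) {
> > >                         // Swallow and restart: the node survives, but the
> > >                         // internal error is hidden and retried endlessly
> > >                         // instead of being reported and fixed.
> > >                     }
> > >                 }
> > >             });
> > >             t.start();
> > >             return t;
> > >         }
> > >     }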
> > >
> > > > Actually, distributed systems are designed to overcome some bugs,
> > > > thread failure, node failure, for example, aren't they?
> > > 100% agree with you: overcome, but not hide.
> > >
> > > > How can a user know it's a bug? Where should this bug be reported?
> > > As far as I see from user-list messages, our users are qualified enough
> > > to provide the necessary information from their cluster-wide logs.
> > >
> > >
> > > Tue, 26 Mar 2019 at 11:19, Nikolay Izhikov <nizhi...@apache.org>:
> > >
> > > > Andrey.
> > > >
> > > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use
> > > > > to wait for a dead thread's magical resurrection.
> > > >
> > > > Why is it unrecoverable?
> > > > Why can't we restart some thread?
> > > > Is there some natural limitation that prevents restarting a system thread?
> > > >
> > > > Actually, distributed systems are designed to overcome some bugs,
> > > > thread failure, node failure, for example, aren't they?
> > > >
> > > > > if under some circumstances node stop leads to cascade cluster crash,
> > > > > then it's a bug
> > > >
> > > > How can a user know it's a bug? Where should this bug be reported?
> > > > Do we log it somewhere?
> > > > Do we warn the user one or several times before shutdown?
> > > >
> > > > This feature literally kills the user experience right now.
> > > >
> > > > If I were a user of a product that just shut down with a poor log, I
> > > > would throw this product away.
> > > > Do we want that for Ignite?
> > > >
> > > > From the SO discussion I see the following error message: ": >>> Possible
> > > > starvation in striped pool."
> > > > Are you sure this message is clear for an Ignite user (not an Ignite
> > > > hacker)?
> > > > What should a user do to prevent this error in the future?
> > > >
> > > > On Tue, 26/03/2019 at 10:10 +0300, Andrey Kuznetsov wrote:
> > > > > By default, the SYSTEM_WORKER_BLOCKED failure type is not handled. I
> > > > > don't like this behavior, but it may be useful sometimes: "frozen"
> > > > > threads have a chance to become active again after the load decreases.
> > > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable; there is no use
> > > > > waiting for a dead thread's magical resurrection. Then, if under some
> > > > > circumstances a node stop leads to a cascading cluster crash, then it's
> > > > > a bug, and it should be fixed. Once and for all. Instead of hiding the
> > > > > flaw we have in the product.
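> > > > >
> > > > > For reference, a minimal configuration sketch, assuming the 2.7-era
> > > > > org.apache.ignite.failure API (the handler choice is just an example):
> > > > >
> > > > >     import java.util.Collections;
> > > > >     import org.apache.ignite.configuration.IgniteConfiguration;
> > > > >     import org.apache.ignite.failure.FailureType;
> > > > >     import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;
> > > > >
> > > > >     final class FailureHandlingConfig {
> > > > >         static IgniteConfiguration configure() {
> > > > >             StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();
> > > > >
> > > > >             // SYSTEM_WORKER_BLOCKED is ignored by default; shown here
> > > > >             // explicitly. SYSTEM_WORKER_TERMINATION is NOT ignored,
> > > > >             // so a dead critical worker still stops the node.
> > > > >             hnd.setIgnoredFailureTypes(
> > > > >                 Collections.singleton(FailureType.SYSTEM_WORKER_BLOCKED));
> > > > >
> > > > >             return new IgniteConfiguration().setFailureHandler(hnd);
> > > > >         }
> > > > >     }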
> > > > >
> > > > > Tue, 26 Mar 2019 at 09:17, Roman Shtykh <rsht...@yahoo.com.invalid>:
> > > > >
> > > > > > +1 for having the default settings revisited.
> > > > > > I understand Andrey's reasoning, but sometimes taking nodes down is
> > > > > > too radical (as in my case it was GridDhtInvalidPartitionException,
> > > > > > which could be ignored for a while when rebalancing <- I might be
> > > > > > wrong here).
> > > > > >
> > > > > > -- Roman
> > > > > >
> > > > > >
> > > > > >    On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis Magda <dma...@apache.org> wrote:
> > > > > >
> > > > > > Nikolay,
> > > > > > Thanks for kicking off this discussion. Surprisingly, I planned to
> > > > > > start a similar one today and incidentally came across this thread.
> > > > > > Agree that the failure handler should be off by default, or the
> > > > > > default settings have to be revisited. It's true that people are
> > > > > > complaining of node shutdowns even on moderate workloads. For
> > > > > > instance, here's the most recent feedback related to slow
> > > > > > checkpointing:
> > > > > >
> > > > > > https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure
> > > > > >
> > > > > > At a minimum, let's consider the following:
> > > > > >    - A failure handler needs to provide hints on how to avoid the
> > > > > > shutdown in the future. Take the checkpointing SO thread above. It's
> > > > > > unclear from the logs how to prevent the same situation next time
> > > > > > (suggest parameters for tuning, flash drives, etc.).
> > > > > >    - Is there any protection against a full cluster restart? We need
> > > > > > to distinguish a slow cluster from a stuck one. A node removal should
> > > > > > not lead to a meltdown of the whole storage.
> > > > > >    - Should we enable the failure handler for things like
> > > > > > transactions or PME and have it off for checkpointing and something
> > > > > > else? Let's have it enabled for cases when we are 100% certain that a
> > > > > > node shutdown is the right thing, and print out warnings with
> > > > > > suggestions whenever we're not confident that the removal is
> > > > > > appropriate.
> > > > > > --Denis
> > > > > >
> > > > > > On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura <ag...@apache.org> wrote:
> > > > > >
> > > > > > Failure handlers were introduced in order to avoid cluster hangs,
> > > > > > and they kill nodes instead.
> > > > > >
> > > > > > If a critical worker was terminated by GridDhtInvalidPartitionException,
> > > > > > then your node is unable to work anymore.
> > > > > >
> > > > > > An unexpected cluster shutdown, with the reasons that failure handlers
> > > > > > provide in the logs, is better than hanging. So the answer is NO. We
> > > > > > mustn't disable failure handlers.
> > > > > >
> > > > > > On Mon, Mar 25, 2019 at 2:47 PM Roman Shtykh <rsht...@yahoo.com.invalid> wrote:
> > > > > > >
> > > > > > > If it sticks to the behavior we had before introducing the failure
> > > > > > > handler, I think it's better to have it disabled instead of killing
> > > > > > > the whole cluster, as in my case, and to create a parent issue for
> > > > > > > those ten bugs. Pavel, thanks for the suggestion!
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >    On Monday, March 25, 2019, 7:07:20 p.m. GMT+9, Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > > > > >
> > > > > > >  Guys.
> > > > > > >
> > > > > > > We should fix the SYSTEM_WORKER_TERMINATION handling once and for
> > > > > > > all. It seems we have had ten or more "cluster shutdown" bugs in
> > > > > > > this subsystem since it was introduced.
> > > > > > >
> > > > > > > Should we disable it by default in 2.7.5?
> > > > > > >
> > > > > > >
> > > > > > > Mon, 25 Mar 2019 at 13:04, Pavel Kovalenko <jokse...@gmail.com>:
> > > > > > >
> > > > > > > > Hi Roman,
> > > > > > > >
> > > > > > > > I think this InvalidPartition case can simply be handled
> > > > > > > > in the GridCacheTtlManager.expire method.
> > > > > > > > As a workaround, a custom FailureHandler can be configured that
> > > > > > > > will not stop a node in case such an exception is thrown.
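> > > > > > > >
> > > > > > > > A minimal sketch of such a handler, assuming the public
> > > > > > > > org.apache.ignite.failure API (matching the cause by class name
> > > > > > > > is illustrative, to avoid depending on the internal class):
> > > > > > > >
> > > > > > > >     import org.apache.ignite.Ignite;
> > > > > > > >     import org.apache.ignite.failure.FailureContext;
> > > > > > > >     import org.apache.ignite.failure.FailureHandler;
> > > > > > > >
> > > > > > > >     // Ignores failures caused by an invalid-partition error;
> > > > > > > >     // any other critical failure still stops the node.
> > > > > > > >     final class IgnoreInvalidPartitionFailureHandler
> > > > > > > >         implements FailureHandler {
> > > > > > > >         @Override public boolean onFailure(Ignite ignite,
> > > > > > > >             FailureContext failureCtx) {
> > > > > > > >             for (Throwable t = failureCtx.error(); t != null;
> > > > > > > >                 t = t.getCause()) {
> > > > > > > >                 if ("GridDhtInvalidPartitionException"
> > > > > > > >                     .equals(t.getClass().getSimpleName()))
> > > > > > > >                     return false; // Do not stop the node.
> > > > > > > >             }
> > > > > > > >             return true; // Invalidate the node as usual.
> > > > > > > >         }
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     // Plug in via IgniteConfiguration:
> > > > > > > >     // cfg.setFailureHandler(new IgnoreInvalidPartitionFailureHandler());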
> > > > > > > >
> > > > > > > > Mon, 25 Mar 2019 at 08:38, Roman Shtykh <rsht...@yahoo.com.invalid>:
> > > > > > > >
> > > > > > > > > Igniters,
> > > > > > > > >
> > > > > > > > > Restarting a node when injecting data and having it expired
> > > > > > > > > results in GridDhtInvalidPartitionException, which terminates
> > > > > > > > > nodes with SYSTEM_WORKER_TERMINATION one by one, taking the
> > > > > > > > > whole cluster down. This is really bad, and I didn't find a way
> > > > > > > > > to save the cluster from disappearing.
> > > > > > > > > I created a JIRA issue
> > > > > > > > > https://issues.apache.org/jira/browse/IGNITE-11620
> > > > > > > > > with a test case. Any clues how to fix this inconsistency when
> > > > > > > > > rebalancing?
> > > > > > > > >
> > > > > > > > > -- Roman
> > > > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > >  Andrey Kuznetsov.
> > >
> >
>
-- 
Best Regards, Vyacheslav D.  
