What do you think about including patches [1] and [2] in Ignite 2.7.5? Both are about the default failure handler behavior in the SYSTEM_WORKER_BLOCKED and SYSTEM_CRITICAL_OPERATION_TIMEOUT cases.

Andrey Kuznetsov, could you please check whether IGNITE-10003 depends on any other issue that isn't included in the 2.7 release?

[1] https://issues.apache.org/jira/browse/IGNITE-10154
[2] https://issues.apache.org/jira/browse/IGNITE-10003
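For reference, the policy these patches touch can also be pinned down explicitly in user code, so the node-stopping behavior does not depend on a particular release's defaults. A minimal sketch, assuming the setIgnoredFailureTypes(Set<FailureType>) property that AbstractFailureHandler exposes in later releases; the handler choice and the ignored set are illustrative, not a recommendation:

import java.util.EnumSet;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class ExplicitFailureHandlerConfig {
    public static void main(String[] args) {
        // Stop (or halt) the node on critical failures...
        StopNodeOrHaltFailureHandler failureHnd = new StopNodeOrHaltFailureHandler();

        // ...but do not treat blocked system workers or timed-out critical operations
        // as fatal. Assumes setIgnoredFailureTypes is present in the build being used.
        failureHnd.setIgnoredFailureTypes(EnumSet.of(
            FailureType.SYSTEM_WORKER_BLOCKED,
            FailureType.SYSTEM_CRITICAL_OPERATION_TIMEOUT));

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setFailureHandler(failureHnd);

        Ignition.start(cfg);
    }
}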
On Wed, Mar 27, 2019 at 8:11 AM Denis Magda <dma...@apache.org> wrote:

Folks, thanks for sharing details and inputs. This is helpful. Since I spend a lot of time working with Ignite users, I'll look into this topic in a couple of days and propose some changes. In the meantime, here is a fresh report from the user list:
http://apache-ignite-users.70518.x6.nabble.com/Triggering-Rebalancing-Programmatically-get-error-while-requesting-td27651.html

- Denis

On Tue, Mar 26, 2019 at 9:04 AM Andrey Gura <ag...@apache.org> wrote:

CleanupWorker termination can lead to the following effects:

- Queries can retrieve data that should have expired, so the application will behave incorrectly.
- Memory and/or disk can overflow because entries weren't expired.
- Performance degradation is possible due to unmanageable data set growth.

On Tue, Mar 26, 2019 at 4:58 PM Roman Shtykh <rsht...@yahoo.com.invalid> wrote:

Vyacheslav, if you are talking about this particular case I described, I believe it has no influence on PME. What could happen is the CleanupWorker thread being left dead (which is not good either). But I believe we are talking in a wider scope.

-- Roman

On Tuesday, March 26, 2019, 10:23:30 p.m. GMT+9, Vyacheslav Daradur <daradu...@gmail.com> wrote:

In general I agree with Andrey, the handler is very useful in itself. It allowed us to learn that 'GridDhtInvalidPartitionException' is not processed properly by the worker during PME.

Nikolay, look at the code: if the failure handler handles an exception, it means that the while-true loop in the worker's body has been interrupted by an unexpected exception and the thread has completed its lifecycle.

Without the failure handler, in the current case, the cluster would hang because the node is unable to participate in the PME process.

So the problem is the incorrect handling of the exception in the PME task, which should be fixed.

--
Best Regards, Vyacheslav D.
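To make the mechanism Vyacheslav describes concrete: a critical worker is essentially a while-true loop, and the failure processor is only invoked once that loop exits with an unexpected exception, i.e. when the thread is already dead. A simplified, purely illustrative sketch (class and method names are hypothetical, not Ignite's internal API):

// Illustration of the critical-worker pattern discussed above; not Ignite internals.
public class CriticalWorkerSketch implements Runnable {
    private final FailureNotifier failureNotifier;

    public CriticalWorkerSketch(FailureNotifier failureNotifier) {
        this.failureNotifier = failureNotifier;
    }

    @Override public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                processNextTask(); // Expected exceptions are handled inside the loop.
            }
        }
        catch (Throwable t) {
            // The loop is broken and the thread is about to die: all that is left
            // is to report the termination to the failure handler.
            failureNotifier.onCriticalWorkerTermination(t);
        }
    }

    private void processNextTask() {
        // Poll a queue, run PME/TTL/checkpoint work, etc.
    }

    /** Hypothetical callback standing in for the failure processor. */
    public interface FailureNotifier {
        void onCriticalWorkerTermination(Throwable cause);
    }
}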
On Tue, Mar 26, 2019 at 14:24, Andrey Kuznetsov <stku...@gmail.com> wrote:

Nikolay,

Feel free to suggest better error messages to indicate internal/critical failures. User actions in response to critical failures are rather limited: mail the user list or maybe file an issue. As for repetitive warnings, it makes sense, but it requires additional machinery to deliver such signals; merely spamming the log will not have an effect.

Anyway, when experienced committers suggest disabling failure handling and hiding existing issues, I feel as if they are pulling my leg.

Best regards,
Andrey Kuznetsov.

On Tue, Mar 26, 2019 at 13:30, Nikolay Izhikov <nizhi...@apache.org> wrote:

Andrey.

> the thread can be made non-critical, and we can restart it every time it dies

Why can't we restart a critical thread? What is the root difference between critical and non-critical threads?

> It's much simpler to catch and handle all exceptions in critical threads

I don't agree with you. We don't develop Ignite because it is simple! We must spend extra time to make it robust and resilient to failures.

> Failure handling is a last-chance tool that reveals internal Ignite errors
> 100% agree with you: overcome, but not hide.

Logging a stack trace with a proper explanation is not hiding. Killing nodes and the whole cluster is not "handling".

> As far as I see from user-list messages, our users are qualified enough to provide necessary information from their cluster-wide logs.

We shouldn't develop our product only for users who are able to read the Ignite sources to decrypt the failure reason behind "starvation in striped pool".

Some of my questions remain unanswered :) :

1. How can a user know it's an Ignite bug? Where should this bug be reported?
2. Do we log it somewhere?
3. Do we warn the user several times before shutdown?
4. "Starvation in striped pool" is not a clear error message, I think. Let's make it more specific!
5. Let's write to the user log what he or she should do to prevent this error in the future.

On Tue, Mar 26, 2019 at 12:13, Andrey Kuznetsov <stku...@gmail.com> wrote:

Nikolay,

> Why we can't restart some thread?

Technically, we can. It's just a matter of design: the thread can be made non-critical, and we can restart it every time it dies. But such a design looks poor to me. It's much simpler to catch and handle all exceptions in critical threads. Failure handling is a last-chance tool that reveals internal Ignite errors. It's not pleasant for us when users see these errors, but it's better than hiding them.

> Actually, distributed systems are designed to overcome some bugs, thread failure, node failure, for example, isn't it?

100% agree with you: overcome, but not hide.

> How user can know it's a bug? Where this bug should be reported?

As far as I see from user-list messages, our users are qualified enough to provide the necessary information from their cluster-wide logs.

--
Best regards,
Andrey Kuznetsov.

On Tue, Mar 26, 2019 at 11:19, Nikolay Izhikov <nizhi...@apache.org> wrote:

Andrey.

> As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to wait for dead thread's magical resurrection.

Why is it unrecoverable? Why can't we restart such a thread? Is there some natural limitation that prevents restarting a system thread?

Actually, distributed systems are designed to overcome bugs, thread failures, node failures, for example, aren't they?

> if under some circumstances node stop leads to cascade cluster crash, then it's a bug

How can a user know it's a bug? Where should this bug be reported? Do we log it somewhere? Do we warn the user one or several times before shutdown?
This feature is literally killing the user experience right now.

If I were a user of a product that just shut down with a poor log, I would throw this product away. Do we want that for Ignite?

From the SO discussion I see the following error message: ">>> Possible starvation in striped pool." Are you sure this message is clear to an Ignite user (not an Ignite hacker)? What should the user do to prevent this error in the future?

On Tue, 26/03/2019 at 10:10 +0300, Andrey Kuznetsov wrote:

By default, the SYSTEM_WORKER_BLOCKED failure type is not handled. I don't like this behavior, but it may be useful sometimes: "frozen" threads have a chance to become active again after the load decreases. As for SYSTEM_WORKER_TERMINATION, it's unrecoverable; there is no use waiting for a dead thread's magical resurrection. And if under some circumstances a node stop leads to a cascading cluster crash, then it's a bug, and it should be fixed. Once and for all. Instead of hiding the flaw we have in the product.

On Tue, Mar 26, 2019 at 09:17, Roman Shtykh <rsht...@yahoo.com.invalid> wrote:

+1 for having the default settings revisited. I understand Andrey's reasoning, but sometimes taking nodes down is too radical (as in my case: it was GridDhtInvalidPartitionException, which could be ignored for a while when rebalancing <- I might be wrong here).

-- Roman

On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis Magda <dma...@apache.org> wrote:

Nikolay,

Thanks for kicking off this discussion. Surprisingly, I planned to start a similar one today and incidentally came across this thread.

Agree that the failure handler should be off by default, or the default settings have to be revisited. It's true that people are complaining of node shutdowns even on moderate workloads. For instance, here's the most recent feedback related to slow checkpointing:

https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure

At a minimum, let's consider the following:

- A failure handler needs to provide hints on how to avoid the shutdown in the future. Take the checkpointing SO thread above. It's unclear from the logs how to prevent the same situation next time (suggest parameters for tuning, flash drives, etc.).
- Is there any protection against a full cluster restart? We need to distinguish a slow cluster from a stuck one. A node removal should not lead to a meltdown of the whole storage.
- Should we enable the failure handler for things like transactions or PME and have it off for checkpointing and something else? Let's have it enabled for cases where we are 100% certain that a node shutdown is the right thing, and print out warnings with suggestions whenever we're not confident that the removal is appropriate.

--Denis
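On the tuning hints Denis asks for: false-positive SYSTEM_WORKER_BLOCKED reports under heavy checkpointing load can at least be made less aggressive through existing configuration knobs. A hedged sketch of the relevant IgniteConfiguration properties; the timeout values are illustrative only, not recommendations:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class WorkerTimeoutTuning {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration()
            // Base timeout used for failure detection.
            .setFailureDetectionTimeout(30_000L)
            // How long a system worker may stay unresponsive (e.g. while a slow
            // checkpoint holds it up) before SYSTEM_WORKER_BLOCKED is reported.
            // 60 seconds is an illustrative value.
            .setSystemWorkerBlockedTimeout(60_000L)
            // Explicit handler so the policy doesn't depend on release defaults.
            .setFailureHandler(new StopNodeOrHaltFailureHandler());

        Ignition.start(cfg);
    }
}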
On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura <ag...@apache.org> wrote:

Failure handlers were introduced in order to avoid cluster hangs, and they kill nodes instead.

If a critical worker was terminated by GridDhtInvalidPartitionException, then your node is unable to work anymore.

An unexpected cluster shutdown, with the reasons that failure handlers provide in the logs, is better than hanging. So the answer is NO. We mustn't disable failure handlers.

On Mon, Mar 25, 2019 at 2:47 PM Roman Shtykh <rsht...@yahoo.com.invalid> wrote:

If it sticks to the behavior we had before introducing the failure handler, I think it's better to have it disabled instead of killing the whole cluster, as in my case, and create a parent issue for those ten bugs. Pavel, thanks for the suggestion!

On Monday, March 25, 2019, 7:07:20 p.m. GMT+9, Nikolay Izhikov <nizhi...@apache.org> wrote:

Guys.

We should fix the SYSTEM_WORKER_TERMINATION once and for all. It seems we have had ten or more "cluster shutdown" bugs in this subsystem since it was introduced.

Should we disable it by default in 2.7.5?

On Mon, Mar 25, 2019 at 13:04, Pavel Kovalenko <jokse...@gmail.com> wrote:

Hi Roman,

I think this InvalidPartition case can simply be handled in the GridCacheTtlManager.expire method. As a workaround, a custom FailureHandler can be configured that will not stop a node in case such an exception is thrown.
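A rough sketch of the workaround Pavel mentions: a custom handler that refuses to invalidate the node when the failure is caused by GridDhtInvalidPartitionException and delegates everything else to the usual stop-node behavior. Matching the cause by simple class name is an assumption made here to avoid importing the internal exception class; treat the whole thing as illustrative:

import org.apache.ignite.Ignite;
import org.apache.ignite.failure.FailureContext;
import org.apache.ignite.failure.FailureHandler;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

/** Illustrative handler: ignores failures caused by GridDhtInvalidPartitionException. */
public class IgnoreInvalidPartitionFailureHandler implements FailureHandler {
    private final FailureHandler delegate = new StopNodeOrHaltFailureHandler();

    @Override public boolean onFailure(Ignite ignite, FailureContext failureCtx) {
        for (Throwable t = failureCtx.error(); t != null; t = t.getCause()) {
            // Match by simple class name so the sketch doesn't depend on internal API.
            if ("GridDhtInvalidPartitionException".equals(t.getClass().getSimpleName())) {
                ignite.log().warning("Ignoring critical failure caused by invalid partition.",
                    failureCtx.error());

                return false; // Node is not invalidated; keep it running.
            }
        }

        // Everything else: default behavior (stop or halt the node).
        return delegate.onFailure(ignite, failureCtx);
    }
}

It would be plugged in via IgniteConfiguration.setFailureHandler(new IgnoreInvalidPartitionFailureHandler()).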
On Mon, Mar 25, 2019 at 08:38, Roman Shtykh <rsht...@yahoo.com.invalid> wrote:

Igniters,

Restarting a node while injecting data and having it expire results in GridDhtInvalidPartitionException, which terminates nodes with SYSTEM_WORKER_TERMINATION one by one, taking the whole cluster down. This is really bad, and I didn't find a way to save the cluster from disappearing. I created a JIRA issue https://issues.apache.org/jira/browse/IGNITE-11620 with a test case. Any clues how to fix this inconsistency when rebalancing?

-- Roman