I see no other dependencies for IGNITE-10003.

Best regards,
Andrey Kuznetsov.

Wed, Mar 27, 2019, 18:25 Andrey Gura ag...@apache.org:

> What do you think about including patches [1] and [2] to Ignite 2.7.5?
> It's all about default failure handler behavior in cases of
> SYSTEM_WORKER_BLOCKED and SYSTEM_CRITICAL_OPERATION_TIMEOUT.
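A minimal sketch of what that behavior change would look like from the user side, assuming AbstractFailureHandler exposes setIgnoredFailureTypes on 2.7.x as proposed in [1]; class and method names are from the public org.apache.ignite.failure API, and the example is illustrative rather than the patch itself:

    import java.util.EnumSet;

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.failure.FailureType;
    import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

    public class LenientFailureHandlerExample {
        public static void main(String[] args) {
            // Keep the default "stop or halt" behavior for most failure types...
            StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();

            // ...but only log (do not stop the node) for these two types.
            hnd.setIgnoredFailureTypes(EnumSet.of(
                FailureType.SYSTEM_WORKER_BLOCKED,
                FailureType.SYSTEM_CRITICAL_OPERATION_TIMEOUT));

            IgniteConfiguration cfg = new IgniteConfiguration()
                .setFailureHandler(hnd);

            // Start a node with the customized failure handler.
            Ignition.start(cfg);
        }
    }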
>
> Andrey Kuznetsov, could you please check whether IGNITE-10003 depends on
> any other issue that isn't included in the 2.7 release?
>
> [1] https://issues.apache.org/jira/browse/IGNITE-10154
> [2] https://issues.apache.org/jira/browse/IGNITE-10003
>
> On Wed, Mar 27, 2019 at 8:11 AM Denis Magda <dma...@apache.org> wrote:
> >
> > Folks, thanks for sharing details and inputs. This is helpful. Since I
> > spend a lot of time working with Ignite users, I'll look into this topic
> > in a couple of days and propose some changes. In the meantime, here is a
> > fresh report on the user list:
> >
> http://apache-ignite-users.70518.x6.nabble.com/Triggering-Rebalancing-Programmatically-get-error-while-requesting-td27651.html
> >
> >
> > -
> > Denis
> >
> >
> > On Tue, Mar 26, 2019 at 9:04 AM Andrey Gura <ag...@apache.org> wrote:
> >
> > > CleanupWorker termination can lead to the following effects:
> > >
> > > - Queries can retrieve data that should have expired, so the
> > > application will behave incorrectly.
> > > - Memory and/or disk can overflow because entries weren't expired.
> > > - Performance degradation is possible due to unmanageable data set
> > > growth.
> > >
> > > On Tue, Mar 26, 2019 at 4:58 PM Roman Shtykh <rsht...@yahoo.com.invalid
> >
> > > wrote:
> > > >
> > > > Vyacheslav, if you are talking about this particular case I
> > > > described, I believe it has no influence on PME. What could happen is
> > > > having the CleanupWorker thread dead (which is not good either). But I
> > > > believe we are talking in a wider scope.
> > > >
> > > > -- Roman
> > > >
> > > >
> > > >     On Tuesday, March 26, 2019, 10:23:30 p.m. GMT+9, Vyacheslav
> Daradur <
> > > daradu...@gmail.com> wrote:
> > > >
> > > >  In general I agree with Andrey, the handler is very useful itself.
> > > > It allows us to learn that ‘GridDhtInvalidPartitionException’ is not
> > > > processed properly by the worker during the PME process.
> > > >
> > > > Nikolay, look at the code: if the Failure Handler handles an
> > > > exception, this means that the while-true loop in the worker’s body
> > > > has been interrupted by an unexpected exception and the thread has
> > > > completed its lifecycle.
> > > >
> > > > Without the Failure Handler, in the current case, the cluster will
> > > > hang, because the node is unable to participate in the PME process.
> > > >
> > > > So, the problem is the incorrect handling of the exception in PME’s
> > > > task, which should be fixed.
> > > >
> > > >
> > > > Tue, Mar 26, 2019 at 14:24, Andrey Kuznetsov <stku...@gmail.com>:
> > > >
> > > > > Nikolay,
> > > > >
> > > > > Feel free to suggest better error messages to indicate
> > > > > internal/critical failures. User actions in response to critical
> > > > > failures are rather limited: mail the user list or maybe file an
> > > > > issue. As for repetitive warnings, it makes sense, but it requires
> > > > > additional machinery to deliver such signals; mere spamming to the
> > > > > log will not have an effect.
> > > > >
> > > > > Anyway, when experienced committers suggest disabling failure
> > > > > handling and hiding existing issues, I feel as if they are pulling
> > > > > my leg.
> > > > >
> > > > > Best regards,
> > > > > Andrey Kuznetsov.
> > > > >
> > > > > Tue, Mar 26, 2019, 13:30 Nikolay Izhikov nizhi...@apache.org:
> > > > >
> > > > > > Andrey.
> > > > > >
> > > > > > >  the thread can be made non-critical, and we can restart it
> > > > > > > every time it dies
> > > > > >
> > > > > > Why can't we restart a critical thread?
> > > > > > What is the root difference between critical and non-critical
> > > > > > threads?
> > > > > >
> > > > > > > It's much simpler to catch and handle all exceptions in
> > > > > > > critical threads
> > > > > >
> > > > > > I don't agree with you.
> > > > > > We don't develop Ignite because it is simple!
> > > > > > We must spend extra time to make it robust and resilient to
> > > > > > failures.
> > > > > >
> > > > > > > Failure handling is a last-chance tool that reveals internal
> > > > > > > Ignite errors
> > > > > > > 100% agree with you: overcome, but not hide.
> > > > > >
> > > > > > Logging a stack trace with a proper explanation is not hiding.
> > > > > > Killing nodes and the whole cluster is not "handling".
> > > > > >
> > > > > > > As far as I see from user-list messages, our users are
> > > > > > > qualified enough to provide necessary information from their
> > > > > > > cluster-wide logs.
> > > > > >
> > > > > > We shouldn't develop our product only for users who are able to
> > > > > > read Ignite sources to decrypt the failure reason behind
> > > > > > "starvation in striped pool".
> > > > > >
> > > > > > Some of my questions remain unanswered :) :
> > > > > >
> > > > > > 1. How can a user know it's an Ignite bug? Where should this bug
> > > > > > be reported?
> > > > > > 2. Do we log it somewhere?
> > > > > > 3. Do we warn the user several times before shutdown?
> > > > > > 4. "Starvation in striped pool" is not a clear error message, I
> > > > > > think. Let's make it more specific!
> > > > > > 5. Let's write to the user log what he or she should do to
> > > > > > prevent this error in the future.
> > > > > >
> > > > > >
> > > > > > Tue, Mar 26, 2019 at 12:13, Andrey Kuznetsov <stku...@gmail.com>:
> > > > > >
> > > > > > > Nikolay,
> > > > > > >
> > > > > > > >  Why can't we restart some thread?
> > > > > > > Technically, we can. It's just a matter of design: the thread
> > > > > > > can be made non-critical, and we can restart it every time it
> > > > > > > dies. But such a design looks poor to me. It's much simpler to
> > > > > > > catch and handle all exceptions in critical threads. Failure
> > > > > > > handling is a last-chance tool that reveals internal Ignite
> > > > > > > errors. It's not pleasant for us when users see these errors,
> > > > > > > but it's better than hiding them.
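For illustration, a generic Java sketch (not Ignite's actual worker code; the class and method names are made up) of the "non-critical, restart on death" design described above; the supervisor silently re-spawns the worker body, so the root cause never reaches any failure handler:

    // Illustrative only: a supervisor that restarts a worker body whenever it dies.
    public final class RestartingWorker {
        public static Thread startSupervised(String name, Runnable body) {
            Thread supervisor = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        body.run(); // the worker's while-true loop lives in here
                    }
                    catch (Throwable t) {
                        // The failure is swallowed and the worker is restarted;
                        // nothing signals that an internal invariant was broken.
                        System.err.println("Worker '" + name + "' died, restarting: " + t);
                    }
                }
            }, name + "-supervisor");

            supervisor.setDaemon(true);
            supervisor.start();

            return supervisor;
        }
    }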
> > > > > > >
> > > > > > > >  Actually, distributed systems are designed to overcome some
> > > > > > > > bugs, thread failure, node failure, for example, aren't they?
> > > > > > > 100% agree with you: overcome, but not hide.
> > > > > > >
> > > > > > > >  How can a user know it's a bug? Where should this bug be
> > > > > > > > reported?
> > > > > > > As far as I see from user-list messages, our users are
> > > > > > > qualified enough to provide necessary information from their
> > > > > > > cluster-wide logs.
> > > > > > >
> > > > > > >
> > > > > > > Tue, Mar 26, 2019 at 11:19, Nikolay Izhikov <nizhi...@apache.org>:
> > > > > > >
> > > > > > > > Andrey.
> > > > > > > >
> > > > > > > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable,
> > > > > > > > > there is no use to wait for dead thread's magical
> > > > > > > > > resurrection.
> > > > > > > >
> > > > > > > > Why is it unrecoverable?
> > > > > > > > Why can't we restart some thread?
> > > > > > > > Is there some kind of natural limitation that prevents
> > > > > > > > restarting a system thread?
> > > > > > > >
> > > > > > > > Actually, distributed systems are designed to overcome some
> > > > > > > > bugs, thread failure, node failure, for example, aren't they?
> > > > > > > > > if under some circumstances node stop leads to cascade
> > > > > > > > > cluster crash, then it's a bug
> > > > > > > >
> > > > > > > > How can a user know it's a bug? Where should this bug be
> > > > > > > > reported?
> > > > > > > > Do we log it somewhere?
> > > > > > > > Do we warn the user one or several times before shutdown?
> > > > > > > >
> > > > > > > > This feature is literally killing the user experience right
> > > > > > > > now.
> > > > > > > >
> > > > > > > > If I were a user of a product that just shuts down with a
> > > > > > > > poor log, I would throw this product away.
> > > > > > > > Do we want that for Ignite?
> > > > > > > >
> > > > > > > > From the SO discussion I see the following error message:
> > > > > > > > ">>> Possible starvation in striped pool."
> > > > > > > > Are you sure this message is clear for an Ignite user (not
> > > > > > > > an Ignite hacker)?
> > > > > > > > What should the user do to prevent this error in the future?
> > > > > > > >
> > > > > > > > On Tue, 26/03/2019 at 10:10 +0300, Andrey Kuznetsov wrote:
> > > > > > > > > By default, the SYSTEM_WORKER_BLOCKED failure type is not
> > > > > > > > > handled. I don't like this behavior, but it may be useful
> > > > > > > > > sometimes: "frozen" threads have a chance to become active
> > > > > > > > > again after the load decreases. As for
> > > > > > > > > SYSTEM_WORKER_TERMINATION, it's unrecoverable; there is no
> > > > > > > > > use waiting for a dead thread's magical resurrection. And
> > > > > > > > > if, under some circumstances, a node stop leads to a
> > > > > > > > > cascading cluster crash, then it's a bug, and it should be
> > > > > > > > > fixed. Once and for all. Instead of hiding the flaw we have
> > > > > > > > > in the product.
> > > > > > > > >
> > > > > > > > > Tue, Mar 26, 2019 at 09:17, Roman Shtykh <rsht...@yahoo.com.invalid>:
> > > > > > > > >
> > > > > > > > > > +1 for having the default settings revisited.
> > > > > > > > > > I understand Andrey's reasoning, but sometimes taking
> > > > > > > > > > nodes down is too radical (as in my case it was
> > > > > > > > > > GridDhtInvalidPartitionException, which could be ignored
> > > > > > > > > > for a while when rebalancing <- I might be wrong here).
> > > > > > > > > >
> > > > > > > > > > -- Roman
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >    On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis
> > > Magda <
> > > > > > > > > > dma...@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > > Nikolay,
> > > > > > > > > > Thanks for kicking off this discussion. Surprisingly, I
> > > > > > > > > > planned to start a similar one today and incidentally
> > > > > > > > > > came across this thread.
> > > > > > > > > > Agree that the failure handler should be off by default
> > > > > > > > > > or the default settings have to be revisited. It's true
> > > > > > > > > > that people are complaining about node shutdowns even on
> > > > > > > > > > moderate workloads. For instance, here is the most recent
> > > > > > > > > > feedback related to slow checkpointing:
> > > > > > > > > >
> > > > > > > > > > https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure
> > > > > > > > > >
> > > > > > > > > > At a minimum, let's consider the following:
> > > > > > > > > >    - A failure handler needs to provide hints on how to
> > > > > > > > > > avoid the shutdown in the future. Take the checkpointing
> > > > > > > > > > SO thread above. It's unclear from the logs how to prevent
> > > > > > > > > > the same situation next time (suggest parameters for
> > > > > > > > > > tuning, flash drives, etc.).
> > > > > > > > > >    - Is there any protection against a full cluster
> > > > > > > > > > restart? We need to distinguish a slow cluster from a
> > > > > > > > > > stuck one. A node removal should not lead to a meltdown of
> > > > > > > > > > the whole storage.
> > > > > > > > > >    - Should we enable the failure handler for things like
> > > > > > > > > > transactions or PME and have it off for checkpointing and
> > > > > > > > > > something else? Let's have it enabled for cases when we
> > > > > > > > > > are 100% certain that a node shutdown is the right thing,
> > > > > > > > > > and print out warnings with suggestions whenever we're not
> > > > > > > > > > confident that the removal is appropriate.
> > > > > > > > > > --Denis
> > > > > > > > > >
> > > > > > > > > > On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura <
> > > ag...@apache.org>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Failure handlers were introduced in order to avoid
> > > > > > > > > > cluster hanging; they kill nodes instead.
> > > > > > > > > >
> > > > > > > > > > If a critical worker was terminated by
> > > > > > > > > > GridDhtInvalidPartitionException, then your node is
> > > > > > > > > > unable to work anymore.
> > > > > > > > > >
> > > > > > > > > > An unexpected cluster shutdown, with the reasons that
> > > > > > > > > > failure handlers provide in the logs, is better than
> > > > > > > > > > hanging. So the answer is NO. We mustn't disable failure
> > > > > > > > > > handlers.
> > > > > > > > > >
> > > > > > > > > > On Mon, Mar 25, 2019 at 2:47 PM Roman Shtykh
> > > > > > > <rsht...@yahoo.com.invalid
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > If it sticks to the behavior we had before introducing
> > > > > > > > > > > the failure handler, I think it's better to have it
> > > > > > > > > > > disabled instead of killing the whole cluster, as in my
> > > > > > > > > > > case, and to create a parent issue for those ten bugs.
> > > > > > > > > > > Pavel, thanks for the suggestion!
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    On Monday, March 25, 2019, 7:07:20 p.m. GMT+9,
> Nikolay
> > > > > > Izhikov
> > > > > > > <
> > > > > > > > > >
> > > > > > > > > > nizhi...@apache.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > >  Guys.
> > > > > > > > > > >
> > > > > > > > > > > We should fix the SYSTEM_WORKER_TERMINATION handling
> > > > > > > > > > > once and for all. It seems we have had ten or more
> > > > > > > > > > > "cluster shutdown" bugs with this subsystem since it
> > > > > > > > > > > was introduced.
> > > > > > > > > > >
> > > > > > > > > > > Should we disable it by default in 2.7.5?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Mon, Mar 25, 2019 at 13:04, Pavel Kovalenko <jokse...@gmail.com>:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Roman,
> > > > > > > > > > > >
> > > > > > > > > > > > I think this InvalidPartition case can simply be
> > > > > > > > > > > > handled in the GridCacheTtlManager.expire method.
> > > > > > > > > > > > As a workaround, a custom FailureHandler can be
> > > > > > > > > > > > configured that will not stop a node when such an
> > > > > > > > > > > > exception is thrown.
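A minimal sketch of such a workaround, assuming the public org.apache.ignite.failure.FailureHandler API; the handler class name here is hypothetical, and the exception is matched by simple class name to avoid depending on its internal package location:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.failure.FailureContext;
    import org.apache.ignite.failure.FailureHandler;
    import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

    public class IgnoreInvalidPartitionFailureHandler implements FailureHandler {
        /** Fall back to the usual behavior for all other failures. */
        private final FailureHandler dflt = new StopNodeOrHaltFailureHandler();

        @Override public boolean onFailure(Ignite ignite, FailureContext failureCtx) {
            // Walk the cause chain looking for the invalid-partition exception.
            for (Throwable t = failureCtx.error(); t != null; t = t.getCause()) {
                if ("GridDhtInvalidPartitionException".equals(t.getClass().getSimpleName())) {
                    ignite.log().warning("Ignoring failure caused by invalid partition: " + failureCtx);

                    return false; // Do not treat the node as broken; no stop/halt.
                }
            }

            return dflt.onFailure(ignite, failureCtx);
        }
    }

It would be plugged in via IgniteConfiguration.setFailureHandler(new IgnoreInvalidPartitionFailureHandler()). Of course, this only hides the symptom; the proper fix belongs in the expire path as noted above.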
> > > > > > > > > > > >
> > > > > > > > > > > > Mon, Mar 25, 2019 at 08:38, Roman Shtykh <rsht...@yahoo.com.invalid>:
> > > > > > > > > > > >
> > > > > > > > > > > > > Igniters,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Restarting a node when injecting data and having
> > > > > > > > > > > > > it expire results in
> > > > > > > > > > > > > GridDhtInvalidPartitionException, which terminates
> > > > > > > > > > > > > nodes with SYSTEM_WORKER_TERMINATION one by one,
> > > > > > > > > > > > > taking the whole cluster down. This is really bad,
> > > > > > > > > > > > > and I didn't find a way to save the cluster from
> > > > > > > > > > > > > disappearing.
> > > > > > > > > > > > > I created a JIRA issue
> > > > > > > > > > > > > https://issues.apache.org/jira/browse/IGNITE-11620
> > > > > > > > > > > > > with a test case. Any clues how to fix this
> > > > > > > > > > > > > inconsistency when rebalancing?
> > > > > > > > > > > > >
> > > > > > > > > > > > > -- Roman
> > > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > >  Andrey Kuznetsov.
> > > > > > >
> > > > > >
> > > > >
> > > > --
> > > > Best Regards, Vyacheslav D.
> > >
>
