Folks, thanks for sharing details and inputs. This is helpful. Since I
spend a lot of time working with Ignite users, I'll look into this topic in
a couple of days and propose some changes. In the meantime, here is a fresh
report from the user list:
http://apache-ignite-users.70518.x6.nabble.com/Triggering-Rebalancing-Programmatically-get-error-while-requesting-td27651.html


-
Denis


On Tue, Mar 26, 2019 at 9:04 AM Andrey Gura <ag...@apache.org> wrote:

> CleanupWorker termination can lead to the following effects:
>
> - Queries can retrieve data that should have expired, so the application
> will behave incorrectly.
> - Memory and/or disk can overflow because entries aren't expired.
> - Performance degradation is possible due to uncontrolled data set growth.
>
> On Tue, Mar 26, 2019 at 4:58 PM Roman Shtykh <rsht...@yahoo.com.invalid>
> wrote:
> >
> > Vyacheslav, if you are talking about this particular case I described, I
> > believe it has no influence on PME. What could happen is the CleanupWorker
> > thread ending up dead (which is not good either). But I believe we are
> > talking about a wider scope.
> >
> > -- Roman
> >
> >
> >     On Tuesday, March 26, 2019, 10:23:30 p.m. GMT+9, Vyacheslav Daradur <
> daradu...@gmail.com> wrote:
> >
> > In general I agree with Andrey, the handler is very useful in itself. It
> > lets us find out that ‘GridDhtInvalidPartitionException’ is not
> > processed properly by the worker during the PME process.
> >
> > Nikolay, look at the code: if the Failure Handler handles an exception, it
> > means that the while-true loop in the worker’s body has been interrupted by
> > an unexpected exception and the thread has finished its lifecycle.
> >
> > Without the Failure Handler, in the current case, the cluster would hang,
> > because the node is unable to participate in the PME process.
> >
> > So, the problem is the incorrect handling of the exception in PME’s task,
> > which should be fixed.
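> >
> > To illustrate (a rough, simplified sketch, not Ignite's actual GridWorker
> > code; the CriticalWorker and Handler names are invented for illustration):
> > a critical worker is essentially a while-true loop, so any exception that
> > escapes the loop ends the thread for good, and the only remaining option
> > is to report the termination to the failure handler.
> >
> >     // Simplified model of a critical worker thread.
> >     final class CriticalWorker implements Runnable {
> >         /** Stand-in for Ignite's failure-handling callback. */
> >         interface Handler { void onTermination(Throwable t); }
> >
> >         private final Handler failureHnd;
> >
> >         CriticalWorker(Handler failureHnd) { this.failureHnd = failureHnd; }
> >
> >         @Override public void run() {
> >             try {
> >                 while (!Thread.currentThread().isInterrupted())
> >                     doWork(); // E.g. expire entries, process PME messages.
> >             }
> >             catch (Throwable t) {
> >                 // The loop is gone and will never run again; all that is
> >                 // left is to report SYSTEM_WORKER_TERMINATION.
> >                 failureHnd.onTermination(t);
> >             }
> >         }
> >
> >         private void doWork() { /* omitted */ }
> >     }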
> >
> >
> > On Tue, Mar 26, 2019 at 14:24, Andrey Kuznetsov <stku...@gmail.com>:
> >
> > > Nikolay,
> > >
> > > Feel free to suggest better error messages to indicate internal/critical
> > > failures. User actions in response to critical failures are rather
> > > limited: mail the user list or maybe file an issue. As for repetitive
> > > warnings, it makes sense, but it requires additional machinery to deliver
> > > such signals; mere spamming to the log will not have an effect.
> > >
> > > Anyway, when experienced committers suggest disabling failure handling
> > > and hiding existing issues, I feel as if they are pulling my leg.
> > >
> > > Best regards,
> > > Andrey Kuznetsov.
> > >
> > > On Tue, Mar 26, 2019 at 13:30, Nikolay Izhikov nizhi...@apache.org:
> > >
> > > > Andrey.
> > > >
> > > > > the thread can be made non-critical, and we can restart it every
> > > > > time it dies
> > > >
> > > > Why can't we restart a critical thread?
> > > > What is the root difference between critical and non-critical threads?
> > > >
> > > > > It's much simpler to catch and handle all exceptions in critical
> > > > > threads
> > > >
> > > > I don't agree with you.
> > > > We don't develop Ignite because it is simple!
> > > > We must spend extra time to make it robust and resilient to failures.
> > > >
> > > > > Failure handling is a last-chance tool that reveals internal Ignite
> > > > > errors
> > > > > 100% agree with you: overcome, but not hide.
> > > >
> > > > Logging a stack trace with a proper explanation is not hiding.
> > > > Killing nodes and the whole cluster is not "handling".
> > > >
> > > > > As far as I see from user-list messages, our users are qualified
> > > > > enough to provide necessary information from their cluster-wide logs.
> > > >
> > > > We shouldn't develop our product only for users who are able to read
> > > > Ignite sources to decipher the failure reason behind "starvation in
> > > > striped pool".
> > > >
> > > > Some of my questions remain unanswered :) :
> > > >
> > > > 1. How can a user know it's an Ignite bug? Where should this bug be
> > > > reported?
> > > > 2. Do we log it somewhere?
> > > > 3. Do we warn the user several times before shutdown?
> > > > 4. I think "starvation in striped pool" is not a clear error message.
> > > > Let's make it more specific!
> > > > 5. Let's write to the user's log what he or she should do to prevent
> > > > this error in the future.
> > > >
> > > >
> > > > On Tue, Mar 26, 2019 at 12:13, Andrey Kuznetsov <stku...@gmail.com>:
> > > >
> > > > > Nikolay,
> > > > >
> > > > > > Why can't we restart some thread?
> > > > > Technically, we can. It's just a matter of design: the thread can be
> > > > > made non-critical, and we can restart it every time it dies. But such
> > > > > a design looks poor to me. It's much simpler to catch and handle all
> > > > > exceptions in critical threads. Failure handling is a last-chance tool
> > > > > that reveals internal Ignite errors. It's not pleasant for us when
> > > > > users see these errors, but it's better than hiding them.
> > > > >
> > > > > > Actually, distributed systems are designed to overcome some bugs,
> > > > > > thread failure, node failure, for example, aren't they?
> > > > > 100% agree with you: overcome, but not hide.
> > > > >
> > > > > > How can a user know it's a bug? Where should this bug be reported?
> > > > > As far as I see from user-list messages, our users are qualified
> > > > > enough to provide necessary information from their cluster-wide logs.
> > > > >
> > > > >
> > > > > > On Tue, Mar 26, 2019 at 11:19, Nikolay Izhikov <nizhi...@apache.org>:
> > > > >
> > > > > > Andrey.
> > > > > >
> > > > > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable; there is
> > > > > > > no use waiting for a dead thread's magical resurrection.
> > > > > >
> > > > > > Why is it unrecoverable?
> > > > > > Why can't we restart some thread?
> > > > > > Is there some kind of natural limitation that prevents restarting a
> > > > > > system thread?
> > > > > >
> > > > > > Actually, distributed systems are designed to overcome some bugs,
> > > > > > thread failure, node failure, for example, aren't they?
> > > > > >
> > > > > > > if under some circumstances a node stop leads to a cascading
> > > > > > > cluster crash, then it's a bug
> > > > > >
> > > > > > How can a user know it's a bug? Where should this bug be reported?
> > > > > > Do we log it somewhere?
> > > > > > Do we warn the user one or several times before shutdown?
> > > > > >
> > > > > > This feature is literally killing the user experience right now.
> > > > > >
> > > > > > If I were a user of a product that just shuts down with a poor log,
> > > > > > I would throw this product away.
> > > > > > Do we want that for Ignite?
> > > > > >
> > > > > > From the SO discussion I see the following error message: ">>>
> > > > > > Possible starvation in striped pool."
> > > > > > Are you sure this message is clear to an Ignite user (not an Ignite
> > > > > > hacker)?
> > > > > > What should a user do to prevent this error in the future?
> > > > > >
> > > > > > On Tue, 26/03/2019 at 10:10 +0300, Andrey Kuznetsov wrote:
> > > > > > > By default, the SYSTEM_WORKER_BLOCKED failure type is not handled.
> > > > > > > I don't like this behavior, but it may be useful sometimes: "frozen"
> > > > > > > threads have a chance to become active again after the load
> > > > > > > decreases. As for SYSTEM_WORKER_TERMINATION, it's unrecoverable;
> > > > > > > there is no use waiting for a dead thread's magical resurrection.
> > > > > > > Then, if under some circumstances a node stop leads to a cascading
> > > > > > > cluster crash, then it's a bug, and it should be fixed. Once and
> > > > > > > for all. Instead of hiding the flaw we have in the product.
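> > > > > > >
> > > > > > > For context, a minimal sketch of the configuration side (assuming
> > > > > > > the 2.7-era API; the exact setIgnoredFailureTypes setter may
> > > > > > > differ slightly between versions). The "not handled by default"
> > > > > > > behavior comes from SYSTEM_WORKER_BLOCKED sitting in the handler's
> > > > > > > ignored set:
> > > > > > >
> > > > > > >     import java.util.EnumSet;
> > > > > > >
> > > > > > >     import org.apache.ignite.Ignition;
> > > > > > >     import org.apache.ignite.configuration.IgniteConfiguration;
> > > > > > >     import org.apache.ignite.failure.FailureType;
> > > > > > >     import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;
> > > > > > >
> > > > > > >     public class FailureHandlerSetup {
> > > > > > >         public static void main(String[] args) {
> > > > > > >             StopNodeOrHaltFailureHandler hnd =
> > > > > > >                 new StopNodeOrHaltFailureHandler();
> > > > > > >
> > > > > > >             // Blocked workers are ignored by default; listing the
> > > > > > >             // type explicitly just makes the default visible.
> > > > > > >             hnd.setIgnoredFailureTypes(
> > > > > > >                 EnumSet.of(FailureType.SYSTEM_WORKER_BLOCKED));
> > > > > > >
> > > > > > >             Ignition.start(new IgniteConfiguration()
> > > > > > >                 .setFailureHandler(hnd));
> > > > > > >         }
> > > > > > >     }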
> > > > > > >
> > > > > > > On Tue, Mar 26, 2019 at 09:17, Roman Shtykh <rsht...@yahoo.com.invalid>:
> > > > > > >
> > > > > > > > + 1 for having the default settings revisited.
> > > > > > > > I understand Andrey's reasoning, but sometimes taking nodes down
> > > > > > > > is too radical (as in my case, it was
> > > > > > > > GridDhtInvalidPartitionException, which could be ignored for a
> > > > > > > > while during rebalancing <- I might be wrong here).
> > > > > > > >
> > > > > > > > -- Roman
> > > > > > > >
> > > > > > > >
> > > > > > > >    On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis
> Magda <
> > > > > > > > dma...@apache.org> wrote:
> > > > > > > >
> > > > > > > > Nikolay,
> > > > > > > > Thanks for kicking off this discussion. Surprisingly, I planned
> > > > > > > > to start a similar one today and incidentally came across this
> > > > > > > > thread.
> > > > > > > > I agree that the failure handler should be off by default, or the
> > > > > > > > default settings have to be revisited. It's true that people are
> > > > > > > > complaining about node shutdowns even under moderate workloads.
> > > > > > > > For instance, here is the most recent feedback related to slow
> > > > > > > > checkpointing:
> > > > > > > >
> > > > > > > > https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure
> > > > > > > >
> > > > > > > > At a minimum, let's consider the following:
> > > > > > > >    - A failure handler needs to provide hints on how to avoid the
> > > > > > > > shutdown in the future. Take the checkpointing SO thread above.
> > > > > > > > It's unclear from the logs how to prevent the same situation next
> > > > > > > > time (suggest parameters for tuning, flash drives, etc.).
> > > > > > > >    - Is there any protection against a full cluster restart? We
> > > > > > > > need to distinguish a slow cluster from a stuck one. A node
> > > > > > > > removal should not lead to a meltdown of the whole storage.
> > > > > > > >    - Should we enable the failure handler for things like
> > > > > > > > transactions or PME and have it off for checkpointing and
> > > > > > > > something else? Let's have it enabled for cases where we are 100%
> > > > > > > > certain that a node shutdown is the right thing, and print out
> > > > > > > > warnings with suggestions whenever we're not confident that the
> > > > > > > > removal is appropriate.
> > > > > > > > --Denis
> > > > > > > >
> > > > > > > > On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura <
> ag...@apache.org>
> > > > > wrote:
> > > > > > > >
> > > > > > > > Failure handlers were introduced in order to avoid cluster
> > > > > > > > hanging, and they kill nodes instead.
> > > > > > > >
> > > > > > > > If a critical worker was terminated by
> > > > > > > > GridDhtInvalidPartitionException, then your node is unable to
> > > > > > > > work anymore.
> > > > > > > >
> > > > > > > > An unexpected cluster shutdown, with the reasons that failure
> > > > > > > > handlers provide in the logs, is better than hanging. So the
> > > > > > > > answer is NO. We mustn't disable failure handlers.
> > > > > > > >
> > > > > > > > On Mon, Mar 25, 2019 at 2:47 PM Roman Shtykh
> > > > > <rsht...@yahoo.com.invalid
> > > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > If it sticks to the behavior we had before introducing the
> > > > > > > > > failure handler, I think it's better to have it disabled
> > > > > > > > > instead of killing the whole cluster, as in my case, and to
> > > > > > > > > create a parent issue for those ten bugs. Pavel, thanks for
> > > > > > > > > the suggestion!
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >    On Monday, March 25, 2019, 7:07:20 p.m. GMT+9, Nikolay
> > > > Izhikov
> > > > > <
> > > > > > > >
> > > > > > > > nizhi...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > >  Guys.
> > > > > > > > >
> > > > > > > > > We should fix the SYSTEM_WORKER_TERMINATION issue once and for
> > > > > > > > > all. It seems we have had ten or more "cluster shutdown" bugs
> > > > > > > > > with this subsystem since it was introduced.
> > > > > > > > >
> > > > > > > > > Should we disable it by default in 2.7.5?
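> > > > > > > > >
> > > > > > > > > For what it's worth, if I read the public API correctly, a
> > > > > > > > > user who wants the old behavior can already opt out today by
> > > > > > > > > plugging in the no-op handler. A sketch, not a recommendation:
> > > > > > > > >
> > > > > > > > >     import org.apache.ignite.Ignition;
> > > > > > > > >     import org.apache.ignite.configuration.IgniteConfiguration;
> > > > > > > > >     import org.apache.ignite.failure.NoOpFailureHandler;
> > > > > > > > >
> > > > > > > > >     public class NoOpFailureHandlerSetup {
> > > > > > > > >         public static void main(String[] args) {
> > > > > > > > >             // Failures are still logged by Ignite, but the
> > > > > > > > >             // node is never stopped.
> > > > > > > > >             Ignition.start(new IgniteConfiguration()
> > > > > > > > >                 .setFailureHandler(new NoOpFailureHandler()));
> > > > > > > > >         }
> > > > > > > > >     }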
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, Mar 25, 2019 at 13:04, Pavel Kovalenko <jokse...@gmail.com>:
> > > > > > > > >
> > > > > > > > > > Hi Roman,
> > > > > > > > > >
> > > > > > > > > > I think this InvalidPartition case can simply be handled
> > > > > > > > > > in the GridCacheTtlManager.expire method.
> > > > > > > > > > As a workaround, a custom FailureHandler can be configured
> > > > > > > > > > that will not stop a node when such an exception is thrown,
> > > > > > > > > > as sketched below.
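> > > > > > > > > >
> > > > > > > > > > A rough sketch of such a handler (the delegate, the log
> > > > > > > > > > message and the by-name cause check are illustrative only;
> > > > > > > > > > the exception class lives in an internal package that moves
> > > > > > > > > > between versions, hence the string comparison):
> > > > > > > > > >
> > > > > > > > > >     import org.apache.ignite.Ignite;
> > > > > > > > > >     import org.apache.ignite.failure.FailureContext;
> > > > > > > > > >     import org.apache.ignite.failure.FailureHandler;
> > > > > > > > > >     import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;
> > > > > > > > > >
> > > > > > > > > >     /** Skips node stop for invalid-partition failures. */
> > > > > > > > > >     public class IgnoreInvalidPartFailureHandler
> > > > > > > > > >         implements FailureHandler {
> > > > > > > > > >         /** Default behavior for every other failure. */
> > > > > > > > > >         private final FailureHandler dflt =
> > > > > > > > > >             new StopNodeOrHaltFailureHandler();
> > > > > > > > > >
> > > > > > > > > >         @Override public boolean onFailure(Ignite ignite,
> > > > > > > > > >             FailureContext ctx) {
> > > > > > > > > >             // Walk the cause chain; compare by simple name.
> > > > > > > > > >             for (Throwable t = ctx.error(); t != null;
> > > > > > > > > >                 t = t.getCause()) {
> > > > > > > > > >                 if ("GridDhtInvalidPartitionException"
> > > > > > > > > >                     .equals(t.getClass().getSimpleName())) {
> > > > > > > > > >                     ignite.log().warning(
> > > > > > > > > >                         "Ignoring invalid partition.", t);
> > > > > > > > > >
> > > > > > > > > >                     // Do not invalidate (stop) the node.
> > > > > > > > > >                     return false;
> > > > > > > > > >                 }
> > > > > > > > > >             }
> > > > > > > > > >
> > > > > > > > > >             return dflt.onFailure(ignite, ctx);
> > > > > > > > > >         }
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > > It would then be set via
> > > > > > > > > > IgniteConfiguration#setFailureHandler.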
> > > > > > > > > >
> > > > > > > > > > On Mon, Mar 25, 2019 at 08:38, Roman Shtykh
> > > > > > > > > > <rsht...@yahoo.com.invalid>:
> > > > > > > > > >
> > > > > > > > > > > Igniters,
> > > > > > > > > > >
> > > > > > > > > > > Restarting a node while injecting data that expires results
> > > > > > > > > > > in GridDhtInvalidPartitionException, which terminates nodes
> > > > > > > > > > > with SYSTEM_WORKER_TERMINATION one by one, taking the whole
> > > > > > > > > > > cluster down. This is really bad, and I didn't find a way
> > > > > > > > > > > to save the cluster from disappearing.
> > > > > > > > > > > I created a JIRA issue
> > > > > > > > > > > https://issues.apache.org/jira/browse/IGNITE-11620
> > > > > > > > > > > with a test case. Any clues on how to fix this inconsistency
> > > > > > > > > > > when rebalancing?
> > > > > > > > > > >
> > > > > > > > > > > -- Roman
> > > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >  Andrey Kuznetsov.
> > > > >
> > > >
> > >
> > --
> > Best Regards, Vyacheslav D.
>
