Re: One failing node stalling the whole cluster

Kristian Rosenvold Mon, 06 Jun 2016 00:28:15 -0700

We're also seeing this total hang of our replicated cache cluster when a
single node becomes unresponsive under heavy memory load. The culprit node
typically does not respond to "jstack", either because of the memory
pressure or because its threads are not reaching safepoints. Sometimes we
need to do kill -9 to get the node down.


I have been planning to do a jstack on the *remaining* nodes to try to
figure out why they do not appear to time out the non-responsive node. I
will upgrade to Ignite 1.6 and see if I can pinpoint the problem there.
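
Where attaching jstack from outside is awkward, a thread dump can also be
taken from inside the JVM with the standard java.lang.management API. The
following is only a minimal sketch: the class name ThreadDumpSketch and the
output format are made up for illustration, nothing in it is Ignite-specific,
and it only helps on a node that is still responsive enough to run code (so
most likely the remaining nodes, not the lethargic one).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {

    /** Returns a jstack-like text dump of all live threads in this JVM. */
    public static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        // Only ask for lock details if this JVM supports reporting them;
        // otherwise dumpAllThreads would throw UnsupportedOperationException.
        boolean monitors = mx.isObjectMonitorUsageSupported();
        boolean synchronizers = mx.isSynchronizerUsageSupported();

        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.dumpAllThreads(monitors, synchronizers)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState());

            // If the thread is blocked or waiting, show what it waits on
            // and which thread (if any) currently owns that lock.
            if (info.getLockName() != null) {
                sb.append(" on ").append(info.getLockName());
            }
            if (info.getLockOwnerName() != null) {
                sb.append(" owned by \"").append(info.getLockOwnerName())
                  .append('"');
            }
            sb.append('\n');

            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

Calling dumpAllThreads() a few times, some seconds apart, on a surviving
node and comparing the output should show which threads stay blocked and
what they are waiting on.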

Kristian




2016-06-05 21:18 GMT+02:00 DLopez <d.lope...@gmail.com>:

> Hi Dennis,
> I agree that it shouldn't happen, but I have been able to reproduce it
> consistently on other machines, and the only "connection" they have is
> that they share the Ignite replicated caches.
>
> One machine is basically reading from several caches and filling in some
> data to be returned; I can have 25 clients requesting data and
> everything is fine. The other one is a different application that
> basically fills the replicated caches from the DB but receives no direct
> requests. Someone forgot to guard a batch job in this second application,
> so it can be run many times, using up all the memory of this second
> application. The strange thing is that when the second application starts
> GCing like crazy, the first one gets slower and slower, up to the point
> where it stops answering requests. If I kill -9 the second application,
> the first one goes back to normal behaviour immediately and can respond
> to 25 simultaneous requests again with normal response times. I can
> restart the second application and repeat the same thing, and the
> behaviour is the same.
>
> So I can tell you it's not an application code or garbage collection
> issue in the other app. The batch job in the second app, which we run
> manually for this test, is not replicated and does nothing related to
> Ignite; it does not even use the replicated caches.
>
> The only thing I can think of that would show this behaviour is the
> sync process in the Ignite caches slowing down or stalling the reading of
> values. As the second app starts experiencing GC issues and slows down
> the Ignite sync process, it affects the other apps reading the caches. So
> I was wondering if the sync mechanism might hold some kind of lock on the
> caches that would prevent reading from them.
>
> I'll see if I can reproduce it in a small-scale experiment, apart from
> testing with Ignite 1.6.
>
> Thanks for your input
>
>
>
> --
> View this message in context:
> http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p5432.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
>
