Re: What happens when a client gets disconnected

2019-08-06 Thread Andrei Aleksandrov

Hi,

You should provide the full client and server logs, configuration files and, if 
possible, a reproducer for the case where a client node with a near cache was 
able to crash the whole cluster.


It looks like there could be an issue here, and the best next step would be to 
raise a JIRA ticket for it after analyzing the provided data.


BR,
Andrei

On 2019/07/31 14:54:42, Matt Nohelty  wrote:
> Sorry for the long delay in responding to this issue. I will work on
> replicating this issue in a more controlled test environment and try to
> grab thread dumps from there.
>
> In a previous post you mentioned that the blocking in this thread dump
> should only happen when a data node is affected which is usually a server
> node and you also said that near cache consistency is observed
> continuously. If we have near caching enabled, does that mean clients
> become data nodes? If that's the case, does that explain why we are seeing
> blocking when a client crashes or hangs?
>
> Assuming this is related to near caching, is there any configuration to
> adjust this behavior to give us availability over perfect consistency?
> Having a failure on one client ripple across the entire system and
> effectively take down all other clients of that cluster is a major problem.
> We obviously want to avoid problems like an OOM error or a big GC pause in
> the client application but if these things happen we need to be able to
> absorb these gracefully and limit the blast radius to just that client
> node.


What happens when a client gets disconnected

2019-07-31 Thread Matt Nohelty
Sorry for the long delay in responding to this issue.  I will work on
replicating this issue in a more controlled test environment and try to
grab thread dumps from there.

In a previous post you mentioned that the blocking in this thread dump
should only happen when a data node is affected which is usually a server
node and you also said that near cache consistency is observed
continuously.  If we have near caching enabled, does that mean clients
become data nodes?  If that's the case, does that explain why we are seeing
blocking when a client crashes or hangs?

Assuming this is related to near caching, is there any configuration to
adjust this behavior to give us availability over perfect consistency?
Having a failure on one client ripple across the entire system and
effectively take down all other clients of that cluster is a major problem.
We obviously want to avoid problems like an OOM error or a big GC pause in
the client application but if these things happen we need to be able to
absorb these gracefully and limit the blast radius to just that client
node.
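
For reference, near caching is enabled per cache on the client side; below is a
minimal sketch of how that typically looks (the cache name, key/value types and
eviction size are illustrative, not taken from this setup):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.eviction.lru.LruEvictionPolicy;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.NearCacheConfiguration;

public class NearCacheClient {
    public static void main(String[] args) {
        // Start this node as a client: it holds no partitions, only near entries.
        IgniteConfiguration cfg = new IgniteConfiguration().setClientMode(true);

        try (Ignite ignite = Ignition.start(cfg)) {
            // Client-side near cache: a small local front for the distributed cache.
            NearCacheConfiguration<Integer, String> nearCfg = new NearCacheConfiguration<>();
            nearCfg.setNearEvictionPolicy(new LruEvictionPolicy<>(10_000));

            // Attach the near cache when obtaining the distributed cache on this client.
            IgniteCache<Integer, String> cache =
                ignite.getOrCreateCache(new CacheConfiguration<Integer, String>("myCache"), nearCfg);

            cache.put(1, "v1");
            cache.get(1); // repeated reads of hot keys can be served from the near cache
        }
    }
}

If near caching turns out to be implicated, the coarse-grained trade-off is simply
not attaching a NearCacheConfiguration to the affected caches and paying the extra
round trip to the servers; whether a finer-grained availability-versus-consistency
knob exists for near caches is a question for the Ignite developers and is not
assumed here.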


Re: What happens when a client gets disconnected

2019-04-26 Thread Ilya Kasnacheev
Hello!

Near cache consistency is observed continuously, so I can see that there can be
blocking if a node greys out.

Can you try to gather thread dumps from the nodes during a pause? Please collect
dumps from as many nodes as possible.

You can force cache lookups to go to the near cache only by using
CachePeekMode.NEAR.
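
A minimal sketch of how that peek mode can be used from client code to check
whether a key would be served from this node's near cache (the helper name is
illustrative):

import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.CachePeekMode;

public final class NearCacheChecks {
    private NearCacheChecks() {}

    /** True if the key is currently held in this node's near cache. */
    public static <K, V> boolean inNearCache(IgniteCache<K, V> cache, K key) {
        // localPeek never goes to remote nodes; with CachePeekMode.NEAR it only
        // looks at the near cache on the local node.
        return cache.localPeek(key, CachePeekMode.NEAR) != null;
    }
}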

Regards,
-- 
Ilya Kasnacheev


Thu, Apr 25, 2019 at 20:23, MattNohelty :

> According to some of our historical metrics, the blocking looks to have
> been
> approximately a minute but the granularity of that monitoring is not super
> precise so I don't have an exact time.  I can try to go back to our logs
> and
> see if I can determine a more accurate period of time.
>
> How does near caching come into play here?  If near caching is enabled for
> these caches, which should have been fully populated so I'd expect a cache
> hit pretty much every time, would you expect the client to ever go back out
> to the server nodes?  Is there a straightforward way to determine if a
> cache lookup hit the near cache or if it had to go out to the server nodes?
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: What happens when a client gets disconnected

2019-04-25 Thread MattNohelty
According to some of our historical metrics, the blocking looks to have been
approximately a minute but the granularity of that monitoring is not super
precise so I don't have an exact time.  I can try to go back to our logs and
see if I can determine a more accurate period of time.
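
One setting that may be related to a pause of roughly this length is the client
failure detection timeout, which controls how long servers wait before dropping
an unresponsive client from the topology; treating it as the cause of the
observed minute is an assumption, not a confirmed diagnosis. A minimal sketch of
where it is configured:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ServerStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration()
            // How long to wait before an unresponsive client node is considered
            // failed (default 30 seconds). Lowering it removes hung clients from
            // the topology sooner, at the cost of dropping clients that are merely
            // slow, e.g. stuck in a long GC pause.
            .setClientFailureDetectionTimeout(10_000);

        Ignition.start(cfg);
    }
}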

How does near caching come into play here?  If near caching is enabled for
these caches, which should have been fully populated so I'd expect a cache
hit pretty much every time, would you expect the client to ever go back out
to the server nodes?  Is there a straightforward way to determine if a
cache lookup hit the near cache or if it had to go out to the server nodes?



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: What happens when a client gets disconnected

2019-04-25 Thread Ilya Kasnacheev
Hello!

"threads on all the other clients block for a period of time" - how long is
this period of time?

It definitely makes sense to try more recent version of Ignite.

The thread dump that you have shown should be only waiting for all data
nodes, which usually are server nodes, so it's not obvious how it is
related to client leaving.

Regards,
-- 
Ilya Kasnacheev


Tue, Apr 23, 2019 at 20:50, Matt Nohelty :

> What period of time are you asking about?  We deploy fairly regularly so
> our application servers (i.e. the Ignite clients) get restarted at least
> weekly which will trigger a disconnect and reconnect event for each.  We
> have not noticed any issues during our regular release process but in this
> case we are shutting down the Ignite clients gracefully with Ignite#close.
> However, it's also possible that something bad happens on an application
> server, causing it to crash.  This is the scenario where we've seen
> blocking across the cluster.  We'd obviously like our application servers
> to be as independent of one another as possible and it's problematic if an
> issue on one server is allowed to ripple across all of them.
>
> I should have mentioned it in my initial post but we are currently using
> version 2.4.  I received the following response on my Stack Overflow post:
> "When topology changes, partition map exchange is triggered internally. It
> blocks all operations on the cluster. Also in old versions ongoing
> rebalancing was cancelled. But in the latest versions client
> connection/disconnection doesn't affect some processes like this. So, it's
> worth trying the most fresh release"
>
> This comment also mentions PME so it sounds like you both are referencing
> the same behavior.  However, this comment also states that client
> connect/disconnect events do not trigger  PME in the more recent versions
> of Ignite.  Can anyone confirm that this is true, and if so, which version
> was this change made in?
>
> Thank you very much for the help.
>
> On Tue, Apr 23, 2019 at 10:00 AM Ilya Kasnacheev <
> ilya.kasnach...@gmail.com> wrote:
>
>> Hello!
>>
>> What's the period of time?
>>
>> When client disconnects, topology will change, which will trigger waiting
>> for PME, which will delay all further operations until PME is finished.
>>
>> Avoid having short-lived clients.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> Tue, Apr 23, 2019 at 03:40, Matt Nohelty :
>>
>>> I already posted this question to stack overflow here
>>> https://stackoverflow.com/questions/55801760/what-happens-in-apache-ignite-when-a-client-gets-disconnected
>>> but this mailing list is probably more appropriate.
>>>
>>> We use Apache Ignite for caching and are seeing some unexpected behavior
>>> across all of the clients of the cluster when one of the clients fails. The
>>> Ignite cluster itself has three servers and there are approximately 12
>>> servers connecting to that cluster as clients. The cluster has persistence
>>> disabled and many of the caches have near caching enabled.
>>>
>>> What we are seeing is that when one of the clients fails (out of memory,
>>> high CPU, network connectivity, etc.), threads on all the other clients
>>> block for a period of time. During these times, the Ignite servers
>>> themselves seem fine but I see things like the following in the logs:
>>>
>>> Topology snapshot [ver=123, servers=3, clients=11, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]
>>> Topology snapshot [ver=124, servers=3, clients=10, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]
>>>
>>> The topology itself is clearly changing when a client
>>> connects/disconnects but is there anything happening internally inside the
>>> cluster that could cause blocking on other clients? I would expect
>>> re-balancing of data when a server disconnects but not a client.
>>>
>>> From a thread dump, I see many threads stuck in the following state:
>>>
>>> java.lang.Thread.State: TIMED_WAITING (parking)
>>> at sun.misc.Unsafe.park(Native Method)- parking to wait for  
>>> <0x00078a86ff18> (a java.util.concurrent.CountDownLatch$Sync)
>>> at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>>> at 
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>>> at 
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>>> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>>> at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7452)
>>> at 
>>> org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.awaitAllReplies(GridReduceQueryExecutor.java:1056)
>>> at 
>>> org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.query(GridReduceQueryExecutor.java:733)
>>> at 
>>> org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$8.iterator(IgniteH2Indexing.java:1339)
>>> at 
>>> org.apache.ignite.internal.processors.cache

Re: What happens when a client gets disconnected

2019-04-23 Thread Matt Nohelty
What period of time are you asking about?  We deploy fairly regularly so
our application servers (i.e. the Ignite clients) get restarted at least
weekly which will trigger a disconnect and reconnect event for each.  We
have not noticed any issues during our regular release process but in this
case we are shutting down the Ignite clients gracefully with Ignite#close.
However, it's also possible that something bad happens on an application
server, causing it to crash.  This is the scenario where we've seen
blocking across the cluster.  We'd obviously like our application servers
to be as independent of one another as possible and it's problematic if an
issue on one server is allowed to ripple across all of them.
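
A minimal sketch of the graceful shutdown path described above, i.e. starting the
node in client mode and making sure Ignite#close runs on a normal application stop
(the instance name is illustrative):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class AppServerIgniteClient {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration()
            .setClientMode(true)
            .setIgniteInstanceName("app-server-client"); // illustrative name

        Ignite ignite = Ignition.start(cfg);

        // Leave the topology cleanly on a normal shutdown (e.g. during a deploy),
        // so the cluster does not have to wait for failure detection to kick in.
        Runtime.getRuntime().addShutdownHook(new Thread(ignite::close));
    }
}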

I should have mentioned it in my initial post but we are currently using
version 2.4.  I received the following response on my Stack Overflow post:
"When topology changes, partition map exchange is triggered internally. It
blocks all operations on the cluster. Also in old versions ongoing
rebalancing was cancelled. But in the latest versions client
connection/disconnection doesn't affect some processes like this. So, it's
worth trying the most fresh release"

This comment also mentions PME so it sounds like you both are referencing
the same behavior.  However, this comment also states that client
connect/disconnect events do not trigger  PME in the more recent versions
of Ignite.  Can anyone confirm that this is true, and if so, which version
was this change made in?

Thank you very much for the help.

On Tue, Apr 23, 2019 at 10:00 AM Ilya Kasnacheev 
wrote:

> Hello!
>
> What's the period of time?
>
> When client disconnects, topology will change, which will trigger waiting
> for PME, which will delay all further operations until PME is finished.
>
> Avoid having short-lived clients.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> Tue, Apr 23, 2019 at 03:40, Matt Nohelty :
>
>> I already posted this question to stack overflow here
>> https://stackoverflow.com/questions/55801760/what-happens-in-apache-ignite-when-a-client-gets-disconnected
>> but this mailing list is probably more appropriate.
>>
>> We use Apache Ignite for caching and are seeing some unexpected behavior
>> across all of the clients of the cluster when one of the clients fails. The
>> Ignite cluster itself has three servers and there are approximately 12
>> servers connecting to that cluster as clients. The cluster has persistence
>> disabled and many of the caches have near caching enabled.
>>
>> What we are seeing is that when one of the clients fails (out of memory,
>> high CPU, network connectivity, etc.), threads on all the other clients
>> block for a period of time. During these times, the Ignite servers
>> themselves seem fine but I see things like the following in the logs:
>>
>> Topology snapshot [ver=123, servers=3, clients=11, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]
>> Topology snapshot [ver=124, servers=3, clients=10, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]
>>
>> The topology itself is clearly changing when a client
>> connects/disconnects but is there anything happening internally inside the
>> cluster that could cause blocking on other clients? I would expect
>> re-balancing of data when a server disconnects but not a client.
>>
>> From a thread dump, I see many threads stuck in the following state:
>>
>> java.lang.Thread.State: TIMED_WAITING (parking)
>> at sun.misc.Unsafe.park(Native Method)- parking to wait for  
>> <0x00078a86ff18> (a java.util.concurrent.CountDownLatch$Sync)
>> at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>> at 
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>> at 
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>> at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7452)
>> at 
>> org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.awaitAllReplies(GridReduceQueryExecutor.java:1056)
>> at 
>> org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.query(GridReduceQueryExecutor.java:733)
>> at 
>> org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$8.iterator(IgniteH2Indexing.java:1339)
>> at 
>> org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:95)
>> at 
>> org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$9.iterator(IgniteH2Indexing.java:1403)
>> at 
>> org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:95)
>> at java.lang.Iterable.forEach(Iterable.java:74)...
>>
>> Any ideas, suggestions, or further avenues to investigate would be much
>> appreciated.
>>
>


Re: What happens when a client gets disconnected

2019-04-23 Thread Ilya Kasnacheev
Hello!

What's the period of time?

When client disconnects, topology will change, which will trigger waiting
for PME, which will delay all further operations until PME is finished.

Avoid having short-lived clients.

Regards,
-- 
Ilya Kasnacheev


Tue, Apr 23, 2019 at 03:40, Matt Nohelty :

> I already posted this question to stack overflow here
> https://stackoverflow.com/questions/55801760/what-happens-in-apache-ignite-when-a-client-gets-disconnected
> but this mailing list is probably more appropriate.
>
> We use Apache Ignite for caching and are seeing some unexpected behavior
> across all of the clients of the cluster when one of the clients fails. The
> Ignite cluster itself has three servers and there are approximately 12
> servers connecting to that cluster as clients. The cluster has persistence
> disabled and many of the caches have near caching enabled.
>
> What we are seeing is that when one of the clients fails (out of memory,
> high CPU, network connectivity, etc.), threads on all the other clients
> block for a period of time. During these times, the Ignite servers
> themselves seem fine but I see things like the following in the logs:
>
> Topology snapshot [ver=123, servers=3, clients=11, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]
> Topology snapshot [ver=124, servers=3, clients=10, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]
>
> The topology itself is clearly changing when a client connects/disconnects
> but is there anything happening internally inside the cluster that could
> cause blocking on other clients? I would expect re-balancing of data when a
> server disconnects but not a client.
>
> From a thread dump, I see many threads stuck in the following state:
>
> java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)- parking to wait for  
> <0x00078a86ff18> (a java.util.concurrent.CountDownLatch$Sync)
> at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
> at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7452)
> at 
> org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.awaitAllReplies(GridReduceQueryExecutor.java:1056)
> at 
> org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.query(GridReduceQueryExecutor.java:733)
> at 
> org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$8.iterator(IgniteH2Indexing.java:1339)
> at 
> org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:95)
> at 
> org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$9.iterator(IgniteH2Indexing.java:1403)
> at 
> org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:95)
> at java.lang.Iterable.forEach(Iterable.java:74)...
>
> Any ideas, suggestions, or further avenues to investigate would be much
> appreciated.
>


What happens when a client gets disconnected

2019-04-22 Thread Matt Nohelty
I already posted this question to stack overflow here
https://stackoverflow.com/questions/55801760/what-happens-in-apache-ignite-when-a-client-gets-disconnected
but this mailing list is probably more appropriate.

We use Apache Ignite for caching and are seeing some unexpected behavior
across all of the clients of the cluster when one of the clients fails. The
Ignite cluster itself has three servers and there are approximately 12
servers connecting to that cluster as clients. The cluster has persistence
disabled and many of the caches have near caching enabled.

What we are seeing is that when one of the clients fails (out of memory,
high CPU, network connectivity, etc.), threads on all the other clients
block for a period of time. During these times, the Ignite servers
themselves seem fine but I see things like the following in the logs:

Topology snapshot [ver=123, servers=3, clients=11, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]
Topology snapshot [ver=124, servers=3, clients=10, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]

The topology itself is clearly changing when a client connects/disconnects
but is there anything happening internally inside the cluster that could
cause blocking on other clients? I would expect re-balancing of data when a
server disconnects but not a client.

From a thread dump, I see many threads stuck in the following state:

java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00078a86ff18> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7452)
at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.awaitAllReplies(GridReduceQueryExecutor.java:1056)
at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.query(GridReduceQueryExecutor.java:733)
at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$8.iterator(IgniteH2Indexing.java:1339)
at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:95)
at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$9.iterator(IgniteH2Indexing.java:1403)
at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:95)
at java.lang.Iterable.forEach(Iterable.java:74)...

Any ideas, suggestions, or further avenues to investigate would be much
appreciated.
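
One further avenue, as a hedged sketch: log discovery events on each client with
timestamps and correlate them with the observed pauses. This assumes the discovery
event types are enabled in the configuration, which they are not by default:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.Event;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

public class TopologyChangeLogger {
    public static void main(String[] args) {
        // Discovery events must be explicitly enabled or listeners never fire.
        IgniteConfiguration cfg = new IgniteConfiguration()
            .setClientMode(true)
            .setIncludeEventTypes(EventType.EVT_NODE_JOINED,
                                  EventType.EVT_NODE_LEFT,
                                  EventType.EVT_NODE_FAILED);

        Ignite ignite = Ignition.start(cfg);

        IgnitePredicate<Event> listener = evt -> {
            DiscoveryEvent de = (DiscoveryEvent) evt;
            System.out.printf("%tT %s, topVer=%d, node=%s%n",
                System.currentTimeMillis(), de.name(), de.topologyVersion(),
                de.eventNode().id());
            return true; // keep listening for further events
        };

        ignite.events().localListen(listener,
            EventType.EVT_NODE_JOINED, EventType.EVT_NODE_LEFT, EventType.EVT_NODE_FAILED);
    }
}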