Re: What happens when a client gets disconnected
Hi,

I think you should provide the full client and server logs, the configuration files and, if possible, a reproducer for the case where a client node with a near cache was able to crash the whole cluster. This looks like it could be a genuine issue, and the best course would be to raise a JIRA ticket for it once the provided data has been analyzed.

BR,
Andrei

(In reply to Matt Nohelty's message of 2019/07/31 14:54:42.)
What happens when a client gets disconnected
Sorry for the long delay in responding to this issue. I will work on replicating this issue in a more controlled test environment and try to grab thread dumps from there.

In a previous post you mentioned that the blocking in this thread dump should only happen when a data node is affected, which is usually a server node, and you also said that near cache consistency is observed continuously. If we have near caching enabled, does that mean clients become data nodes? If that's the case, does that explain why we are seeing blocking when a client crashes or hangs?

Assuming this is related to near caching, is there any configuration to adjust this behavior to give us availability over perfect consistency? Having a failure on one client ripple across the entire system and effectively take down all other clients of that cluster is a major problem. We obviously want to avoid problems like an OOM error or a big GC pause in the client application, but if these things happen we need to be able to absorb them gracefully and limit the blast radius to just that client node.
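Independently of any Ignite setting, one application-level way to limit the blast radius is to put an upper bound on how long a caller waits for a lookup and to fall back to a default when that bound is exceeded. The sketch below is only an illustration of that idea under stated assumptions: BoundedCall and callOrFallback are hypothetical names, the lookup is any blocking call supplied by the application, and nothing here changes how the cluster itself behaves during a topology change.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs a possibly blocking lookup with a bounded wait. */
public final class BoundedCall {
    // Daemon threads so that a stuck lookup cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-call");
        t.setDaemon(true);
        return t;
    });

    private BoundedCall() {}

    /**
     * Returns the lookup result, or the fallback if the lookup fails
     * or does not finish within timeoutMs milliseconds.
     */
    public static <V> V callOrFallback(Callable<V> lookup, long timeoutMs, V fallback) {
        Future<V> future = POOL.submit(lookup);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker; the underlying call may or may not honour it.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The lookup itself threw; treat it like a miss.
            return fallback;
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status and give up on the lookup.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }
}
```

A caller would wrap its cache or query access, for example callOrFallback(() -> loadFromCache(key), 500, defaultValue), where loadFromCache is whatever the application already uses. The trade-off is that a timed-out worker thread may stay blocked until the underlying call returns.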
Re: What happens when a client gets disconnected
Hello!

Near cache consistency is observed continuously, so I can see how there could be blocking if a node greys out.

Can you try to gather thread dumps from the nodes during the pause? Please collect them from as many nodes as possible.

You can force cache lookups to go to the near cache only by using CachePeekMode.NEAR.

Regards,
--
Ilya Kasnacheev

(In reply to MattNohelty's message of Thu, 25 Apr 2019, 20:23.)
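For collecting the requested thread dumps, running jstack against the process is the usual route. If it is easier to trigger a dump from inside the client application (for example when a lookup takes too long), a minimal sketch using only the standard java.lang.management API could look like the following. ThreadDumper is a hypothetical name, and the output format is a simplified one, not the exact jstack layout.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper: renders the state and full stack of every live thread. */
public final class ThreadDumper {
    private ThreadDumper() {}

    public static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false/false: skip locked monitor and synchronizer details,
        // which not every JVM supports.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            // Print every frame; ThreadInfo.toString() truncates long stacks.
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

Writing the result of ThreadDumper.dump() to the application log on each client during a pause would give dumps that can be compared across nodes.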
Re: What happens when a client gets disconnected
According to some of our historical metrics, the blocking looks to have lasted approximately a minute, but the granularity of that monitoring is not very precise, so I don't have an exact time. I can try to go back to our logs and see if I can determine a more accurate period of time.

How does near caching come into play here? If near caching is enabled for these caches, which should have been fully populated so I'd expect a cache hit pretty much every time, would you expect the client to ever go back out to the server nodes? Is there a straightforward way to determine whether a cache lookup hit the near cache or had to go out to the server nodes?
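One way to tell a near cache hit from a remote lookup, building on the CachePeekMode suggestion in the reply, is to peek locally first and only fall through to a regular get on a miss. The sketch below keeps the two lookups abstract so that it runs without a cluster; NearCacheProbe, Result and Source are hypothetical names. With Ignite, the assumption is that nearPeek would be k -> cache.localPeek(k, CachePeekMode.NEAR) and remoteGet would be cache::get.

```java
import java.util.function.Function;

/** Hypothetical helper: reports whether a value came from the near cache or from a remote get. */
public final class NearCacheProbe {
    public enum Source { NEAR, REMOTE }

    /** Value together with the place it was served from. */
    public static final class Result<V> {
        public final V value;
        public final Source source;

        Result(V value, Source source) {
            this.value = value;
            this.source = source;
        }
    }

    private NearCacheProbe() {}

    /**
     * nearPeek must be a purely local lookup that returns null on a miss
     * (assumed: cache.localPeek(key, CachePeekMode.NEAR) in Ignite).
     * remoteGet is the regular lookup that may go to the server nodes.
     */
    public static <K, V> Result<V> lookup(K key, Function<K, V> nearPeek, Function<K, V> remoteGet) {
        V local = nearPeek.apply(key);
        if (local != null) {
            return new Result<>(local, Source.NEAR);
        }
        return new Result<>(remoteGet.apply(key), Source.REMOTE);
    }
}
```

Counting the REMOTE results per cache would show how often a supposedly fully populated near cache is actually bypassed. The check and the get are two separate steps, so the classification is approximate if entries are evicted or invalidated in between.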
Re: What happens when a client gets disconnected
Hello!

"threads on all the other clients block for a period of time" - how long is this period of time?

It definitely makes sense to try a more recent version of Ignite.

The thread dump that you have shown should only be waiting for all data nodes, which are usually server nodes, so it's not obvious how it is related to a client leaving.

Regards,
--
Ilya Kasnacheev

(In reply to Matt Nohelty's message of Tue, 23 Apr 2019, 20:50.)
Re: What happens when a client gets disconnected
What period of time are you asking about? We deploy fairly regularly, so our application servers (i.e. the Ignite clients) get restarted at least weekly, which triggers a disconnect and reconnect event for each. We have not noticed any issues during our regular release process, but in this case we are shutting down the Ignite clients gracefully with Ignite#close. However, it's also possible that something bad happens on one of the application servers, causing it to crash. This is the scenario where we've seen blocking across the cluster. We'd obviously like our application servers to be as independent of one another as possible, and it's problematic if an issue on one server is allowed to ripple across all of them.

I should have mentioned it in my initial post, but we are currently using version 2.4. I received the following response on my Stack Overflow post:

"When topology changes, partition map exchange is triggered internally. It blocks all operations on the cluster. Also in old versions ongoing rebalancing was cancelled. But in the latest versions client connection/disconnection doesn't affect some processes like this. So, it's worth trying the most fresh release"

This comment also mentions PME, so it sounds like you are both referencing the same behavior. However, this comment also states that client connect/disconnect events do not trigger PME in the more recent versions of Ignite. Can anyone confirm that this is true and, if so, in which version this change was made?

Thank you very much for the help.

(In reply to Ilya Kasnacheev's message of Tue, Apr 23, 2019, 10:00 AM.)
Re: What happens when a client gets disconnected
Hello!

What's the period of time?

When a client disconnects, the topology changes, which triggers a wait for partition map exchange (PME), and this delays all further operations until the PME has finished.

Avoid having short-lived clients.

Regards,
--
Ilya Kasnacheev

(In reply to Matt Nohelty's message of Tue, 23 Apr 2019, 03:40.)
What happens when a client gets disconnected
I already posted this question to Stack Overflow here
https://stackoverflow.com/questions/55801760/what-happens-in-apache-ignite-when-a-client-gets-disconnected
but this mailing list is probably more appropriate.

We use Apache Ignite for caching and are seeing some unexpected behavior across all of the clients of the cluster when one of the clients fails. The Ignite cluster itself has three servers, and there are approximately 12 servers connecting to that cluster as clients. The cluster has persistence disabled and many of the caches have near caching enabled.

What we are seeing is that when one of the clients fails (out of memory, high CPU, network connectivity, etc.), threads on all the other clients block for a period of time. During these times, the Ignite servers themselves seem fine, but I see things like the following in the logs:

    Topology snapshot [ver=123, servers=3, clients=11, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]
    Topology snapshot [ver=124, servers=3, clients=10, CPUs=XXX, offheap=XX.XGB, heap=XXX.GB]

The topology itself is clearly changing when a client connects/disconnects, but is there anything happening internally inside the cluster that could cause blocking on other clients? I would expect re-balancing of data when a server disconnects, but not a client.

From a thread dump, I see many threads stuck in the following state:

    java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00078a86ff18> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
        at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7452)
        at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.awaitAllReplies(GridReduceQueryExecutor.java:1056)
        at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.query(GridReduceQueryExecutor.java:733)
        at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$8.iterator(IgniteH2Indexing.java:1339)
        at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:95)
        at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$9.iterator(IgniteH2Indexing.java:1403)
        at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:95)
        at java.lang.Iterable.forEach(Iterable.java:74)...

Any ideas, suggestions, or further avenues to investigate would be much appreciated.
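To line up the blocking periods with client departures, the "Topology snapshot" log lines quoted in this message can be parsed and compared. The sketch below is a rough illustration that assumes only the line format shown above (ver, servers and clients fields in that order); TopologySnapshot and clientLeft are hypothetical names, not part of Ignite.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for "Topology snapshot [ver=..., servers=..., clients=..." log lines. */
public final class TopologySnapshot {
    // Only the first three fields are read; the rest of the line is ignored.
    private static final Pattern LINE =
        Pattern.compile("Topology snapshot \\[ver=(\\d+), servers=(\\d+), clients=(\\d+)");

    public final long version;
    public final int servers;
    public final int clients;

    private TopologySnapshot(long version, int servers, int clients) {
        this.version = version;
        this.servers = servers;
        this.clients = clients;
    }

    /** Returns the parsed snapshot, or empty if the line is not a topology snapshot. */
    public static Optional<TopologySnapshot> parse(String logLine) {
        Matcher m = LINE.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new TopologySnapshot(
            Long.parseLong(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3))));
    }

    /** True if the later snapshot shows fewer clients and an unchanged server count. */
    public static boolean clientLeft(TopologySnapshot before, TopologySnapshot after) {
        return after.version > before.version
            && after.servers == before.servers
            && after.clients < before.clients;
    }
}
```

Feeding consecutive snapshot lines through parse and clientLeft, together with their log timestamps, would show whether each pause on the other clients starts at a client departure.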