Re: Pending Requests queue bloating

2019-10-22 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
Thanks, Ilya. 
I've made the change so that the BinaryObject keeps only a Map, but I'm still seeing 
the heap bloat in my node. I believe it's the coordinator node that is seeing a 
bloated heap, because at the time I observed its old gen to be very high there 
was no application ingestion happening. Another interesting observation is that 
at that very time all the other nodes received a lot of large messages and 
the coordinator sent a lot of large messages (I can see this from 
ClusterLocalNodeMetricsMXBeanImpl). Yet another observation is that the 
coordinator node's off-heap usage was significantly higher than that of the 
other nodes (100GB for the coordinator versus 35-40GB for the others). Why this huge 
difference?

It doesn't seem to be any application data - it looks like internal control messages of 
Ignite. Does this symptom fit the bill of the issue Anton pointed to? If it does, 
what is the workaround for this issue? Alternatively, it would help if you could 
explain the minimal changes I need to make to patch 2.7.5.

If it's not that issue, what else could it be?


From: user@ignite.apache.org  At: 10/22/19 05:31:01  Cc: user@ignite.apache.org
Subject: Re: Pending Requests queue bloating

Hello!

Yes, it can be a problem. Moreover, using a variable BinaryObject composition is 
not advised, since it may cause Ignite to track a large number of object schemas, 
and their propagation is a blocking operation.

It is recommended to put all non-essential/changing fields into a Map.
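
For illustration, a minimal sketch of that approach (the type name, field names and 
cache name are hypothetical placeholders; 'ignite' is an assumed Ignite instance):

// Keep the binary schema fixed: variable, per-job attributes go into a single
// Map field instead of becoming new top-level BinaryObject fields.
IgniteCache<String, BinaryObject> cache = ignite.cache("mainCache").withKeepBinary();

Map<String, Object> attrs = new HashMap<>();
attrs.put("jobSpecificField", 42);            // changing fields live in the map

BinaryObject binObj = ignite.binary().builder("MyRecord")
    .setField("id", "key-1")                  // stable, queryable fields stay top-level
    .setField("attrs", attrs)                 // one Map field absorbs all variable data
    .build();

cache.put("key-1", binObj);

Since every record then shares one binary schema, no new schema has to be registered 
and propagated per ingestion job.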

Regards,
-- 
Ilya Kasnacheev


Mon, 21 Oct 2019 at 19:15, Abhishek Gupta (BLOOMBERG/ 731 LEX):

I should have mentioned that I'm using String->BinaryObject in my cache. My binary 
object itself has a large number of field->value pairs (a few thousand). As I 
run my ingestion jobs using data streamers, depending on the job type some new 
fields might be added to the binary object, but after all job types have run at 
least once, new fields are rarely added to the BinaryObjects. 

Could this be part of the issue, i.e., having a large number of fields? Would it 
help with this problem if I simply stored a Map of key-value pairs instead of a 
BinaryObject with a few thousand fields?


Thanks,
Abhishek


From: user@ignite.apache.org  At: 10/21/19 11:26:44  To: user@ignite.apache.org
Subject: RE: Pending Requests queue bloating

Thanks, Anton, for the response. I'm using 2.7.5. I think you correctly 
identified the issue - I do see MetadataUpdateProposedMessage objects inside. 
What is not very clear is what triggers this and what the workaround is. 
It would help if you could explain the minimal changes I need to make to 
patch 2.7.5, or how to work around it.

Thanks,
Abhishek


From: user@ignite.apache.org  At: 10/16/19 21:14:10  To: Abhishek Gupta (BLOOMBERG/ 731 LEX), user@ignite.apache.org
Subject: RE: Pending Requests queue bloating


Hello,
 
First of all, what is the exact version/build that is being used?
 
I would say that it is hard to precisely identify the issue knowing only the 
retained sizes of some objects, but there are several possibilities for what may 
have happened to the cluster. Also, this queue does not contain the cache entries 
inserted with the data streamer; those do not go through the discovery 
RingMessageWorker, as they don't have to travel across the whole server topology.
 
There are a couple of issues in Ignite JIRA related to memory consumption in 
ServerImpl/ClientImpl, but the one that might possibly fit is 
https://issues.apache.org/jira/browse/IGNITE-11058, since the others are probably 
not related to the TcpDiscoveryCustomEventMessage class.
 
If you still have your heap dump available, check the messages and data 
stored in these custom messages - what kind of messages are there?
 
Since there is significant BinaryMetadata/Holder heap consumption, my guess 
would be that there is something like MetadataUpdateProposedMessage inside, and 
here is another ticket that might be worth checking:
https://issues.apache.org/jira/browse/IGNITE-11531
 
And one last thing: the data streamer tuning points are described in the javadoc; 
check perNodeParallelOperations to throttle on the source side: 
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/IgniteDataStreamer.html
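
For reference, a minimal sketch of that source-side throttling (cache name, value 
types and the numbers are placeholders, not tuned recommendations; 'ignite' and 
'records' are assumed to exist):

// Back-pressure on the client: limit how many batches may be in flight per server node.
try (IgniteDataStreamer<String, BinaryObject> streamer = ignite.dataStreamer("mainCache")) {
    streamer.perNodeParallelOperations(4);   // at most 4 outstanding batches per node
    streamer.perNodeBufferSize(512);         // entries per batch sent to a node
    streamer.autoFlushFrequency(1000);       // flush at least once per second
    streamer.keepBinary(true);               // stream BinaryObject values as-is

    for (Map.Entry<String, BinaryObject> e : records.entrySet())
        streamer.addData(e.getKey(), e.getValue());  // applies back-pressure when too many batches are pending
}

With a low perNodeParallelOperations value, the client slows down instead of letting 
requests pile up on the servers.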
 
Regards,
Anton
 

From: Abhishek Gupta (BLOOMBERG/ 731 LEX)
Sent: Thursday, October 17, 2019 1:29 AM
To: user@ignite.apache.org
Subject: Pending Requests queue bloating
 
Hello,

I'm using G1GC with 24G on each of the 6 nodes in my grid. I saw an issue today 
while ingesting large amounts of data (using DataStreamers) where the old gen 
kept bloating and GC pauses kept getting longer, to the point where the grid 
became unusable. Looking at the heap dump (attached) of one of the nodes, it 
seems like the Pending Messages queue kept growing to the point where the GC 
started to churn a lot. 

 

Questions - 

1. Given that the only operation occurring on the grid at the time was ingestion 
using the data streamer, is this queue basically made up of those messages?

Intermittent "Partition states validation has failed for group" issues

2019-10-21 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
In my otherwise stably running grid (on 2.7.5) I sometimes see an intermittent 
GridDhtPartitionsExchangeFuture warning. The warning occurs periodically and 
then goes away after some time. I couldn't find any documentation or other 
threads about this warning and its implications. 
* What is the trigger for this warning? 
* What are the implications?
* Is there any recommendation around fixing this issue?


2019-10-21 16:09:44.378 [WARN ] [sys-#26240] GridDhtPartitionsExchangeFuture - 
Partition states validation has failed for group: mainCache. Partitions cache 
sizes are inconsistent for Part 0: [id-dgcasp-ob-398-csp-drp-ny-1=43417 
id-dgcasp-ob-080-csp-drp-ny-1=43416 ] Part 1: 
[id-dgcasp-ob-080-csp-drp-ny-1=43720 id-dgcasp-ob-471-csp-drp-ny-1=43724 ] Part 
2: [id-dgcasp-ob-762-csp-drp-ny-1=43388 id-dgcasp-ob-471-csp-drp-ny-1=43376 ] 
Part 3: [id-dgcasp-ob-775-csp-drp-ny-1=43488 
id-dgcasp-ob-403-csp-drp-ny-1=43484 ] Part 4: 
[id-dgcasp-ob-080-csp-drp-ny-1=43338 id-dgcasp-ob-471-csp-drp-ny-1=43339 ] Part 
5: [id-dgcasp-ob-398-csp-drp-ny-1=43105 id-dgcasp-ob-471-csp-drp-ny-1=43106 ] 
Part 7: [id-dgcasp-ob-775-csp-drp-ny-1=43151 
id-dgcasp-ob-762-csp-drp-ny-1=43157 ] Part 8: 
[id-dgcasp-ob-398-csp-drp-ny-1=42975 id-dgcasp-ob-471-csp-drp-ny-1=42976 ] Part 
10: [id-dgcasp-ob-775-csp-drp-ny-1=43033 id-dgcasp-ob-471-csp-drp-ny-1=43036 ] 
Part 11: [id-dgcasp-ob-762-csp-drp-ny-1=43303 
id-dgcasp-ob-471-csp-drp-ny-1=43299 ] Part 12: 
[id-dgcasp-ob-398-csp-drp-ny-1=43262 id-dgcasp-ob-471-csp-drp-ny-1=43265 ] Part 
13: [id-dgcasp-ob-762-csp-drp-ny-1=43123 id-dgcasp-ob-471-csp-drp-ny-1=43120 ] 
Part 15: [id-dgcasp-ob-775-csp-drp-ny-1=43412 
id-dgcasp-ob-398-csp-drp-ny-1=43413 ] Part 16: 
[id-dgcasp-ob-471-csp-drp-ny-1=43934 id-dgcasp-ob-403-csp-drp-ny-1=43933 ] Part 
20: [id-dgcasp-ob-080-csp-drp-ny-1=43146 id-dgcasp-ob-471-csp-drp-ny-1=43148 ] 
Part 21: [id-dgcasp-ob-762-csp-drp-ny-1=43196 
id-dgcasp-ob-080-csp-drp-ny-1=43197 ] Part 22: 
[id-dgcasp-ob-398-csp-drp-ny-1=43233 id-dgcasp-ob-762-csp-drp-ny-1=43234 ] Part 
23: [id-dgcasp-ob-398-csp-drp-ny-1=43127 id-dgcasp-ob-471-csp-drp-ny-1=43128 ] 
Part 24: [id-dgcasp-ob-775-csp-drp-ny-1=43144 
id-dgcasp-ob-398-csp-drp-ny-1=43142 ]  ... TRUNCATED


Thanks,
Abhishek



RE: Pending Requests queue bloating

2019-10-21 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
I should have mentioned that I'm using String->BinaryObject in my cache. My binary 
object itself has a large number of field->value pairs (a few thousand). As I 
run my ingestion jobs using data streamers, depending on the job type some new 
fields might be added to the binary object, but after all job types have run at 
least once, new fields are rarely added to the BinaryObjects. 

Could this be part of the issue, i.e., having a large number of fields? Would it 
help with this problem if I simply stored a Map of key-value pairs instead of a 
BinaryObject with a few thousand fields?


Thanks,
Abhishek


From: user@ignite.apache.org  At: 10/21/19 11:26:44  To: user@ignite.apache.org
Subject: RE: Pending Requests queue bloating

Thanks, Anton, for the response. I'm using 2.7.5. I think you correctly 
identified the issue - I do see MetadataUpdateProposedMessage objects inside. 
What is not very clear is what triggers this and what the workaround is. 
It would help if you could explain the minimal changes I need to make to 
patch 2.7.5, or how to work around it.

Thanks,
Abhishek


From: user@ignite.apache.org  At: 10/16/19 21:14:10  To: Abhishek Gupta (BLOOMBERG/ 731 LEX), user@ignite.apache.org
Subject: RE: Pending Requests queue bloating


Hello,
 
First of all, what is the exact version/build that is being used?
 
I would say that it is hard to precisely identify the issue knowing only the 
retained sizes of some objects, but there are several possibilities for what may 
have happened to the cluster. Also, this queue does not contain the cache entries 
inserted with the data streamer; those do not go through the discovery 
RingMessageWorker, as they don't have to travel across the whole server topology.
 
There are a couple of issues in Ignite JIRA related to memory consumption in 
ServerImpl/ClientImpl, but the one that might possibly fit is 
https://issues.apache.org/jira/browse/IGNITE-11058, since the others are probably 
not related to the TcpDiscoveryCustomEventMessage class.
 
If you still have your heap dump available, check the messages and data 
stored in these custom messages - what kind of messages are there?
 
Since there is significant BinaryMetadata/Holder heap consumption, my guess 
would be that there is something like MetadataUpdateProposedMessage inside, and 
here is another ticket that might be worth checking:
https://issues.apache.org/jira/browse/IGNITE-11531
 
And one last thing: the data streamer tuning points are described in the javadoc; 
check perNodeParallelOperations to throttle on the source side: 
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/IgniteDataStreamer.html
 
Regards,
Anton
 

From: Abhishek Gupta (BLOOMBERG/ 731 LEX)
Sent: Thursday, October 17, 2019 1:29 AM
To: user@ignite.apache.org
Subject: Pending Requests queue bloating
 
Hello,

I'm using G1GC with 24G on each of the 6 nodes in my grid. I saw an issue today 
while ingesting large amounts of data (using DataStreamers) where the old gen 
kept bloating and GC pauses kept getting longer, to the point where the grid 
became unusable. Looking at the heap dump (attached) of one of the nodes, it 
seems like the Pending Messages queue kept growing to the point where the GC 
started to churn a lot. 

 

Questions - 

1. Given that the only operation occurring on the grid at the time was ingestion 
using the data streamer, is this queue basically made up of those messages?

 

2. What is the recommended solution to this problem? 

a. The CPU usage on the server was very low throughout, so what could be 
causing this queue to bloat? (I'm not using any persistence) 

b. Is there a way to throttle these requests on the server such that the 
clients feel back pressure and this queue doesn't fill up?

 

Anything else you can recommend?

 

Thanks,

Abhishek



RE: Pending Requests queue bloating

2019-10-21 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
Thanks, Anton, for the response. I'm using 2.7.5. I think you correctly 
identified the issue - I do see MetadataUpdateProposedMessage objects inside. 
What is not very clear is what triggers this and what the workaround is. 
It would help if you could explain the minimal changes I need to make to 
patch 2.7.5, or how to work around it.

Thanks,
Abhishek


From: user@ignite.apache.org  At: 10/16/19 21:14:10  To: Abhishek Gupta (BLOOMBERG/ 731 LEX), user@ignite.apache.org
Subject: RE: Pending Requests queue bloating


Hello,
 
First of all, what is the exact version/build that is being used?
 
I would say that it is hard to precisely identify the issue knowing only the 
retained sizes of some objects, but there are several possibilities for what may 
have happened to the cluster. Also, this queue does not contain the cache entries 
inserted with the data streamer; those do not go through the discovery 
RingMessageWorker, as they don't have to travel across the whole server topology.
 
There are a couple of issues in Ignite JIRA related to memory consumption in 
ServerImpl/ClientImpl, but the one that might possibly fit is 
https://issues.apache.org/jira/browse/IGNITE-11058, since the others are probably 
not related to the TcpDiscoveryCustomEventMessage class.
 
If you still have your heap dump available, check the messages and data 
stored in these custom messages - what kind of messages are there?
 
Since there is significant BinaryMetadata/Holder heap consumption, my guess 
would be that there is something like MetadataUpdateProposedMessage inside, and 
here is another ticket that might be worth checking:
https://issues.apache.org/jira/browse/IGNITE-11531
 
And one last thing: the data streamer tuning points are described in the javadoc; 
check perNodeParallelOperations to throttle on the source side: 
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/IgniteDataStreamer.html
 
Regards,
Anton
 

From: Abhishek Gupta (BLOOMBERG/ 731 LEX)
Sent: Thursday, October 17, 2019 1:29 AM
To: user@ignite.apache.org
Subject: Pending Requests queue bloating
 
Hello,

I'm using G1GC with 24G on each of the 6 nodes in my grid. I saw an issue today 
while ingesting large amounts of data (using DataStreamers) where the old gen 
kept bloating and GC pauses kept getting longer, to the point where the grid 
became unusable. Looking at the heap dump (attached) of one of the nodes, it 
seems like the Pending Messages queue kept growing to the point where the GC 
started to churn a lot. 

 

Questions - 

1. Given that the only operation occurring on the grid at the time was ingestion 
using the data streamer, is this queue basically made up of those messages?

 

2. What is the recommended solution to this problem? 

a. The CPU usage on the server was very low throughout, so what could be 
causing this queue to bloat? (I'm not using any persistence) 

b. Is there a way to throttle these requests on the server such that the 
clients feel back pressure and this queue doesn't fill up?

 

Anything else you can recommend?

 

Thanks,

Abhishek


Throttling/ Queue length for DataStreamers

2019-09-27 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
Hello,
 I'm using data streamers to ingest large amounts of data in batches, so the 
load on the grid is pretty spiky. Sometimes I'm seeing pretty heavy GCing, which 
causes ingestion on the grid to slow down, but the client continues to pump data. 
That makes the GC pauses worse, because I suspect the queues on the grid keep 
bloating with requests, and it sometimes really turns into a death spiral. It 
seems like some throttling would help with these scenarios. Two questions - 
Two questions - 


1. Is there a way to see the length of the message queue building up for data streamers?
2. Is there a way to throttle this, i.e., set a max queue size or some other way to 
slow down the data streaming clients?

Thanks,
Abhishek



Re: Grid suddenly went in bad state

2019-09-26 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
Thanks for the response, Ilya. 
From a sequence-of-events perspective, the logs first show "Partition states 
validation has failed for group" for many minutes, and only after that do we see 
"Failed to read data from remote connection" caused by 
"java.nio.channels.ClosedChannelException". So the question remains - what 
could cause "Partition states validation has failed for group" in the first 
place? 

I would also appreciate insights into my question 2 below about a 'client' being 
nominated as the coordinator. Is that by design?  

Thanks,
Abhishek


From: ilya.kasnach...@gmail.com  At: 09/26/19 11:33:36  To: Abhishek Gupta (BLOOMBERG/ 731 LEX)
Cc:  user@ignite.apache.org
Subject: Re: Grid suddenly went in bad state

Hello!

"Failed to read data from remote connection" in absence of other errors points 
to potential network problems. Maybe you have a short idle timeout for TCP 
connections? Maybe they get blocked somewhere?
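
One knob on the Ignite side is the communication SPI's idle connection timeout; a 
minimal sketch of raising it (the 10-minute value is an arbitrary example, not a 
recommendation):

// Increase the idle timeout for communication connections so briefly idle
// TCP connections are not closed and reopened.
TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
commSpi.setIdleConnectionTimeout(10 * 60 * 1000L);  // 10 minutes, example value

IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setCommunicationSpi(commSpi);

Ignite ignite = Ignition.start(cfg);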

Regards,
-- 
Ilya Kasnacheev


Tue, 24 Sep 2019 at 20:46, Abhishek Gupta (BLOOMBERG/ 731 LEX):

Hello Folks,
  I would really appreciate any suggestions you could provide about the below.


Thanks,
Abhishek

From: user@ignite.apache.org  At: 09/20/19 15:11:33  To: user@ignite.apache.org
Subject: Re: Grid suddenly went in bad state


Attached are the logs from 3 of the nodes and their GC graphs. The logs from 
the other nodes look pretty much the same. 

Some questions - 
1. What could trigger the "Partition states validation has failed 
for group" warning on node 1? It seems to have come on suddenly.
2. If you look at the logs, there seems to be a change of coordinator: 
   3698 2019-09-19 15:07:04.487 [INFO ] [disco-event-worker-#175] 
GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode 
[id=d667641c-3213-42ce-aea7-2fa232e972d6, addrs=[10.115.226.147, 127.0.0.1, 
10.126.191.211], order=91, loc=false, client=true], cur=ZookeeperCluste
rNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, addrs=[10.115.248.110, 
10.126.230.37, 127.0.0.1], order=109, loc=false, client=false]]
   3713 2019-09-19 15:09:19.813 [INFO ] [disco-event-worker-#175] 
GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode 
[id=2c4a25d1-7701-407f-b728-4d9bcef3cb5b, addrs=[10.115.226.148, 
10.126.191.212, 127.0.0.1], order=94, loc=false, client=true], 
cur=ZookeeperClusterNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, 
addrs=[10.115.248.110, 10.126.230.37, 127.0.0.1], order=109, loc=false, 
client=false]]

What is curious is that this seems to suggest a client was the coordinator. Is 
that by design? Are clients allowed to be coordinators?


3. It just seems like the grid went into a tailspin, as shown in the logs for 
node 1. Any help in understanding what triggered this series of events would be 
much appreciated.


Thanks,
Abhishek


From: user@ignite.apache.org  At: 09/20/19 05:24:59  To: user@ignite.apache.org
Subject: Re: Grid suddenly went in bad state

Hi,

Could you please also attach the logs for the other nodes? And what version of 
Ignite are you currently using?

Also, you've mentioned high GC activity - is it possible to provide GC logs?

Regards,
Igor
On Fri, Sep 20, 2019 at 1:17 AM Abhishek Gupta (BLOOMBERG/ 731 LEX) 
 wrote:

Hello,
  I've got a 6-node grid with maxSize (DataRegionConfiguration) set to 300G on each node. 
The grid seemed to be performing normally until at one point it started logging the 
"Partition states validation has failed for group" warning - see the attached log 
file. This kept happening for about 3 minutes and then stopped (see line 85 in 
the attached log file). Just then a client seems to have connected (see line 135, 
where the connection was accepted). But soon after, it kept logging the exception 
below. After a while (~1 hour), it started logging "Partition states validation 
has failed for group" again (line 284). 


2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] 
GridDhtPartitionsExchangeFuture - Completed partition exchange 
[localNode=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, 
exchange=GridDhtPartitionsExchangeFuture [topVer
=AffinityTopologyVersion [topVer=126, minorTopVer=0], evt=NODE_JOINED, 
evtNode=ZookeeperClusterNode [id=af5f33f4-842a-4691-8e84-da4fb19eafb2, 
addrs=[10.126.90.78, 10.115.76.13, 127.0.0.1], order=126, loc=false, clie
nt=true], done=true], topVer=AffinityTopologyVersion [topVer=126, 
minorTopVer=0], durationFromInit=0]
2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] time - Finished exchange 
init [topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0], crd=true]
2019-09-19 13:28:28.602 [INFO ] [exchange-worker-#176] 
GridCachePartitionExchangeManager - Skipping rebalancing (nothing scheduled) 
[top=AffinityTopologyVersion [topVer=126, minorTopVer=0], force=false, 
evt=NODE_JOI
NED, node=af5f33f4-842a-4691-8e84-da4fb19eafb2]
2019-09-19 13:28:29.513 [INFO ] [grid-nio-worker-tcp-comm-14-#130] 
TcpCommunicationSpi - Accepted incoming communication 

Re: Grid suddenly went in bad state

2019-09-19 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
Attached now.

From: Abhishek Gupta (BLOOMBERG/ 731 LEX)  At: 09/19/19 18:17:18  To: user@ignite.apache.org
Subject: Grid suddenly went in bad state

Hello,
  I've got a 6-node grid with maxSize (DataRegionConfiguration) set to 300G on each node. 
The grid seemed to be performing normally until at one point it started logging the 
"Partition states validation has failed for group" warning - see the attached log 
file. This kept happening for about 3 minutes and then stopped (see line 85 in 
the attached log file). Just then a client seems to have connected (see line 135, 
where the connection was accepted). But soon after, it kept logging the exception 
below. After a while (~1 hour), it started logging "Partition states validation 
has failed for group" again (line 284). 


2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] 
GridDhtPartitionsExchangeFuture - Completed partition exchange 
[localNode=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, 
exchange=GridDhtPartitionsExchangeFuture [topVer
=AffinityTopologyVersion [topVer=126, minorTopVer=0], evt=NODE_JOINED, 
evtNode=ZookeeperClusterNode [id=af5f33f4-842a-4691-8e84-da4fb19eafb2, 
addrs=[10.126.90.78, 10.115.76.13, 127.0.0.1], order=126, loc=false, clie
nt=true], done=true], topVer=AffinityTopologyVersion [topVer=126, 
minorTopVer=0], durationFromInit=0]
2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] time - Finished exchange 
init [topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0], crd=true]
2019-09-19 13:28:28.602 [INFO ] [exchange-worker-#176] 
GridCachePartitionExchangeManager - Skipping rebalancing (nothing scheduled) 
[top=AffinityTopologyVersion [topVer=126, minorTopVer=0], force=false, 
evt=NODE_JOI
NED, node=af5f33f4-842a-4691-8e84-da4fb19eafb2]
2019-09-19 13:28:29.513 [INFO ] [grid-nio-worker-tcp-comm-14-#130] 
TcpCommunicationSpi - Accepted incoming communication connection 
[locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45464]
2019-09-19 13:28:29.540 [INFO ] [grid-nio-worker-tcp-comm-15-#131] 
TcpCommunicationSpi - Accepted incoming communication connection 
[locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45466]
2019-09-19 13:28:29.600 [INFO ] [grid-nio-worker-tcp-comm-16-#132] 
TcpCommunicationSpi - Accepted incoming communication connection 
[locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45472]
2019-09-19 13:28:51.624 [ERROR] [grid-nio-worker-tcp-comm-17-#133] 
TcpCommunicationSpi - Failed to read data from remote connection (will wait for 
2000ms).
org.apache.ignite.IgniteCheckedException: Failed to select events on selector.
at 
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2182)
 ~[ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
 [ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) 
[ignite-core-2.7.5-0-2.jar:2.7.5]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
Caused by: java.nio.channels.ClosedChannelException
at 
java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:197)
 ~[?:1.8.0_172]
at 
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1997)
 ~[ignite-core-2.7.5-0-2.jar:2.7.5]
... 3 more


After a lot of these exceptions and warnings, the node started throwing the 
error below (a client had started ingestion using the data streamer). The 
exceptions below were seen on all the nodes:

2019-09-19 15:10:38.922 [ERROR] [grid-timeout-worker-#115]  - Critical system 
error detected. Will be handled accordingly to configured handler 
[hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=Abst
ractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker 
[name=data-str
eamer-stripe-42, igniteInstanceName=null, finished=false, 
heartbeatTs=1568920228643]]]
org.apache.ignite.IgniteException: GridWorker [name=data-streamer-stripe-42, 
igniteInstanceName=null, finished=false, heartbeatTs=1568920228643]
at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831)
 [ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)
 [ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)
 [ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) 
[ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:221)
 [ignite-core

Grid suddenly went in bad state

2019-09-19 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
Hello,
  I've got a 6-node grid with maxSize (DataRegionConfiguration) set to 300G on each node. 
The grid seemed to be performing normally until at one point it started logging the 
"Partition states validation has failed for group" warning - see the attached log 
file. This kept happening for about 3 minutes and then stopped (see line 85 in 
the attached log file). Just then a client seems to have connected (see line 135, 
where the connection was accepted). But soon after, it kept logging the exception 
below. After a while (~1 hour), it started logging "Partition states validation 
has failed for group" again (line 284). 


2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] 
GridDhtPartitionsExchangeFuture - Completed partition exchange 
[localNode=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, 
exchange=GridDhtPartitionsExchangeFuture [topVer
=AffinityTopologyVersion [topVer=126, minorTopVer=0], evt=NODE_JOINED, 
evtNode=ZookeeperClusterNode [id=af5f33f4-842a-4691-8e84-da4fb19eafb2, 
addrs=[10.126.90.78, 10.115.76.13, 127.0.0.1], order=126, loc=false, clie
nt=true], done=true], topVer=AffinityTopologyVersion [topVer=126, 
minorTopVer=0], durationFromInit=0]
2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] time - Finished exchange 
init [topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0], crd=true]
2019-09-19 13:28:28.602 [INFO ] [exchange-worker-#176] 
GridCachePartitionExchangeManager - Skipping rebalancing (nothing scheduled) 
[top=AffinityTopologyVersion [topVer=126, minorTopVer=0], force=false, 
evt=NODE_JOI
NED, node=af5f33f4-842a-4691-8e84-da4fb19eafb2]
2019-09-19 13:28:29.513 [INFO ] [grid-nio-worker-tcp-comm-14-#130] 
TcpCommunicationSpi - Accepted incoming communication connection 
[locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45464]
2019-09-19 13:28:29.540 [INFO ] [grid-nio-worker-tcp-comm-15-#131] 
TcpCommunicationSpi - Accepted incoming communication connection 
[locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45466]
2019-09-19 13:28:29.600 [INFO ] [grid-nio-worker-tcp-comm-16-#132] 
TcpCommunicationSpi - Accepted incoming communication connection 
[locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45472]
2019-09-19 13:28:51.624 [ERROR] [grid-nio-worker-tcp-comm-17-#133] 
TcpCommunicationSpi - Failed to read data from remote connection (will wait for 
2000ms).
org.apache.ignite.IgniteCheckedException: Failed to select events on selector.
at 
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2182)
 ~[ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
 [ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) 
[ignite-core-2.7.5-0-2.jar:2.7.5]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
Caused by: java.nio.channels.ClosedChannelException
at 
java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:197)
 ~[?:1.8.0_172]
at 
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1997)
 ~[ignite-core-2.7.5-0-2.jar:2.7.5]
... 3 more


After a lot of these exceptions and warnings, the node started throwing the 
error below (a client had started ingestion using the data streamer). The 
exceptions below were seen on all the nodes:

2019-09-19 15:10:38.922 [ERROR] [grid-timeout-worker-#115]  - Critical system 
error detected. Will be handled accordingly to configured handler 
[hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=Abst
ractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker 
[name=data-str
eamer-stripe-42, igniteInstanceName=null, finished=false, 
heartbeatTs=1568920228643]]]
org.apache.ignite.IgniteException: GridWorker [name=data-streamer-stripe-42, 
igniteInstanceName=null, finished=false, heartbeatTs=1568920228643]
at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831)
 [ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)
 [ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)
 [ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) 
[ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:221)
 [ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) 
[ignite-core-2.7.5-0-2.jar:2.7.5]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]


There 

Re: Grid failure on frequent cache creation/destroying

2019-09-11 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
Thanks Ilya! Good to have confirmation.


From: ilya.kasnach...@gmail.com  At: 09/11/19 11:57:16  To: Abhishek Gupta (BLOOMBERG/ 731 LEX)
Cc:  user@ignite.apache.org
Subject: Re: Grid failure on frequent cache creation/destroying

Hello!

I'm afraid that's https://issues.apache.org/jira/browse/IGNITE-12013

There's a maillist thread attached.

Regards,
-- 
Ilya Kasnacheev


Tue, 10 Sep 2019 at 00:33, Abhishek Gupta (BLOOMBERG/ 731 LEX):

Hello,
 We have a grid of 6 nodes with a main cache. We noticed something 
interesting today while regular ingestion was running against the mainCache. 
We have an operational tool that creates and destroys a cache 
(tempCacheByExplorerApp) using the REST API on each of the 6 nodes. While doing 
this today, all the nodes hit a critical error and died. Attached is a log 
snippet from one of the nodes when this happened. 

This issue seems like the one described at 
http://apache-ignite-users.70518.x6.nabble.com/Ignite-2-7-0-server-node-null-pointer-exception-td28899.html
but there isn't any topology change as such happening here, just cache 
creation/destruction.


Appreciate your help.

-Abhi




Grid failure on frequent cache creation/destroying

2019-09-09 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
Hello,
 We have a grid of 6 nodes with a main cache. We noticed something 
interesting today while regular ingestion was running against the mainCache. 
We have an operational tool that creates and destroys a cache 
(tempCacheByExplorerApp) using the REST API on each of the 6 nodes. While doing 
this today, all the nodes hit a critical error and died. Attached is a log 
snippet from one of the nodes when this happened. 

This issue seems like the one described at 
http://apache-ignite-users.70518.x6.nabble.com/Ignite-2-7-0-server-node-null-pointer-exception-td28899.html
but there isn't any topology change as such happening here, just cache 
creation/destruction.


Appreciate your help.

-Abhi



ignite-log-forum.log
Description: Binary data


ZooKeeper Discovery - Handling large number of znodes and their cleanup

2019-08-21 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
Hello,
    I'm using ZK-based discovery for my 6-node grid. It's been working smoothly 
for a while until suddenly my ZK node went OOM. It turns out there were thousands 
of znodes, many with about ~1M of data each, plus there was suddenly a lot of 
ZK request traffic (the tx log was huge). 

One symptom on the grid to note is that when this happened my nodes were stalling 
heavily (this is a separate issue to discuss - they're stalling with lots of long 
JVM pauses, but the GC logs appear alright) and were also receiving heavy writes 
from DataStreamers. 

I see the joinData znode having many thousands of persistent children. I'd like to 
understand why so many znodes were created under 'jd', what's the best way to 
prevent this, and how to clean up these child nodes under jd.


Thanks,
Abhishek




Re: Rebalancing only to backups

2019-08-15 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
Exactly - that's what is available with persistence; I was wondering if it is 
available for in-memory only. So for now I'll just need to configure the rebalance 
mode to something other than NONE and live with point 1.  

Thanks, Evgenii!


From: e.zhuravlev...@gmail.com  At: 08/15/19 11:37:16  To: Abhishek Gupta (BLOOMBERG/ 731 LEX), user@ignite.apache.org
Subject: Re: Rebalancing only to backups

Hi Abhishek,

That's how it works now if you have persistence enabled. Actually, that's the 
main reason BaselineTopology was introduced - we don't want to move a lot of data 
between nodes if we know that the node will return soon after a failure: 
https://apacheignite.readme.io/docs/baseline-topology
If you want to force a rebalance, you can manually change the BaselineTopology.
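
For illustration, a minimal sketch of doing that programmatically (this assumes the 
cluster is active and that you really do want the data rebalanced onto the remaining 
nodes; the control.sh baseline commands offer the same from the command line):

// Reset the baseline to the current topology so the remaining nodes take over
// the partitions of the node that left.
IgniteCluster cluster = ignite.cluster();
cluster.setBaselineTopology(cluster.topologyVersion());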

As far as I know, the BaselineTopology concept is also being introduced for 
in-memory caches and will be released as part of Apache Ignite 2.8. There will 
also be a configurable timeout after which the baseline topology is changed 
automatically, if you want that.

Best Regards,
Evgenii
Thu, 15 Aug 2019 at 17:16, Abhishek Gupta (BLOOMBERG/ 731 LEX):

(pardon me if this mail is by chance a duplicate - it got bounced back when I 
sent it earlier from nabble)

Hello,
 I have a 6-node grid and I've configured it with 1 backup. I want to have
partition rebalancing, but only in the following way.
If one of the six nodes goes down, then some primary and backup partitions
go down with it, but there is no data loss since the backups for those are
present on one of the other five nodes. So there is only a single copy of
these partitions.

i. At this point I do not want the 5 nodes to rebalance all the partitions
amongst themselves such that each partition again has 1 primary and 1 backup.
ii. When the 6th node comes back up, I want the partitions on the other
nodes that currently have only a single copy to be rebalanced onto this
fresh 6th node, so that those partitions get a second copy again.

Why i? Because I don't want a situation of cascading OOMs.
Why ii? For the obvious reason of having a 2nd copy of all partitions.

Is this possible? If not, what's the best way to come close to it?

Thanks,
Abhi




Rebalancing only to backups

2019-08-15 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
(pardon me if this mail is by chance a duplicate - it got bounced back when I 
sent it earlier from nabble)

Hello,
 I have a 6-node grid and I've configured it with 1 backup. I want to have
partition rebalancing, but only in the following way.
If one of the six nodes goes down, then some primary and backup partitions
go down with it, but there is no data loss since the backups for those are
present on one of the other five nodes. So there is only a single copy of
these partitions.

i. At this point I do not want the 5 nodes to rebalance all the partitions
amongst themselves such that each partition again has 1 primary and 1 backup.
ii. When the 6th node comes back up, I want the partitions on the other
nodes that currently have only a single copy to be rebalanced onto this
fresh 6th node, so that those partitions get a second copy again.

Why i? Because I don't want a situation of cascading OOMs.
Why ii? For the obvious reason of having a 2nd copy of all partitions.

Is this possible? If not, what's the best way to come close to it?

Thanks,
Abhi

Issue with peerclassloading when using dataStreamers

2019-05-09 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
I'm using data streamers to ingest data from files into the cache. I need to do 
an 'upsert' on the data being ingested, so I'm using a StreamReceiver too. 

See the attached Java class and log snippet. When we run the code that calls 
addData on the data streamer, after a while we start seeing exceptions being 
thrown. But this doesn't always happen, i.e., peer class loading seems to work 
at times and not at others, with no code change. It seems like there is a race.  

Any suggestions on what might be happening, or is there a better, more reliable 
way to do this?
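
For context, a minimal sketch of the receiver-based upsert pattern described above 
(the value type and merge logic are hypothetical; the attached class is the actual 
implementation, and 'ignite' and 'parsedRow' are assumed to exist):

// StreamTransformer turns a CacheEntryProcessor into a StreamReceiver that runs
// on the node owning each key, merging the incoming value with the stored one
// instead of overwriting it.
StreamReceiver<String, Map<String, Object>> upsertReceiver =
    StreamTransformer.from((entry, args) -> {
        Map<String, Object> incoming = (Map<String, Object>)args[0];  // value passed to addData()
        Map<String, Object> current = entry.getValue();

        if (current == null)
            entry.setValue(incoming);
        else {
            current.putAll(incoming);   // hypothetical merge: new fields win
            entry.setValue(current);
        }
        return null;
    });

try (IgniteDataStreamer<String, Map<String, Object>> streamer = ignite.dataStreamer("mainCache")) {
    streamer.allowOverwrite(true);      // needed so the receiver can update existing entries
    streamer.receiver(upsertReceiver);
    streamer.addData("key-1", parsedRow);  // parsedRow is a Map parsed from the input file
}

One common workaround for the flakiness is to deploy the receiver class on the 
server nodes' classpath so it does not depend on peer class loading at all.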


Thanks,
Abhishek



StreamReceiver-Log.txt
Description: Binary data


CassStreamReceiverPeerLoading.java
Description: Binary data