Re: "Adding entry to partition that is concurrently evicted" error

2020-02-03 Thread Abhishek Gupta (BLOOMBERG/ 919 3RD A)
Thanks Andrei. Looking at my exception (see below), it seems like it is related 
to https://issues.apache.org/jira/browse/IGNITE-11620 in that it occurred while 
expiration was going on.

1. As a workaround, would it be valid to increase my TTL to reduce the 
possibility of this occurring?
2. My worry about using "NoOpFailureHandler" is that the error would still have 
occurred, and it might have left the node in a bad state - which could be just 
as bad as, or worse than, killing the node.

If you can confirm that 1. is a valid line of defense (albeit not air-tight), that 
would be great.
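On 1., for reference, a minimal sketch of what raising the TTL could look like, assuming it is set through the cache expiry policy factory (the 6-hour value, cache name, and key/value types below are illustrative, not our actual settings):

    import java.util.concurrent.TimeUnit;

    import javax.cache.expiry.CreatedExpiryPolicy;
    import javax.cache.expiry.Duration;

    import org.apache.ignite.configuration.CacheConfiguration;

    public class LongerTtlConfig {
        public static CacheConfiguration<String, byte[]> mainCacheConfig() {
            CacheConfiguration<String, byte[]> ccfg = new CacheConfiguration<>("mainCache");

            // Entries expire 6 hours after creation. A longer TTL means fewer
            // expirations racing with concurrent partition eviction, but it does
            // not eliminate the race entirely.
            ccfg.setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.HOURS, 6)));

            // Eager TTL cleanup is what runs on the ttl-cleanup-worker thread seen in the trace.
            ccfg.setEagerTtl(true);

            return ccfg;
        }
    }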

Thanks,
Abhishek

P.S. My exception is below. Note that it occurs in 'expire()' - a stack trace 
similar to the one in IGNITE-11620.


[ERROR] ttl-cleanup-worker-#159 - Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.i.processors.cache.distributed.dht.topology.GridDhtInvalidPartitionException [part=1013, msg=Adding entry to partition that is concurrently evicted [grp=mainCache, part=1013, shouldBeMoving=, belongs=false, topVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1], curTopVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1]
org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtInvalidPartitionException: Adding entry to partition that is concurrently evicted [grp=mainCache, part=1013, shouldBeMoving=, belongs=false, topVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1], curTopVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1]]
    at org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.localPartition0(GridDhtPartitionTopologyImpl.java:950) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.localPartition(GridDhtPartitionTopologyImpl.java:825) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridCachePartitionedConcurrentMap.localPartition(GridCachePartitionedConcurrentMap.java:70) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridCachePartitionedConcurrentMap.putEntryIfObsoleteOrAbsent(GridCachePartitionedConcurrentMap.java:89) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.entryEx(GridCacheAdapter.java:1008) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheAdapter.entryEx(GridDhtCacheAdapter.java:544) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.entryEx(GridCacheAdapter.java:999) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.expireInternal(IgniteCacheOffheapManagerImpl.java:1403) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.expire(IgniteCacheOffheapManagerImpl.java:1347) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:207) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:139) [ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.5-0-2.jar:2.7.5]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]

From: user@ignite.apache.org At: 01/31/20 05:11:57 To: user@ignite.apache.org
Subject: Re: "Adding entry to partition that is concurrently evicted" error

  
Hi,

This problem should be solved in ignite-2.8. I am not sure why this fix isn't 
part of ignite-2.7.6.

https://issues.apache.org/jira/browse/IGNITE-11127

Your cluster was stopped because the failure handler kicked in.

https://apacheignite.readme.io/docs/critical-failures-handling#section-failure-handling

I am not sure about possible workarounds here (you could probably set the 
NoOpFailureHandler). You can also try to start a thread on the developer list:

http://apache-ignite-developers.2346864.n4.nabble.com/Apache-Ignite-2-7-release-td34076i40.html
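For what it's worth, a minimal sketch of what wiring in a NoOpFailureHandler could look like (the rest of the configuration is omitted and purely illustrative; this is not a recommendation):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.failure.NoOpFailureHandler;

    public class NoOpFailureHandlerExample {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Only log critical failures instead of stopping/halting the node.
            // The underlying error still happens, so the node may be left in a
            // degraded state; this only changes how the failure is reacted to.
            cfg.setFailureHandler(new NoOpFailureHandler());

            try (Ignite ignite = Ignition.start(cfg)) {
                // ... node runs with the no-op handler in place.
            }
        }
    }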
  
BR,
Andrei

1/29/2020 1:58 AM, Abhishek Gupta (BLOOMBERG/ 919 3RD A) wrote:

Hello! I've got a 6-node Ignite 2.7.5 grid. I had this strange issue where 
multiple nodes hit the following exception - [ERROR] [sys-stripe-53-#54] 
GridCacheIoManager - Failed to process message [senderI

"Adding entry to partition that is concurrently evicted" error

2020-01-28 Thread Abhishek Gupta (BLOOMBERG/ 919 3RD A)
Hello!
 I've got a 6-node Ignite 2.7.5 grid. I had this strange issue where 
multiple nodes hit the following exception - 

[ERROR] [sys-stripe-53-#54] GridCacheIoManager - Failed to process message [senderId=f4a736b6-cfff-4548-a8b4-358d54d19ac6, messageType=class o.a.i.i.processors.cache.distributed.near.GridNearGetRequest]
org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtInvalidPartitionException: Adding entry to partition that is concurrently evicted [grp=mainCache, part=733, shouldBeMoving=, belongs=false, topVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1], curTopVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1]]

and then died after
2020-01-27 13:30:19.849 [ERROR] [ttl-cleanup-worker-#159]  - JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.i.processors.cache.distributed.dht.topology.GridDhtInvalidPartitionException [part=1013, msg=Adding entry to partition that is concurrently evicted [grp=mainCache, part=1013, shouldBeMoving=, belongs=false, topVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1], curTopVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1]

The sequence of events was simply the following:
One of the nodes (let's call it node 1) was down for 2.5 hours and then restarted. 
After a configured delay of 20 mins, it started to rebalance from the other 5 
nodes. No other nodes joined or left in this period. 40 minutes into the 
rebalance, the above errors started showing up on the other nodes and they just 
bounced, and therefore there was data loss.
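(For context, the 20-minute delay above refers to a configured rebalance delay; assuming it is set via CacheConfiguration.setRebalanceDelay, a sketch with illustrative values rather than our exact config would look like this:)

    import java.util.concurrent.TimeUnit;

    import org.apache.ignite.configuration.CacheConfiguration;

    public class RebalanceDelayConfig {
        public static CacheConfiguration<String, byte[]> mainCacheConfig() {
            CacheConfiguration<String, byte[]> ccfg = new CacheConfiguration<>("mainCache");

            // Postpone rebalancing for 20 minutes after a topology change, so a
            // quick restart does not immediately trigger a full data transfer.
            ccfg.setRebalanceDelay(TimeUnit.MINUTES.toMillis(20));

            // Single backup copy per partition.
            ccfg.setBackups(1);

            return ccfg;
        }
    }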

I found a few links related to this, but nothing that explained the root cause 
or what my workaround could be - 

* http://apache-ignite-users.70518.x6.nabble.com/Adding-entry-to-partition-that-is-concurrently-evicted-td24782.html#a24786
* https://issues.apache.org/jira/browse/IGNITE-9803
* https://issues.apache.org/jira/browse/IGNITE-11620


Thanks,
Abhishek



Re: Throttling getAll

2019-10-28 Thread Abhishek Gupta (BLOOMBERG/ 919 3RD A)
Ack. I've created a JIRA to track this.

https://issues.apache.org/jira/browse/IGNITE-12334


From: user@ignite.apache.org At: 10/28/19 09:08:10 To: user@ignite.apache.org
Subject: Re: Throttling getAll

You might want to open a ticket. Of course, Ignite is open source and I’m sure 
the community would welcome a pull request.

Regards,
Stephen


On 28 Oct 2019, at 12:14, Abhishek Gupta (BLOOMBERG/ 919 3RD A) wrote:



Thanks Ilya for your response.

Even if my value objects were not large, nothing stops clients from doing a 
getAll with say 100,000 keys. Having some kind of throttling would still be 
useful.
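A client-side sketch of the kind of batching I have in mind (the helper and batch size are illustrative, not an existing Ignite API):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    import org.apache.ignite.IgniteCache;

    public class BatchedGetAll {
        /** Splits a large getAll into sub-requests of at most batchSize keys. */
        public static <K, V> Map<K, V> getAllBatched(IgniteCache<K, V> cache, Set<K> keys, int batchSize) {
            Map<K, V> result = new HashMap<>();
            List<K> ordered = new ArrayList<>(keys);

            for (int from = 0; from < ordered.size(); from += batchSize) {
                int to = Math.min(from + batchSize, ordered.size());

                // Each sub-request stays within the benchmarked key budget.
                result.putAll(cache.getAll(new HashSet<>(ordered.subList(from, to))));
            }

            return result;
        }
    }

Of course this only helps with cooperative clients; server-side enforcement would still be needed to guard against errant ones.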

-Abhishek


- Original Message -
From: Ilya Kasnacheev 
To: ABHISHEK GUPTA
CC: user@ignite.apache.org
At: 28-Oct-2019 07:20:24


Hello!

Having very large objects is not a priority use case of Apache Ignite. Thus, it 
is your concern to make sure you don't run out of heap when doing operations on 
Ignite caches.

Regards,
--
Ilya Kasnacheev


Sat, 26 Oct 2019 at 18:51, Abhishek Gupta (BLOOMBERG/ 919 3RD A) wrote:

Hello,
  I've benchmarked my grid for users (clients) to do getAll with up to 100 
keys at a time. My value objects tend to be quite large, and my worry is that 
errant clients might at times do a getAll with a larger number of keys - say 
1000. If that happens, I worry about GC issues/humongous objects/OOM on the 
grid. Is there a way to configure the grid to auto-split these requests into 
smaller batches (a smaller number of keys per batch), or to reject them?


Thanks,
Abhishek




Re: Intermittent "Partition states validation has failed for group" issues

2019-10-28 Thread Abhishek Gupta (BLOOMBERG/ 919 3RD A)
Thanks Ilya. The thing is, I've seen these exceptions without any errors 
occurring before them. Also, I'm not using persistence, and I've seen this 
happen on multiple nodes at the same time. If I bounce multiple nodes, I would 
lose data (since I have only 1 backup). Anything else I could do?


-Abhishek


From: user@ignite.apache.org At: 10/28/19 12:47:23 Cc: user@ignite.apache.org
Subject: Re: Intermittent "Partition states validation has failed for group" 
issues

Hello!

I think this means that backup/primary contents are inconsistent.

The implication is that in case of node failure there will be data 
inconsistency (or maybe it's already there).

The recommendation is to a) check logs for any oddities/exceptions, and b) 
maybe remove problematic partitions' files from persistence and/or restart 
problematic nodes.
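As a rough spot-check, something along these lines could be used to compare per-partition primary and backup sizes across the server nodes (the cache name, config path, and output format below are illustrative):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CachePeekMode;
    import org.apache.ignite.lang.IgniteRunnable;
    import org.apache.ignite.resources.IgniteInstanceResource;

    public class PartitionSizeCheck {
        public static void main(String[] args) {
            // Assumes a client-mode Spring config; the path is illustrative.
            try (Ignite client = Ignition.start("client-config.xml")) {
                // Run on every server node and print its local per-partition
                // sizes, so primary vs. backup counts can be compared by hand.
                client.compute(client.cluster().forServers()).broadcast(new IgniteRunnable() {
                    @IgniteInstanceResource
                    private transient Ignite ignite;

                    @Override public void run() {
                        IgniteCache<Object, Object> cache = ignite.cache("mainCache");
                        int parts = ignite.affinity("mainCache").partitions();

                        for (int p = 0; p < parts; p++) {
                            long primary = cache.localSizeLong(p, CachePeekMode.PRIMARY);
                            long backup = cache.localSizeLong(p, CachePeekMode.BACKUP);

                            if (primary > 0 || backup > 0)
                                System.out.println(ignite.cluster().localNode().consistentId()
                                    + " part=" + p + " primary=" + primary + " backup=" + backup);
                        }
                    }
                });
            }
        }
    }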

Regards,
-- 
Ilya Kasnacheev


Mon, 21 Oct 2019 at 23:17, Abhishek Gupta (BLOOMBERG/ 731 LEX) wrote:

In my otherwise stably running grid (on 2.7.5) I sometimes see an intermittent 
GridDhtPartitionsExchangeFuture warning. The warning occurs periodically and 
then goes away after some time. I couldn't find any documentation or other 
threads about this warning and its implications.
* What is the trigger for this warning? 
* What are the implications?
* Is there any recommendation around fixing this issue?


2019-10-21 16:09:44.378 [WARN ] [sys-#26240] GridDhtPartitionsExchangeFuture - Partition states validation has failed for group: mainCache. Partitions cache sizes are inconsistent for
Part 0: [id-dgcasp-ob-398-csp-drp-ny-1=43417 id-dgcasp-ob-080-csp-drp-ny-1=43416 ]
Part 1: [id-dgcasp-ob-080-csp-drp-ny-1=43720 id-dgcasp-ob-471-csp-drp-ny-1=43724 ]
Part 2: [id-dgcasp-ob-762-csp-drp-ny-1=43388 id-dgcasp-ob-471-csp-drp-ny-1=43376 ]
Part 3: [id-dgcasp-ob-775-csp-drp-ny-1=43488 id-dgcasp-ob-403-csp-drp-ny-1=43484 ]
Part 4: [id-dgcasp-ob-080-csp-drp-ny-1=43338 id-dgcasp-ob-471-csp-drp-ny-1=43339 ]
Part 5: [id-dgcasp-ob-398-csp-drp-ny-1=43105 id-dgcasp-ob-471-csp-drp-ny-1=43106 ]
Part 7: [id-dgcasp-ob-775-csp-drp-ny-1=43151 id-dgcasp-ob-762-csp-drp-ny-1=43157 ]
Part 8: [id-dgcasp-ob-398-csp-drp-ny-1=42975 id-dgcasp-ob-471-csp-drp-ny-1=42976 ]
Part 10: [id-dgcasp-ob-775-csp-drp-ny-1=43033 id-dgcasp-ob-471-csp-drp-ny-1=43036 ]
Part 11: [id-dgcasp-ob-762-csp-drp-ny-1=43303 id-dgcasp-ob-471-csp-drp-ny-1=43299 ]
Part 12: [id-dgcasp-ob-398-csp-drp-ny-1=43262 id-dgcasp-ob-471-csp-drp-ny-1=43265 ]
Part 13: [id-dgcasp-ob-762-csp-drp-ny-1=43123 id-dgcasp-ob-471-csp-drp-ny-1=43120 ]
Part 15: [id-dgcasp-ob-775-csp-drp-ny-1=43412 id-dgcasp-ob-398-csp-drp-ny-1=43413 ]
Part 16: [id-dgcasp-ob-471-csp-drp-ny-1=43934 id-dgcasp-ob-403-csp-drp-ny-1=43933 ]
Part 20: [id-dgcasp-ob-080-csp-drp-ny-1=43146 id-dgcasp-ob-471-csp-drp-ny-1=43148 ]
Part 21: [id-dgcasp-ob-762-csp-drp-ny-1=43196 id-dgcasp-ob-080-csp-drp-ny-1=43197 ]
Part 22: [id-dgcasp-ob-398-csp-drp-ny-1=43233 id-dgcasp-ob-762-csp-drp-ny-1=43234 ]
Part 23: [id-dgcasp-ob-398-csp-drp-ny-1=43127 id-dgcasp-ob-471-csp-drp-ny-1=43128 ]
Part 24: [id-dgcasp-ob-775-csp-drp-ny-1=43144 id-dgcasp-ob-398-csp-drp-ny-1=43142 ]  ... TRUNCATED


Thanks,
Abhishek




Re: Throttling getAll

2019-10-28 Thread Abhishek Gupta (BLOOMBERG/ 919 3RD A)
Thanks Ilya for your response.

Even if my value objects were not large, nothing stops clients from doing a 
getAll with say 100,000 keys. Having some kind of throttling would still be 
useful.

-Abhishek


- Original Message -
From: Ilya Kasnacheev 
To: ABHISHEK GUPTA
CC: user@ignite.apache.org
At: 28-Oct-2019 07:20:24


Hello!

Having very large objects is not a priority use case of Apache Ignite. Thus, it 
is your concern to make sure you don't run out of heap when doing operations on 
Ignite caches.

Regards,
--
Ilya Kasnacheev


Sat, 26 Oct 2019 at 18:51, Abhishek Gupta (BLOOMBERG/ 919 3RD A) wrote:

Hello,
  I've benchmarked my grid for users (clients) to do getAll with up to 100 
keys at a time. My value objects tend to be quite large, and my worry is that 
errant clients might at times do a getAll with a larger number of keys - say 
1000. If that happens, I worry about GC issues/humongous objects/OOM on the 
grid. Is there a way to configure the grid to auto-split these requests into 
smaller batches (a smaller number of keys per batch), or to reject them?


Thanks,
Abhishek




Throttling getAll

2019-10-26 Thread Abhishek Gupta (BLOOMBERG/ 919 3RD A)
Hello,
  I've benchmarked my grid for users (clients) to do getAll with upto 100 
keys at a time. My value objects tend to be quite large and my worry is if 
there are errant clients might at times do a getAll with a larger number of 
keys - say 1000. If that happens I worry about GC issues/humongous objects/OOM 
on the grid. Is there a way to configure the grid to auto-split these requests 
into smaller batches (smaller number of keys per batch) or rejecting them?   


Thanks,
Abhishek