Re: "Adding entry to partition that is concurrently evicted" error
Thanks Andrei. Looking at my exception (see below), it seems like it is related to https://issues.apache.org/jira/browse/IGNITE-11620, in that it occurred while expiration was going on.

1. As a workaround, would it be valid to increase my TTL to reduce the possibility of this occurring?
2. My worry about using NoOpFailureHandler is that the error would still have occurred, and it might have left the node in a bad state - which could be just as bad as, or worse than, killing the node.

If you can confirm that 1. is a valid line of defense (albeit not air-tight), that would be great.

Thanks,
Abhishek

P.S. My exception is below. Note that it occurs in 'expire()' - a stack trace similar to the one in IGNITE-11620.

[ERROR] ttl-cleanup-worker-#159 - Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.i.processors.cache.distributed.dht.topology.GridDhtInvalidPartitionException [part=1013, msg=Adding entry to partition that is concurrently evicted [grp=mainCache, part=1013, shouldBeMoving=, belongs=false, topVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1], curTopVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1]
org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtInvalidPartitionException: Adding entry to partition that is concurrently evicted [grp=mainCache, part=1013, shouldBeMoving=, belongs=false, topVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1], curTopVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1]]
    at org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.localPartition0(GridDhtPartitionTopologyImpl.java:950) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.localPartition(GridDhtPartitionTopologyImpl.java:825) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridCachePartitionedConcurrentMap.localPartition(GridCachePartitionedConcurrentMap.java:70) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridCachePartitionedConcurrentMap.putEntryIfObsoleteOrAbsent(GridCachePartitionedConcurrentMap.java:89) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.entryEx(GridCacheAdapter.java:1008) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheAdapter.entryEx(GridDhtCacheAdapter.java:544) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.entryEx(GridCacheAdapter.java:999) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.expireInternal(IgniteCacheOffheapManagerImpl.java:1403) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.expire(IgniteCacheOffheapManagerImpl.java:1347) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:207) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:139) [ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.5-0-2.jar:2.7.5]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]

From: user@ignite.apache.org At: 01/31/20 05:11:57
To: user@ignite.apache.org
Subject: Re: "Adding entry to partition that is concurrently evicted" error

Hi,

This problem should be solved in Ignite 2.8. I am not sure why the fix isn't part of ignite-2.7.6: https://issues.apache.org/jira/browse/IGNITE-11127

Your cluster was stopped because the failure handler kicked in: https://apacheignite.readme.io/docs/critical-failures-handling#section-failure-handling

I am not sure about possible workarounds here (you could probably set NoOpFailureHandler). You could also try starting a thread on the developer list: http://apache-ignite-developers.2346864.n4.nabble.com/Apache-Ignite-2-7-release-td34076i40.html

BR,
Andrei

On 1/29/2020 1:58 AM, Abhishek Gupta (BLOOMBERG/ 919 3RD A) wrote:
Hello! I've got a 6-node Ignite 2.7.5 grid. I had this strange issue where multiple nodes hit the following exception - [ERROR] [sys-stripe-53-#54] GridCacheIoManager - Failed to process message [senderI
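For reference, the two workarounds discussed above could be wired up roughly as follows. This is only a sketch, not a recommendation: the cache name "mainCache" comes from the logs in this thread, the 24-hour TTL is an arbitrary illustrative value, and NoOpFailureHandler carries exactly the risk described above (a critical error is logged but the node keeps running, possibly in a bad state).

```java
import java.util.concurrent.TimeUnit;

import javax.cache.expiry.CreatedExpiryPolicy;
import javax.cache.expiry.Duration;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.NoOpFailureHandler;

public class WorkaroundConfigSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Workaround 2: replace the default StopNodeOrHaltFailureHandler so a
        // critical failure no longer halts the JVM. Trade-off: the failed
        // ttl-cleanup worker is gone and the node may be left in a bad state.
        cfg.setFailureHandler(new NoOpFailureHandler());

        // Workaround 1: a longer TTL (24h here is an arbitrary example) makes
        // expiration less frequent, reducing (not eliminating) the window in
        // which expiry races with concurrent partition eviction.
        CacheConfiguration<String, byte[]> cacheCfg =
            new CacheConfiguration<>("mainCache");
        cacheCfg.setExpiryPolicyFactory(
            CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.HOURS, 24)));

        cfg.setCacheConfiguration(cacheCfg);

        Ignite ignite = Ignition.start(cfg);
    }
}
```

Neither option fixes the underlying race; upgrading to a version containing IGNITE-11127 is the real fix.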
"Adding entry to partition that is concurrently evicted" error
Hello! I've got a 6-node Ignite 2.7.5 grid. I had this strange issue where multiple nodes hit the following exception -

[ERROR] [sys-stripe-53-#54] GridCacheIoManager - Failed to process message [senderId=f4a736b6-cfff-4548-a8b4-358d54d19ac6, messageType=class o.a.i.i.processors.cache.distributed.near.GridNearGetRequest]
org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtInvalidPartitionException: Adding entry to partition that is concurrently evicted [grp=mainCache, part=733, shouldBeMoving=, belongs=false, topVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1], curTopVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1]]

and then died after

2020-01-27 13:30:19.849 [ERROR] [ttl-cleanup-worker-#159] - JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.i.processors.cache.distributed.dht.topology.GridDhtInvalidPartitionException [part=1013, msg=Adding entry to partition that is concurrently evicted [grp=mainCache, part=1013, shouldBeMoving=, belongs=false, topVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1], curTopVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1]

The sequence of events was simply the following - one of the nodes (let's call it node 1) was down for 2.5 hours and was then restarted. After a configured delay of 20 minutes, it started to rebalance from the other 5 nodes. No other nodes joined or left in this period. Forty minutes into the rebalance, the above errors started showing up on the other nodes and they simply bounced, and therefore there was data loss.

I found a few links related to this, but nothing that explained the root cause or what my workaround could be -
* http://apache-ignite-users.70518.x6.nabble.com/Adding-entry-to-partition-that-is-concurrently-evicted-td24782.html#a24786
* https://issues.apache.org/jira/browse/IGNITE-9803
* https://issues.apache.org/jira/browse/IGNITE-11620

Thanks,
Abhishek
Re: Throttling getAll
Ack. I've created a JIRA to track this: https://issues.apache.org/jira/browse/IGNITE-12334

From: user@ignite.apache.org At: 10/28/19 09:08:10
To: user@ignite.apache.org
Subject: Re: Throttling getAll

You might want to open a ticket. Of course, Ignite is open source and I'm sure the community would welcome a pull request.

Regards,
Stephen

On 28 Oct 2019, at 12:14, Abhishek Gupta (BLOOMBERG/ 919 3RD A) wrote:
Thanks Ilya for your response. Even if my value objects were not large, nothing stops clients from doing a getAll with, say, 100,000 keys. Having some kind of throttling would still be useful.
-Abhishek

- Original Message -
From: Ilya Kasnacheev
To: ABHISHEK GUPTA
CC: user@ignite.apache.org
At: 28-Oct-2019 07:20:24

Hello! Having very large objects is not a priority use case of Apache Ignite. Thus, it is your concern to make sure you don't run out of heap when doing operations on Ignite caches.

Regards,
-- Ilya Kasnacheev

On Sat, 26 Oct 2019 at 18:51, Abhishek Gupta (BLOOMBERG/ 919 3RD A) wrote:
Hello, I've benchmarked my grid for users (clients) to do getAll with up to 100 keys at a time. My value objects tend to be quite large, and my worry is that errant clients might at times do a getAll with a larger number of keys - say 1,000. If that happens, I worry about GC issues/humongous objects/OOM on the grid. Is there a way to configure the grid to auto-split these requests into smaller batches (a smaller number of keys per batch), or to reject them?

Thanks,
Abhishek
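Pending IGNITE-12334, the batching could also be done on the client side: split the key set into chunks before calling getAll, so no single request materializes more than a bounded number of values at once. The helper below is a sketch - the names getAllBatched and batchSize are made up for illustration, and it is written generically (taking the getAll operation as a function) so it can wrap an IgniteCache without depending on Ignite directly.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class BatchedGet {
    /**
     * Splits {@code keys} into chunks of at most {@code batchSize} keys,
     * invokes {@code getAll} once per chunk, and merges the results.
     */
    public static <K, V> Map<K, V> getAllBatched(
            Function<Set<K>, Map<K, V>> getAll, Set<K> keys, int batchSize) {
        Map<K, V> result = new HashMap<>();
        Set<K> batch = new LinkedHashSet<>();
        for (K key : keys) {
            batch.add(key);
            if (batch.size() == batchSize) {
                result.putAll(getAll.apply(batch));
                batch = new LinkedHashSet<>();
            }
        }
        if (!batch.isEmpty())
            result.putAll(getAll.apply(batch));
        return result;
    }
}
```

With an Ignite cache this would be invoked as, e.g., `getAllBatched(cache::getAll, keys, 100)`. Note this only bounds heap pressure per request on the servers; it does not stop a misbehaving client that bypasses the helper.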
Re: Intermittent "Partition states validation has failed for group" issues
Thanks Ilya. The thing is, I've seen these exceptions without any errors occurring before them. Also, I'm not using persistence, and I've seen this happen on multiple nodes at the same time. If I bounce multiple nodes, I would lose data (since I have only 1 backup). Anything else I could do?
-Abhishek

From: user@ignite.apache.org At: 10/28/19 12:47:23
Cc: user@ignite.apache.org
Subject: Re: Intermittent "Partition states validation has failed for group" issues

Hello! I think this means that the backup/primary contents are inconsistent. The implication is that in case of a node failure there will be data inconsistency (or maybe it's already there). The recommendation is to a) check the logs for any oddities/exceptions, and b) maybe remove the problematic partitions' files from persistence and/or restart the problematic nodes.

Regards,
-- Ilya Kasnacheev

On Mon, 21 Oct 2019 at 23:17, Abhishek Gupta (BLOOMBERG/ 731 LEX) wrote:
In my otherwise stably running grid (on 2.7.5), I sometimes see an intermittent GridDhtPartitionsExchangeFuture warning. The warning occurs periodically and then goes away after some time. I couldn't find any documentation or other threads about this warning and its implications.
* What is the trigger for this warning?
* What are the implications?
* Is there any recommendation around fixing this issue?

2019-10-21 16:09:44.378 [WARN ] [sys-#26240] GridDhtPartitionsExchangeFuture - Partition states validation has failed for group: mainCache.
Partitions cache sizes are inconsistent for
Part 0: [id-dgcasp-ob-398-csp-drp-ny-1=43417 id-dgcasp-ob-080-csp-drp-ny-1=43416 ]
Part 1: [id-dgcasp-ob-080-csp-drp-ny-1=43720 id-dgcasp-ob-471-csp-drp-ny-1=43724 ]
Part 2: [id-dgcasp-ob-762-csp-drp-ny-1=43388 id-dgcasp-ob-471-csp-drp-ny-1=43376 ]
Part 3: [id-dgcasp-ob-775-csp-drp-ny-1=43488 id-dgcasp-ob-403-csp-drp-ny-1=43484 ]
Part 4: [id-dgcasp-ob-080-csp-drp-ny-1=43338 id-dgcasp-ob-471-csp-drp-ny-1=43339 ]
Part 5: [id-dgcasp-ob-398-csp-drp-ny-1=43105 id-dgcasp-ob-471-csp-drp-ny-1=43106 ]
Part 7: [id-dgcasp-ob-775-csp-drp-ny-1=43151 id-dgcasp-ob-762-csp-drp-ny-1=43157 ]
Part 8: [id-dgcasp-ob-398-csp-drp-ny-1=42975 id-dgcasp-ob-471-csp-drp-ny-1=42976 ]
Part 10: [id-dgcasp-ob-775-csp-drp-ny-1=43033 id-dgcasp-ob-471-csp-drp-ny-1=43036 ]
Part 11: [id-dgcasp-ob-762-csp-drp-ny-1=43303 id-dgcasp-ob-471-csp-drp-ny-1=43299 ]
Part 12: [id-dgcasp-ob-398-csp-drp-ny-1=43262 id-dgcasp-ob-471-csp-drp-ny-1=43265 ]
Part 13: [id-dgcasp-ob-762-csp-drp-ny-1=43123 id-dgcasp-ob-471-csp-drp-ny-1=43120 ]
Part 15: [id-dgcasp-ob-775-csp-drp-ny-1=43412 id-dgcasp-ob-398-csp-drp-ny-1=43413 ]
Part 16: [id-dgcasp-ob-471-csp-drp-ny-1=43934 id-dgcasp-ob-403-csp-drp-ny-1=43933 ]
Part 20: [id-dgcasp-ob-080-csp-drp-ny-1=43146 id-dgcasp-ob-471-csp-drp-ny-1=43148 ]
Part 21: [id-dgcasp-ob-762-csp-drp-ny-1=43196 id-dgcasp-ob-080-csp-drp-ny-1=43197 ]
Part 22: [id-dgcasp-ob-398-csp-drp-ny-1=43233 id-dgcasp-ob-762-csp-drp-ny-1=43234 ]
Part 23: [id-dgcasp-ob-398-csp-drp-ny-1=43127 id-dgcasp-ob-471-csp-drp-ny-1=43128 ]
Part 24: [id-dgcasp-ob-775-csp-drp-ny-1=43144 id-dgcasp-ob-398-csp-drp-ny-1=43142 ]
... TRUNCATED

Thanks,
Abhishek
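For what it's worth, Ignite ships a consistency checker that can confirm whether primary and backup partition copies have actually diverged (rather than the warning being transient). A sketch, assuming a standard Ignite 2.7 installation with IGNITE_HOME set; the cluster should be idle (no updates in flight) while the check runs, otherwise false positives are possible:

```shell
# Check partition hashes/counters across all caches on an idle cluster.
$IGNITE_HOME/bin/control.sh --cache idle_verify

# Optionally restrict the check to the affected cache group from the
# warning above.
$IGNITE_HOME/bin/control.sh --cache idle_verify mainCache
```

If idle_verify reports conflicts and only one backup is configured, restarting the affected nodes one at a time (letting rebalance complete in between) avoids the data loss that bouncing several nodes at once would cause.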
Re: Throttling getAll
Thanks Ilya for your response. Even if my value objects were not large, nothing stops clients from doing a getAll with, say, 100,000 keys. Having some kind of throttling would still be useful.
-Abhishek

- Original Message -
From: Ilya Kasnacheev
To: ABHISHEK GUPTA
CC: user@ignite.apache.org
At: 28-Oct-2019 07:20:24

Hello! Having very large objects is not a priority use case of Apache Ignite. Thus, it is your concern to make sure you don't run out of heap when doing operations on Ignite caches.

Regards,
-- Ilya Kasnacheev

On Sat, 26 Oct 2019 at 18:51, Abhishek Gupta (BLOOMBERG/ 919 3RD A) wrote:
Hello, I've benchmarked my grid for users (clients) to do getAll with up to 100 keys at a time. My value objects tend to be quite large, and my worry is that errant clients might at times do a getAll with a larger number of keys - say 1,000. If that happens, I worry about GC issues/humongous objects/OOM on the grid. Is there a way to configure the grid to auto-split these requests into smaller batches (a smaller number of keys per batch), or to reject them?

Thanks,
Abhishek
Throttling getAll
Hello, I've benchmarked my grid for users (clients) to do getAll with up to 100 keys at a time. My value objects tend to be quite large, and my worry is that errant clients might at times do a getAll with a larger number of keys - say 1,000. If that happens, I worry about GC issues/humongous objects/OOM on the grid. Is there a way to configure the grid to auto-split these requests into smaller batches (a smaller number of keys per batch), or to reject them?

Thanks,
Abhishek