Hello!
     I've got a 6-node Ignite 2.7.5 grid. I ran into a strange issue where 
multiple nodes hit the following exception:

[ERROR] [sys-stripe-53-#54] GridCacheIoManager - Failed to process message 
[senderId=f4a736b6-cfff-4548-a8b4-358d54d19ac6, messageType=class 
o.a.i.i.processors.cache.distributed.near.GridNearGetRequest]
org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtInvalidPartitionException:
 Adding entry to partition that is concurrently evicted [grp=mainCache, 
part=733, shouldBeMoving=, belongs=false, topVer=AffinityTopologyVersion 
[topVer=1978, minorTopVer=1], curTopVer=AffinityTopologyVersion [topVer=1978, 
minorTopVer=1]]

and then died after 
2020-01-27 13:30:19.849 [ERROR] [ttl-cleanup-worker-#159]  - JVM will be halted 
immediately due to the failure: [failureCtx=FailureContext 
[type=SYSTEM_WORKER_TERMINATION, err=class 
o.a.i.i.processors.cache.distributed.dht.topology.GridDhtInvalidPartitionException
 [part=1013, msg=Adding entry to partition that is concurrently evicted 
[grp=mainCache, part=1013, shouldBeMoving=, belongs=false, 
topVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1], 
curTopVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1]]]]]

The sequence of events was simply the following: one of the nodes (let's call 
it node 1) was down for 2.5 hours and then restarted. After a configured delay 
of 20 minutes, it started to rebalance from the other 5 nodes. No other nodes 
joined or left during this period. Forty minutes into the rebalance, the above 
errors started showing up on the other nodes and they all bounced, which 
resulted in data loss.
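For reference, the 20-minute rebalance delay mentioned above was set via the 
cache configuration along these lines (a sketch only; the cache name and exact 
values here are assumptions, and `rebalanceDelay` is the standard 
`CacheConfiguration` property for this in 2.7.x):

```xml
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <property name="name" value="mainCache"/>
    <!-- Delay rebalancing for 20 minutes (value in milliseconds)
         after a topology change before moving partitions. -->
    <property name="rebalanceDelay" value="1200000"/>
</bean>
```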

I found a few links related to this, but nothing that explained the root cause 
or what my workaround could be:

* http://apache-ignite-users.70518.x6.nabble.com/Adding-entry-to-partition-that-is-concurrently-evicted-td24782.html#a24786
* https://issues.apache.org/jira/browse/IGNITE-9803
* https://issues.apache.org/jira/browse/IGNITE-11620


Thanks,
Abhishek
