Vyacheslav Koptilin created IGNITE-16589: --------------------------------------------
Summary: Failure handler kills server node on getting affinity from old topology Key: IGNITE-16589 URL: https://issues.apache.org/jira/browse/IGNITE-16589 Project: Ignite Issue Type: Bug Reporter: Vyacheslav Koptilin Assignee: Vyacheslav Koptilin Fix For: 2.13 In general, the following exception seems to be a bit overkill {code:java} [2022-02-21 10:34:53,347][ERROR][aff-#300%cache.CacheNoAffinityExchangeTest0%][IgniteTestResources] Critical system error detected. Will be handled accordingly to configured handler [hnd=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=java.lang.IllegalStateException: Getting affinity for too old topology version that is already out of history [locNode=TcpDiscoveryNode [id=0917cb9d-2825-46eb-b210-1e2846f00000, consistentId=127.0.0.1:47500, addrs=ArrayList [127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1645428893228, loc=true, ver=2.13.0#20220218-sha1:7e63c212, isClient=false], grp=client-cache, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], lastAffChangeTopVer=AffinityTopologyVersion [topVer=1, minorTopVer=2], head=AffinityTopologyVersion [topVer=8, minorTopVer=0], history=[AffinityTopologyVersion [topVer=7, minorTopVer=0], AffinityTopologyVersion [topVer=8, minorTopVer=0]]]]] java.lang.IllegalStateException: Getting affinity for too old topology version that is already out of history [locNode=TcpDiscoveryNode [id=0917cb9d-2825-46eb-b210-1e2846f00000, consistentId=127.0.0.1:47500, addrs=ArrayList [127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1645428893228, loc=true, ver=2.13.0#20220218-sha1:7e63c212, isClient=false], grp=client-cache, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], lastAffChangeTopVer=AffinityTopologyVersion [topVer=1, minorTopVer=2], head=AffinityTopologyVersion [topVer=8, minorTopVer=0], history=[AffinityTopologyVersion [topVer=7, minorTopVer=0], AffinityTopologyVersion [topVer=8, minorTopVer=0]]] at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:849) at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:796) at org.apache.ignite.internal.processors.cache.CacheGroupContext.processAffinityAssignmentRequest0(CacheGroupContext.java:1130) at org.apache.ignite.internal.processors.cache.CacheGroupContext.processAffinityAssignmentRequest(CacheGroupContext.java:1116) at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1151) at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:592) at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:393) at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:319) at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:110) at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:309) at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1907) at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1528) at org.apache.ignite.internal.managers.communication.GridIoManager.access$5300(GridIoManager.java:242) at org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1421) at org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} it seems that much more suitable way to handle this should be to kill/restart the client node or return an error for the client to handle or to attempt remapping the operation on the new topology. For now, it is not clear why exactly the node must be shut down after this exception and documentation doesn't explain how to avoid this. The root cause of the issue is that AffinityRequest from the client node refers to the "old" topology version, which has been already wiped on the server-side. The possible scenario is the following: - the client wants to get a proxy for a cache that is already started on server nodes - client started custom exchange task (see GridCacheProcessor.processCustomExchangeTask) - before sending AffinityRequest the client node hanged due to a long GC pause, for example - cluster topology changed multiple times during that GC pause (the number of changes is enough to clean up the old history of affinity assignments) - server node received AffinityRequest from the client and, unfortunately, cannot process it in the right way because of a lack of history. IMHO we can respond to the client with an "empty" AffinityResponse which should provide a cause of the problem. In that case, the client node may try to reconnect to the cluster (if PME is in progress), or retry an operation. -- This message was sent by Atlassian Jira (v8.20.1#820001)