Vyacheslav Koptilin created IGNITE-16589:
--------------------------------------------

             Summary: Failure handler kills server node on getting affinity 
from old topology
                 Key: IGNITE-16589
                 URL: https://issues.apache.org/jira/browse/IGNITE-16589
             Project: Ignite
          Issue Type: Bug
            Reporter: Vyacheslav Koptilin
            Assignee: Vyacheslav Koptilin
             Fix For: 2.13


In general, the following exception seems to be a bit overkill
{code:java}
[2022-02-21 
10:34:53,347][ERROR][aff-#300%cache.CacheNoAffinityExchangeTest0%][IgniteTestResources]
 Critical system error detected. Will be handled accordingly to configured 
handler [hnd=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]], 
failureCtx=FailureContext [type=CRITICAL_ERROR, 
err=java.lang.IllegalStateException: Getting affinity for too old topology 
version that is already out of history [locNode=TcpDiscoveryNode 
[id=0917cb9d-2825-46eb-b210-1e2846f00000, consistentId=127.0.0.1:47500, 
addrs=ArrayList [127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500], 
discPort=47500, order=1, intOrder=1, lastExchangeTime=1645428893228, loc=true, 
ver=2.13.0#20220218-sha1:7e63c212, isClient=false], grp=client-cache, 
topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], 
lastAffChangeTopVer=AffinityTopologyVersion [topVer=1, minorTopVer=2], 
head=AffinityTopologyVersion [topVer=8, minorTopVer=0], 
history=[AffinityTopologyVersion [topVer=7, minorTopVer=0], 
AffinityTopologyVersion [topVer=8, minorTopVer=0]]]]]
java.lang.IllegalStateException: Getting affinity for too old topology version 
that is already out of history [locNode=TcpDiscoveryNode 
[id=0917cb9d-2825-46eb-b210-1e2846f00000, consistentId=127.0.0.1:47500, 
addrs=ArrayList [127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500], 
discPort=47500, order=1, intOrder=1, lastExchangeTime=1645428893228, loc=true, 
ver=2.13.0#20220218-sha1:7e63c212, isClient=false], grp=client-cache, 
topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], 
lastAffChangeTopVer=AffinityTopologyVersion [topVer=1, minorTopVer=2], 
head=AffinityTopologyVersion [topVer=8, minorTopVer=0], 
history=[AffinityTopologyVersion [topVer=7, minorTopVer=0], 
AffinityTopologyVersion [topVer=8, minorTopVer=0]]]
        at 
org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:849)
        at 
org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:796)
        at 
org.apache.ignite.internal.processors.cache.CacheGroupContext.processAffinityAssignmentRequest0(CacheGroupContext.java:1130)
        at 
org.apache.ignite.internal.processors.cache.CacheGroupContext.processAffinityAssignmentRequest(CacheGroupContext.java:1116)
        at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1151)
        at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:592)
        at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:393)
        at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:319)
        at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:110)
        at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:309)
        at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1907)
        at 
org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1528)
        at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$5300(GridIoManager.java:242)
        at 
org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1421)
        at 
org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}

it seems that much more suitable way to handle this should be to kill/restart 
the client node or return an error for the client to handle or to attempt 
remapping the operation on the new topology. For now, it is not clear why 
exactly the node must be shut down after this exception and documentation 
doesn't explain how to avoid this.

The root cause of the issue is that AffinityRequest from the client node refers 
to the "old" topology version, which has been already wiped on the server-side.
The possible scenario is the following:
 - the client wants to get a proxy for a cache that is already started on 
server nodes
 - client started custom exchange task (see 
GridCacheProcessor.processCustomExchangeTask)
 - before sending AffinityRequest the client node hanged due to a long GC 
pause, for example
 - cluster topology changed multiple times during that GC pause (the number of 
changes is enough to clean up the old history of affinity assignments)
 - server node received AffinityRequest from the client and, unfortunately, 
cannot process it in the right way because of a lack of history.

IMHO we can respond to the client with an "empty" AffinityResponse which should 
provide a cause of the problem.
In that case, the client node may try to reconnect to the cluster (if PME is in 
progress), or retry an operation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to