Pavel Kovalenko created IGNITE-12255:
----------------------------------------

             Summary: Cache affinity fetching and calculation on client nodes 
may be broken in some cases
                 Key: IGNITE-12255
                 URL: https://issues.apache.org/jira/browse/IGNITE-12255
             Project: Ignite
          Issue Type: Bug
          Components: cache
    Affects Versions: 2.7, 2.5
            Reporter: Pavel Kovalenko
            Assignee: Pavel Kovalenko
             Fix For: 2.8


We have a cluster with server and client nodes.
We dynamically start several caches on the cluster.
Periodically we create and destroy a temporary cache to advance the 
cluster topology version.
At the same time, a random client node picks a random existing cache and 
performs operations on it.
This leads to an exception on the client node saying that affinity is not 
initialized for the cache during a cache operation, for example:
Affinity for topology version is not initialized [topVer = 8:10, head = 8:2]

This exception means that the latest affinity for the cache was calculated at 
version [8,2], which is the cache start version. It happens because, when a 
temporary cache is created or destroyed, we don’t recalculate affinity on 
client nodes for caches that exist but have not been accessed yet. 
Recalculation in this case is cheap: we just copy the affinity assignment and 
increment the topology version.
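
The cheap recalculation described above can be illustrated with a small 
standalone model. This is only a sketch under stated assumptions: 
AffinityHistorySketch, TopVer and their methods are hypothetical names invented 
for illustration and are not Ignite classes. The model keeps a per-cache history 
of assignments keyed by topology version and fails when a version newer than 
the head is requested.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a per-cache affinity history; not Ignite's real classes.
final class AffinityHistorySketch {
    /** Topology version as (major, minor), printed like 8:2. */
    static final class TopVer implements Comparable<TopVer> {
        final int major;
        final int minor;

        TopVer(int major, int minor) {
            this.major = major;
            this.minor = minor;
        }

        @Override public int compareTo(TopVer o) {
            return major != o.major
                ? Integer.compare(major, o.major)
                : Integer.compare(minor, o.minor);
        }

        @Override public String toString() {
            return major + ":" + minor;
        }
    }

    /** Assignment per version: partition index -> list of owner node ids. */
    private final TreeMap<TopVer, List<List<String>>> history = new TreeMap<>();

    /** Stores a freshly calculated (or fetched) assignment, e.g. at cache start. */
    void initialize(TopVer ver, List<List<String>> assignment) {
        history.put(ver, assignment);
    }

    /** Cheap path: reuse the latest assignment under the new topology version. */
    void copyToNewVersion(TopVer newVer) {
        if (history.isEmpty())
            throw new IllegalStateException("Affinity was never initialized");

        history.put(newVer, history.lastEntry().getValue());
    }

    /** Returns the assignment for a version, failing if it is ahead of the head. */
    List<List<String>> assignment(TopVer ver) {
        TopVer head = history.isEmpty() ? null : history.lastKey();

        if (head == null || ver.compareTo(head) > 0)
            throw new IllegalStateException(
                "Affinity for topology version is not initialized [topVer = "
                    + ver + ", head = " + head + "]");

        Map.Entry<TopVer, List<List<String>>> e = history.floorEntry(ver);

        if (e == null)
            throw new IllegalStateException("No affinity at or before " + ver);

        return e.getValue();
    }
}
```

In this model, calling copyToNewVersion for every topology event keeps the head 
in step with the cluster, so a later lookup at the current version succeeds 
even for a cache the client has not touched yet.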

As a solution, we need to fetch affinity for all caches when a client node 
joins. We also need to recalculate affinity on the client node for all affinity 
holders (not only for started or configured caches) on every topology event 
that happens in the cluster.

This solution exposed an existing race between client node join and concurrent 
cache destroy.

The race is the following:

A client node (with some configured caches) joins the cluster and sends a 
SingleMessage to the coordinator during client PME. This SingleMessage contains 
affinity fetch requests for all cluster caches. While the SingleMessage is in 
flight, server nodes finish the client PME and also process and finish the 
cache destroy PME. When a cache is destroyed, affinity for that cache is 
cleared. When the SingleMessage is delivered, the coordinator no longer has 
affinity for the requested cache because the cache is already destroyed. This 
leads to an assertion error on the coordinator and unpredictable behavior on 
the client node.

The race may be fixed with the following change:

If the coordinator doesn’t have affinity for a cache requested by the client 
node, it doesn’t break PME with an assertion error; it simply doesn’t send 
affinity for that cache to the client node. When the client node receives the 
FullMessage and sees that affinity for a requested cache is missing, it closes 
the cache proxy for user interactions, so every attempt to use that cache 
throws a CacheStopped exception. This is safe because the cache destroy event 
should reach the client node soon and destroy that cache completely.
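
The proposed handling can be sketched as two small functions. This is a 
minimal illustration, not the actual patch: AffinityFetchSketch, buildResponse 
and cachesToClose are hypothetical names, and the maps stand in for the 
affinity payload carried by SingleMessage and FullMessage.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the proposed fix; not Ignite's real exchange code.
final class AffinityFetchSketch {
    /**
     * Coordinator side: answer only for caches whose affinity still exists.
     * A cache destroyed while the request was in flight is skipped instead of
     * triggering an assertion.
     */
    static Map<String, List<String>> buildResponse(
        Set<String> requested,
        Map<String, List<String>> coordinatorAffinity
    ) {
        Map<String, List<String>> res = new HashMap<>();

        for (String cache : requested) {
            List<String> aff = coordinatorAffinity.get(cache);

            if (aff != null)
                res.put(cache, aff);
        }

        return res;
    }

    /**
     * Client side: caches that were requested but are absent from the response.
     * Their proxies would be closed until the destroy event arrives.
     */
    static Set<String> cachesToClose(
        Set<String> requested,
        Map<String, List<String>> response
    ) {
        Set<String> missing = new HashSet<>(requested);

        missing.removeAll(response.keySet());

        return missing;
    }
}
```

The client treats a missing entry as "destroyed concurrently" rather than as 
an error, which matches the reasoning above that the destroy event will follow 
shortly.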



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
