[ https://issues.apache.org/jira/browse/IGNITE-12255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944589#comment-16944589 ]
Ignite TC Bot commented on IGNITE-12255:
----------------------------------------

{panel:title=Branch: [pull/6933/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=4658217&buildTypeId=IgniteTests24Java8_RunAll]

> Cache affinity fetching and calculation on client nodes may be broken in some cases
> -----------------------------------------------------------------------------------
>
>                 Key: IGNITE-12255
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12255
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.5, 2.7
>            Reporter: Pavel Kovalenko
>            Assignee: Pavel Kovalenko
>            Priority: Critical
>             Fix For: 2.8
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We have a cluster with server and client nodes.
> We dynamically start several caches on the cluster.
> Periodically we create and destroy a temporary cache in the cluster to advance the cluster topology version.
> At the same time, a random client node chooses a random existing cache and performs operations on it.
> This leads to an exception on the client node during a cache operation, stating that affinity is not initialized for the cache, e.g.:
> Affinity for topology version is not initialized [topVer = 8:10, head = 8:2]
> This exception means that the last affinity for the cache was calculated on version [8,2], which is the cache start version. It happens because, while creating/destroying a temporary cache, we don't re-calculate affinity on client nodes for caches that exist but have not yet been accessed. Re-calculation in this case is cheap: we just copy the affinity assignment and increment the topology version.
> As a solution, we need to fetch affinity on client node join for all caches.
> Also, we need to re-calculate affinity for all affinity holders (not only for started caches or only configured caches) for every topology event that happens in the cluster, on the client node.
> This solution exposed an existing race between client node join and a concurrent cache destroy.
> The race is the following:
> A client node (with some configured caches) joins the cluster, sending a SingleMessage to the coordinator during the client PME. This SingleMessage contains affinity fetch requests for all cluster caches. While the SingleMessage is in flight, the server nodes finish the client PME and also process and finish a cache destroy PME. When a cache is destroyed, its affinity is cleared.
> When the SingleMessage is delivered to the coordinator, the coordinator no longer has affinity for the requested cache because the cache has already been destroyed. This leads to an assertion error on the coordinator and unpredictable behavior on the client node.
> The race may be fixed with the following change:
> If the coordinator doesn't have affinity for a cache requested by the client node, it doesn't break the PME with an assertion error; it simply doesn't send affinity for that cache to the client node. When the client node receives the FullMessage and sees that affinity for some requested cache is missing, it closes the cache proxy for user interactions, so that every attempt to use that cache throws a CacheStopped exception. This is safe behavior because the cache destroy event should arrive on the client node soon and destroy that cache completely.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
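The "cheap" re-calculation described in the quoted issue (copy the last affinity assignment and register it under the new topology version) can be sketched as follows. This is a hypothetical, self-contained model, not the real Ignite internals: the class name, the `long` topology version, and the `List<List<String>>` assignment shape are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/**
 * Hypothetical sketch of a per-cache affinity history on a client node.
 * When a topology event does not change partition ownership, the last
 * assignment is simply reused under the new topology version, so later
 * lookups by that version succeed instead of failing with
 * "Affinity for topology version is not initialized".
 */
class AffinityHolderSketch {
    /** Topology version -> partition assignment (node ids per partition). */
    private final NavigableMap<Long, List<List<String>>> history = new TreeMap<>();

    /** Records the assignment calculated when the cache started. */
    void init(long topVer, List<List<String>> assignment) {
        history.put(topVer, assignment);
    }

    /** "Copy affinity assignment and increment topology version". */
    void clientEventTopologyChange(long newTopVer) {
        List<List<String>> last = history.lastEntry().getValue();
        // Reuse the previous assignment under the new version; no distribution recalculation.
        history.put(newTopVer, new ArrayList<>(last));
    }

    long lastVersion() {
        return history.lastKey();
    }

    /** Fails, as in the issue, if affinity was never brought up to the requested version. */
    List<List<String>> assignment(long topVer) {
        List<List<String>> a = history.get(topVer);
        if (a == null)
            throw new IllegalStateException("Affinity for topology version is not initialized"
                + " [topVer=" + topVer + ", head=" + history.lastKey() + "]");
        return a;
    }
}
```

Without the `clientEventTopologyChange` call for each topology event, a lookup at the latest version would throw, which mirrors the error reported in the issue.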
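The proposed fix for the join/destroy race can likewise be sketched in miniature. Again this is an illustrative model under assumed names, not the actual SingleMessage/FullMessage classes: affinity payloads are reduced to strings, and `buildFullMessage` / `proxiesToClose` are invented helpers standing in for the coordinator and client sides of the exchange.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch of the fix: the coordinator silently omits affinity
 * for caches destroyed while the client's request was in flight (instead of
 * failing an assertion), and the client closes the proxy of any requested
 * cache whose affinity is missing from the reply.
 */
class AffinityFetchSketch {
    /** Coordinator side: answer only for caches that still exist. */
    static Map<String, String> buildFullMessage(Set<String> requested,
                                                Map<String, String> liveAffinity) {
        Map<String, String> reply = new HashMap<>();
        for (String cache : requested) {
            String aff = liveAffinity.get(cache);
            // Previously: assert aff != null. Now: just skip destroyed caches.
            if (aff != null)
                reply.put(cache, aff);
        }
        return reply;
    }

    /** Client side: requested caches absent from the reply get their proxies closed. */
    static Set<String> proxiesToClose(Set<String> requested, Map<String, String> fullMessage) {
        Set<String> toClose = new HashSet<>(requested);
        toClose.removeAll(fullMessage.keySet());
        return toClose;
    }
}
```

A closed proxy would then throw on every user operation until the cache destroy event reaches the client and removes the cache entirely, which is the safe behavior the issue describes.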