[ https://issues.apache.org/jira/browse/IGNITE-17507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vyacheslav Koptilin updated IGNITE-17507:
-----------------------------------------
    Ignite Flags: Release Notes Required

> Failed to wait for partition map exchange on some clients
> ---------------------------------------------------------
>
>                 Key: IGNITE-17507
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17507
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vyacheslav Koptilin
>            Assignee: Vyacheslav Koptilin
>            Priority: Major
>             Fix For: 2.14
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We have a scenario with several client and server nodes in which some
> client nodes can get stuck on PME after start:
> * Start some server nodes
> * Trigger rebalance
> * Start some client and server nodes
> * Some of the client nodes get stuck with _Failed to wait for partition map
> exchange [topVer=AffinityTopologyVersion…_
>
> A detailed investigation of the logs showed that the root cause of the stuck
> PME on the client is a race between a new client node joining and that client
> receiving a stale _CacheAffinityChangeMessage_. The message triggers a PME on
> the new client, but when the other, older nodes receive the same
> _CacheAffinityChangeMessage_, they skip it because of an optimization.
>
> The optimization can be found in the method
> _CacheAffinitySharedManager#onDiscoveryEvent_, where we save
> _lastAffVer = topVer_ for the old nodes. Because of the race, _lastAffVer_ on
> the problem client node is still null when we reach
> _CacheAffinitySharedManager#onCustomEvent_, so the client schedules an invalid
> PME in _msg.exchangeNeeded(exchangeNeeded)_, while the other nodes skip this
> PME.
>
> A possible fix is to make the _CacheAffinityChangeMessage_ mutable (a mutable
> discovery custom message). This allows the message to be modified before it
> is sent across the ring. With this approach, client nodes do not have to
> decide whether to apply or skip the message: the required flag is transferred
> from a server node.
>
> With Zookeeper Discovery, there is no ability to mutate discovery messages.
> However, it is possible to mutate the message on the coordinator node (this
> requires adding back the _stopProcess_ flag in _DiscoveryCustomMessage_, which
> was removed by IGNITE-12400). This is sufficient for our case.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
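
The race and the proposed fix described in the issue can be illustrated with a small stand-alone model. This is only a sketch: the class AffinityChangeSketch, its nested AffinityChangeMsg type, and the methods decideLocally, decide, stampOnCoordinator and deliver are hypothetical names invented for the illustration and are not Ignite classes or APIs. The local rule (exchange needed when lastAffVer is null or equals the message version) is an assumption about the check behind _msg.exchangeNeeded(exchangeNeeded)_.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified model of the race; not Ignite code. */
public class AffinityChangeSketch {
    /** Stand-in for a mutable discovery custom message. */
    public static final class AffinityChangeMsg {
        public final long topVer;
        /** Decision stamped by the coordinator; null means "not stamped". */
        public Boolean exchangeNeeded;

        public AffinityChangeMsg(long topVer) {
            this.topVer = topVer;
        }
    }

    /** Assumed per-node rule: a null lastAffVer (fresh client) always asks for an exchange. */
    public static boolean decideLocally(Long lastAffVer, long msgTopVer) {
        return lastAffVer == null || lastAffVer.longValue() == msgTopVer;
    }

    /** With the fix, a node trusts the stamped flag and only falls back to its local rule. */
    public static boolean decide(Long lastAffVer, AffinityChangeMsg msg) {
        return msg.exchangeNeeded != null
            ? msg.exchangeNeeded.booleanValue()
            : decideLocally(lastAffVer, msg.topVer);
    }

    /** The coordinator mutates the message once, before it travels across the ring. */
    public static void stampOnCoordinator(AffinityChangeMsg msg, Long coordinatorLastAffVer) {
        msg.exchangeNeeded = decideLocally(coordinatorLastAffVer, msg.topVer);
    }

    /** Returns each node's decision, in the order of the given lastAffVer values. */
    public static List<Boolean> deliver(AffinityChangeMsg msg, List<Long> lastAffVers) {
        List<Boolean> res = new ArrayList<>();
        for (Long lastAffVer : lastAffVers)
            res.add(decide(lastAffVer, msg));
        return res;
    }

    public static void main(String[] args) {
        // Two old nodes already at affinity version 6, plus one freshly joined client (null).
        List<Long> nodes = Arrays.asList(6L, 6L, null);
        AffinityChangeMsg stale = new AffinityChangeMsg(5L);

        // Without the fix: old nodes skip the stale message, the new client schedules a PME.
        System.out.println("local decisions:   " + deliver(stale, nodes));

        // With the fix: the coordinator's decision is carried inside the message.
        stampOnCoordinator(stale, 6L);
        System.out.println("stamped decisions: " + deliver(stale, nodes));
    }
}
```

In this model the unstamped message yields different decisions on the old nodes and on the new client, which corresponds to the client waiting for an exchange that nobody else starts. Once the coordinator stamps the flag, every node reaches the same decision regardless of its own lastAffVer.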