[ https://issues.apache.org/jira/browse/IGNITE-10058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728744#comment-16728744 ]
Pavel Pereslegin commented on IGNITE-10058:
-------------------------------------------

In my last commit I prepared a draft solution that follows the suggestion by [~ilantukh] to assign partition states only on the coordinator, but these changes seem too complex (see PR).

To meet the requirement about zero update counters, I divided resetLostPartitions() into three steps:
1. When a non-coordinator node prepares to send its local partition states to the coordinator, it resets the counters if an owner is present (so that rebalancing starts right after this exchange, see {{ResetLostPartitionTest}}).
2. The coordinator assigns partition states (including its local ones) and resets local partition counters if necessary.
3. When a non-coordinator node receives the full message from the coordinator, it changes the state of its local partitions (LOST -> OWNING) if necessary.

I am running late with this task and will not be able to work on it for the next two weeks, so feel free to assign this ticket to yourself.

The main case described is fixed by calling {{checkRebalanceState}} after resetting lost partitions; this allows a new affinity to be set and the duplicate partitions to be evicted. I added the {{IgniteCachePartitionLossPolicySelfTest.testReadWriteSafeRefreshDelay}} test to reproduce this problem.

> resetLostPartitions() leaves an additional copy of a partition in the cluster
> -----------------------------------------------------------------------------
>
>          Key: IGNITE-10058
>          URL: https://issues.apache.org/jira/browse/IGNITE-10058
>      Project: Ignite
>   Issue Type: Bug
>     Reporter: Stanislav Lukyanov
>     Assignee: Pavel Pereslegin
>     Priority: Major
>      Fix For: 2.8
>
>
> If there are several copies of a LOST partition, resetLostPartitions() will leave all of them in the cluster as OWNING.
>
> Scenario:
> 1) Start 4 nodes, a cache with backups=0 and READ_WRITE_SAFE, fill the cache
> 2) Stop one node - some partitions are recreated on the remaining nodes as LOST
> 3) Start one node - the LOST partitions are rebalanced to the new node from the existing copies
> 4) Wait for rebalance to complete
> 5) Call resetLostPartitions()
>
> After that, the partitions that were LOST become OWNING on all nodes that had them. Eviction of these partitions doesn't start.
>
> Need to correctly evict the additional copies of LOST partitions either after rebalance on step 4 or after the resetLostPartitions() call on step 5.
>
> The current resetLostPartitions() implementation does call checkEvictions(), but the ready affinity assignment contains several nodes per partition for some reason.
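For illustration only, here is a minimal, self-contained sketch of the three-step flow described in the comment above. All class and method names in it ({{PartitionState}}, {{prepareSingleMessage}}, {{assignStates}}, {{applyFullMessage}}) are hypothetical stand-ins, not real Ignite internals; the actual changes live in the partition map exchange code (see the PR).

{code:java}
// Hypothetical, self-contained model of the three-step resetLostPartitions() flow.
// None of these types are real Ignite classes; they only mirror the steps above.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ResetLostPartitionsSketch {
    enum PartitionState { OWNING, LOST }

    static class LocalPartition {
        final int id;
        PartitionState state;
        long updateCounter;

        LocalPartition(int id, PartitionState state, long updateCounter) {
            this.id = id;
            this.state = state;
            this.updateCounter = updateCounter;
        }
    }

    /** Step 1 (non-coordinator): before sending local partition states to the coordinator,
     *  reset the counter of a LOST partition if some other node still owns it, so that
     *  rebalancing can start right after this exchange. */
    static void prepareSingleMessage(List<LocalPartition> locals, Map<Integer, Boolean> ownerExists) {
        for (LocalPartition p : locals) {
            if (p.state == PartitionState.LOST && ownerExists.getOrDefault(p.id, false))
                p.updateCounter = 0;
        }
    }

    /** Step 2 (coordinator): assign the final partition states (including its own local
     *  partitions) and reset local counters where needed. */
    static Map<Integer, PartitionState> assignStates(List<LocalPartition> locals, Map<Integer, Boolean> ownerExists) {
        Map<Integer, PartitionState> fullMap = new HashMap<>();

        for (LocalPartition p : locals) {
            if (p.state == PartitionState.LOST && ownerExists.getOrDefault(p.id, false)) {
                p.state = PartitionState.OWNING;
                p.updateCounter = 0;
            }

            fullMap.put(p.id, p.state);
        }

        return fullMap;
    }

    /** Step 3 (non-coordinator): on receiving the full message from the coordinator,
     *  move local LOST partitions to OWNING where the coordinator decided so. */
    static void applyFullMessage(List<LocalPartition> locals, Map<Integer, PartitionState> fullMap) {
        for (LocalPartition p : locals) {
            if (p.state == PartitionState.LOST && fullMap.get(p.id) == PartitionState.OWNING)
                p.state = PartitionState.OWNING;
        }
    }
}
{code}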
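The scenario from the quoted description can also be approximated with the public API alone. This is a rough sketch, not the actual reproducer ({{IgniteCachePartitionLossPolicySelfTest.testReadWriteSafeRefreshDelay}} is); the instance names, cache name, entry count and the sleep-based wait for rebalance are assumptions made for the example.

{code:java}
import java.util.Collections;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class LostPartitionCopyExample {
    public static void main(String[] args) throws Exception {
        // 1) Start 4 nodes and fill a cache with backups=0 and READ_WRITE_SAFE.
        Ignite[] nodes = new Ignite[4];
        for (int i = 0; i < 4; i++)
            nodes[i] = Ignition.start(new IgniteConfiguration().setIgniteInstanceName("node-" + i));

        CacheConfiguration<Integer, Integer> ccfg = new CacheConfiguration<Integer, Integer>("test")
            .setBackups(0)
            .setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);

        IgniteCache<Integer, Integer> cache = nodes[0].getOrCreateCache(ccfg);

        for (int i = 0; i < 10_000; i++)
            cache.put(i, i);

        // 2) Stop one node: partitions that lost their only copy are recreated as LOST.
        Ignition.stop("node-3", true);

        System.out.println("Lost partitions: " + cache.lostPartitions());

        // 3) Start a node again: the LOST partitions are rebalanced to it from the existing copies.
        Ignition.start(new IgniteConfiguration().setIgniteInstanceName("node-3"));

        // 4) Wait for rebalance to complete (crude stand-in for the test framework's
        //    awaitPartitionMapExchange()).
        Thread.sleep(10_000);

        // 5) Reset lost partitions. The reported behavior: the formerly LOST partitions
        //    become OWNING on every node that had a copy and the extra copies are not evicted.
        nodes[0].resetLostPartitions(Collections.singleton("test"));

        Ignition.stopAll(true);
    }
}
{code}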