[ https://issues.apache.org/jira/browse/IGNITE-21194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806109#comment-17806109 ]
Kirill Gusakov commented on IGNITE-21194:
-----------------------------------------

Some more details:
* Two changes of the stable assignments to the same value are indeed produced.
* As mentioned, the second one is produced by RebalanceRaftGroupEventsListener.onLeaderElected.
* The tricky point is that onLeaderElected will not start a new rebalance if the previous leader wrote the stable+pending pair before dying, because it checks that the pending assignments are not null and starts a new rebalance only in that case.

So it looks like the only sequence in which the described behaviour is possible is:
* leader1 finishes the rebalance, fires the metastore write call (cleaning the pendings and updating the stable) and dies immediately because the node stop process starts, but the metastore call is still in-flight;
* the new leader2 is elected and sees that the pendings are not empty, so it starts the rebalance from pending to stable; the raft group is already on this configuration, so the rebalance completes immediately and leader2 wants to push the stable update;
* at this moment the in-flight stable+pendings metastore update is applied (the first notification about the stable update is triggered);
* leader2 pushes its own pendings+stable update (the second notification about the stable update is triggered).

> StorageException in ItIgniteNodeRestartTest#destroyObsoleteStoragesOnRestart
> ----------------------------------------------------------------------------
>
>                 Key: IGNITE-21194
>                 URL: https://issues.apache.org/jira/browse/IGNITE-21194
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>         Attachments: full.log
>
>
> The test passes successfully, but there are exceptions in the logs.
> The scenario of this test includes altering the distribution zone. But the
> subsequent notification about stable assignments at the end of the rebalance
> happens twice on the same node, with the same assignments. As a result, the
> redundant partitions are stopped and their storages are deleted on the first
> event handling, and they are not found on the second one, which causes the
> exceptions.
> It seems that the second stable assignments change is triggered by the
> rebalance raft configuration listener
> (RebalanceRaftGroupEventsListener#doOnNewPeersConfigurationApplied), which is
> invoked on the configuration change caused by the new leader election:
> {code:java}
> [2024-01-05T19:18:36,891][INFO ][%iinrt_dosor_1%rebalance-scheduler-0][RebalanceRaftGroupEventsListener] New leader elected. Going to apply new configuration [tablePartitionId=6_part_0, peers=[iinrt_dosor_1], learners=[]]{code}
> We should probably check that the new set of peers differs from the current
> one before making rebalance-related updates to the meta storage (a minimal
> sketch of such a guard follows below).
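To make the fix proposed in the last paragraph concrete, here is a minimal, self-contained sketch of the peer-set guard. The class name, the stand-in listener method and the stableAssignments field are hypothetical illustrations, not the real RebalanceRaftGroupEventsListener or meta storage API; a real fix would also have to account for learners and for concurrent meta storage revisions, which this sketch ignores.

{code:java}
import java.util.Set;

/** Hypothetical stand-in for the rebalance listener; not the actual Ignite 3 API. */
public class RebalanceGuardSketch {

    /** Stand-in for the stable assignments currently recorded in the meta storage. */
    private static Set<String> stableAssignments = Set.of("old_node");

    /**
     * Invoked when a new raft configuration is applied (e.g. after a leader election).
     * The guard skips the meta storage update when the applied peer set is identical
     * to the recorded stable assignments, so the redundant second "stable assignments
     * changed" notification from the scenario above is never produced.
     */
    static void onNewPeersConfigurationApplied(Set<String> appliedPeers) {
        if (appliedPeers.equals(stableAssignments)) {
            System.out.println("Peers unchanged, skipping stable update: " + appliedPeers);
            return;
        }

        System.out.println("Writing new stable assignments: " + appliedPeers);
        stableAssignments = appliedPeers;
    }

    public static void main(String[] args) {
        // First application: the rebalance moves the partition to iinrt_dosor_1.
        onNewPeersConfigurationApplied(Set.of("iinrt_dosor_1"));

        // Second application, after the new leader election: same peers, so the
        // meta storage write (and the second notification) is skipped.
        onNewPeersConfigurationApplied(Set.of("iinrt_dosor_1"));
    }
}
{code}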