[ https://issues.apache.org/jira/browse/IGNITE-22904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Roman Puchkovskiy updated IGNITE-22904:
---------------------------------------
    Description:
If some node did not see the Metastorage repair, it will be migrated to the new cluster using the Migrate REST/CLI command. Such a node (judging by its local MG Raft log) might still think it's a member of the voting set, so it might propose itself as a candidate, and it can win the election if there are enough such nodes. This would let the 'old' majority hijack the leadership and mess up the repaired Metastorage. This has to be avoided.

To do so, the following should be done:
# In the CMG, add a property called mgRepairClusterId (empty in a blank cluster).
# When, during MG repair (IGNITE-22899), we choose new metastorageNodes and save them to the CMG (which happens before resetPeers() is called), we write mgRepairClusterId together with metastorageNodes to the CMG.
# We add a property called witnessedMgRepairClusterId to the Vault. This property stores the clusterId of the cluster incarnation in which the node witnessed MG repair (either it participated in the repair, or it was migrated and successfully performed the 'MG reentry' procedure, see below). This property is empty on a blank node.
# When a node handles MetastorageIndexTermRequestMessage, it writes the current clusterId to its Vault.witnessedMgRepairClusterId. As a result, every node participating in the MG repair is marked as a witness of the repair, and we do not need to do 'MG reentry' for them.
# On node start, before starting the MG, the Ignite node gets metastorageNodes and mgRepairClusterId from the CMG leader. If mgRepairClusterId is not null and Vault.witnessedMgRepairClusterId is absent or differs from mgRepairClusterId, then the node has to perform the 'MG reentry' procedure.
# The 'MG reentry' procedure is as follows:
## The node destroys all 3 Raft storages for the MG (meta, log and snapshot storage) as well as the Metastorage KV storage
## It writes the current clusterId to Vault.witnessedMgRepairClusterId
## It then starts the MG Raft server as usual

h2. Old (stale) description

If, during a join (on getting the fresh cluster state from the CMG), a node detects that, according to the MG configuration saved in the MG on this node, this node is a member of the voting set (i.e. it's a peer, not a learner), and this node is NOT one of the metastorageNodes in the CMG, then, before starting its MG Raft member, it raises a flag that disallows its Raft node becoming a candidate.
(This flag does not exist in JRaft, we need to introduce it there; the flag is not persisted.)
As soon as the Raft node applies a new Raft configuration (coming from the new leader), this flag is cleared.
After this, the Raft node is 'converted' to the new MG and cannot hijack the leadership.

  was:
If, during a join (on getting the fresh cluster state from the CMG), a node detects that, according to the MG configuration saved in the MG on this node, this node is the member of the voting set (i.e. it's a peer, not a learner), and this node is NOT one of the metastorageNodes in the CMG, then, before starting its MG Raft member, it raises a flag that disallows its Raft node becoming a candidate.
(This flag does not exist in JRaft, we need to introduce it there; the flag is not persisted).
As soon as the Raft node applies a new Raft configuration (coming from the new leader), this flag is cleared.
After this, the Raft node is 'converted' to the new MG and cannot hijack the leadership.
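For illustration only, the witness marking, the start-up check and the 'MG reentry' procedure from the new description might be sketched roughly as below. This is a minimal sketch, not the actual Ignite code: the class MgReentrySketch, the Vault and MgStorages stand-ins, the method names and the storage labels are all hypothetical, and representing clusterId as a UUID is an assumption. Only the property names mgRepairClusterId and witnessedMgRepairClusterId come from the description.

```java
import java.util.UUID;

/** Hypothetical sketch of the proposed check; none of these types are real Ignite classes. */
final class MgReentrySketch {
    /** Stand-in for the node-local Vault (assumption: a plain field instead of a persisted key). */
    static final class Vault {
        UUID witnessedMgRepairClusterId; // null on a blank node
    }

    /** Stand-in for whatever destroys an MG storage; the string labels are illustrative. */
    interface MgStorages {
        void destroy(String storageName);
    }

    /** Node handles MetastorageIndexTermRequestMessage, so it is a witness of the repair. */
    static void onIndexTermRequest(UUID currentClusterId, Vault vault) {
        vault.witnessedMgRepairClusterId = currentClusterId;
    }

    /** Reentry is needed if an MG repair happened and this node did not witness it. */
    static boolean needsReentry(UUID mgRepairClusterId, UUID witnessedMgRepairClusterId) {
        return mgRepairClusterId != null && !mgRepairClusterId.equals(witnessedMgRepairClusterId);
    }

    /**
     * Runs on node start, before the MG Raft server is started.
     * Returns true if the 'MG reentry' procedure was performed.
     */
    static boolean prepareMgStart(UUID currentClusterId, UUID mgRepairClusterId, Vault vault, MgStorages storages) {
        if (!needsReentry(mgRepairClusterId, vault.witnessedMgRepairClusterId)) {
            return false; // start the MG Raft server as usual
        }

        // Destroy the 3 Raft storages of the MG and the Metastorage KV storage.
        storages.destroy("raft-meta");
        storages.destroy("raft-log");
        storages.destroy("raft-snapshot");
        storages.destroy("metastorage-kv");

        // Mark the node as a witness, as the description states (current clusterId is written).
        vault.witnessedMgRepairClusterId = currentClusterId;

        return true; // the caller then starts the MG Raft server as usual
    }
}
```

In this sketch the storages are destroyed before the Vault is updated, so a crash in between would presumably just repeat the reentry on the next start; whether the real implementation needs the same ordering is not stated in the issue.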
> Disallow old MG majority to hijack leadership
> ---------------------------------------------
>
>                 Key: IGNITE-22904
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22904
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: iep-128, ignite-3
>          Time Spent: 40m
>  Remaining Estimate: 0h