[ 
https://issues.apache.org/jira/browse/IGNITE-22904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Puchkovskiy updated IGNITE-22904:
---------------------------------------
    Description: 
If a node did not witness the Metastorage repair, it will be migrated to the 
new cluster using the Migrate REST/CLI command. Such a node (judging by its 
local MG Raft log) might still think it is a member of the voting set, so it 
might propose itself as a candidate, and it can win the election if there are 
enough such nodes. This would let the 'old' majority hijack the leadership and 
corrupt the repaired Metastorage. This has to be avoided.
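
To make the risk concrete, here is a minimal sketch of the majority 
arithmetic. It assumes the usual Raft rule that a candidate needs votes from a 
strict majority of the voting set it knows about, and it ignores the log 
up-to-date check. The class, the method and the node names are hypothetical 
and are not part of Ignite or JRaft.

```java
import java.util.Set;

/** Illustrative only: shows when stale nodes can outvote the repaired MG. */
public class StaleMajoritySketch {
    /**
     * Returns true if the stale nodes that are online form a strict majority
     * of the voting set recorded in their (pre-repair) local Raft log.
     */
    static boolean canWinElection(Set<String> staleVotingSet, Set<String> staleNodesOnline) {
        long votes = staleNodesOnline.stream()
                .filter(staleVotingSet::contains)
                .count();
        return votes > staleVotingSet.size() / 2;
    }
}
```

For example, if the pre-repair voting set was {A, B, C} and both A and B are 
migrated without having seen the repair, they hold 2 of 3 votes in their own 
view of the configuration, which is enough to elect one of them.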

To do so, the following should be done:
 # In the CMG, add a property called mgRepairClusterId (empty in a blank 
cluster)
 # When, during MG repair (IGNITE-22899), we choose new metastorageNodes and 
save them to the CMG (which happens before resetPeers() is called), we write 
mgRepairClusterId together with metastorageNodes to the CMG
 # We add a property called witnessedMgRepairClusterId to the Vault. This 
property will store the clusterId for the incarnation of the cluster in which 
the node witnessed MG repair (either it participated in the repair, or it was 
migrated and successfully performed the 'MG reentry' procedure, see below). 
This property is empty on a blank node
 # When a node handles MetastorageIndexTermRequestMessage, it writes the 
current clusterId to its Vault.witnessedMgRepairClusterId. As a result, every 
node participating in the MG repair will be marked as a witness of the repair 
and we will not need to do 'MG reentry' for them
 # On node start, before starting the MG, the Ignite node gets 
metastorageNodes and mgRepairClusterId from the CMG leader. If 
mgRepairClusterId is not null and Vault.witnessedMgRepairClusterId is absent 
or differs from mgRepairClusterId, then the node has to perform the 'MG 
reentry' procedure.
 # The 'MG reentry' procedure is as follows:
 ## The node destroys all 3 Raft storages for the MG (meta, log and snapshot 
storage) as well as the Metastorage KV storage
 ## The node writes the current clusterId to Vault.witnessedMgRepairClusterId
 ## The node then starts the MG Raft server as usual
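
The start-up check and the 'MG reentry' procedure above can be sketched 
roughly as follows. This is only an illustration of the decision logic under 
stated assumptions: Vault, LocalMgStorages, needsReentry and onNodeStart are 
hypothetical stand-ins, the storages are modeled as plain flags, and cluster 
IDs are assumed to be UUIDs. None of this is the actual Ignite code.

```java
import java.util.UUID;

/** Hypothetical sketch of the node-start check; not Ignite code. */
public class MgReentrySketch {
    /** Stand-in for the node-local Vault; only one property is modeled. */
    static final class Vault {
        UUID witnessedMgRepairClusterId; // null on a blank node
    }

    /** Stand-in for the 3 MG Raft storages plus the Metastorage KV storage. */
    static final class LocalMgStorages {
        boolean raftMeta = true, raftLog = true, raftSnapshots = true, kvStorage = true;

        void destroyAll() {
            raftMeta = raftLog = raftSnapshots = kvStorage = false;
        }

        boolean isEmpty() {
            return !raftMeta && !raftLog && !raftSnapshots && !kvStorage;
        }
    }

    /** Reentry is needed if a repair was recorded in the CMG and this node did not witness it. */
    static boolean needsReentry(UUID mgRepairClusterId, Vault vault) {
        return mgRepairClusterId != null
                && !mgRepairClusterId.equals(vault.witnessedMgRepairClusterId);
    }

    /**
     * Runs before the MG is started. mgRepairClusterId is the value obtained
     * from the CMG leader. Returns true if the reentry procedure was performed.
     */
    static boolean onNodeStart(UUID currentClusterId, UUID mgRepairClusterId,
            Vault vault, LocalMgStorages storages) {
        boolean reentry = needsReentry(mgRepairClusterId, vault);

        if (reentry) {
            // Drop the stale Raft state so the node cannot act on its old voting set.
            storages.destroyAll();
            // Mark the node as a witness of the repair.
            vault.witnessedMgRepairClusterId = currentClusterId;
        }

        // The MG Raft server would be started here as usual.
        return reentry;
    }
}
```

In this sketch the storages are destroyed before the Vault is updated, so a 
crash between the two steps simply repeats the reentry on the next start.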

h2. Old (stale) description

If, during a join (on getting the fresh cluster state from the CMG), a node 
detects that, according to the MG configuration saved in the MG on this node, 
this node is a member of the voting set (i.e. it’s a peer, not a learner), 
and this node is NOT one of the metastorageNodes in the CMG, then, before 
starting its MG Raft member, it raises a flag that prevents its Raft node 
from becoming a candidate.

(This flag does not exist in JRaft, we need to introduce it there; the flag is 
not persisted).

As soon as the Raft node applies a new Raft configuration (coming from the new 
leader), this flag is cleared.

After this, the Raft node is ‘converted’ to the new MG and cannot hijack the 
leadership.


> Disallow old MG majority to hijack leadership
> ---------------------------------------------
>
>                 Key: IGNITE-22904
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22904
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: iep-128, ignite-3
>          Time Spent: 40m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
