[jira] [Comment Edited] (ARTEMIS-2716) Implements pluggable Quorum Vote

Francesco Nigro (Jira) Mon, 14 Jun 2021 11:01:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARTEMIS-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362898#comment-17362898
 ]


Francesco Nigro edited comment on ARTEMIS-2716 at 6/14/21, 6:00 PM:
--------------------------------------------------------------------

I'm going to:
 # *remove the initial loop on primary start*: a primary start should either 
succeed or fail (with errors), which is key for admin purposes. Admins are 
supposed to check the broker/machine state before restarting, so it's not just 
an automated operation but one that needs to be supervised
 # *deprecate/document allow-failback*: allow-failback == false turns a 
failing-back primary into a backup that can just error out on failover errors.
 In classic replication, a failing-back master forgets its Node ID if any error 
happens on failover and restarts as an empty backup. On broker restart, it gets 
a different NodeID and becomes live. This process is explained in 
https://issues.apache.org/jira/browse/ARTEMIS-3345 

The latter decision has been made to enforce what the primary role is meant to 
be: a live candidate and an occasional/temporary backup, ready to fail back 
ASAP. The allow-failback == false use case should be used to perform a manual 
failback (by restarting the backup acting as live), but NOT to let a primary 
become a natural-born long-living backup: this is because the primary itself 
has been (re)started due to a manual intervention after a previous outage.

Right now no automatic primary restart is safe, due to the journal 
misalignment issue explained in 
https://issues.apache.org/jira/browse/ARTEMIS-3340.


 It's perfectly fine to fail fast on a failure during the failback process, 
given that it should be an all-or-nothing admin operation.

For a failure during a failover (because the backup acting as live has 
rejected the initial failback request), it is still uncertain which behaviour 
should follow:
 * a natural-born backup would just search for other lives to pair/sync with, 
because as a backup, it's supposed to help other brokers
 * a primary is probably fine to just stop, because there is no point in 
restarting as primary (risking becoming live with a misaligned journal) or 
behaving like a natural-born long-living backup, i.e. the behaviour mentioned 
above

 This change is debatable and we can open a discussion on the PR about it.
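
To make the two candidate behaviours listed above easier to discuss, here is a 
minimal sketch of the decision in Java. It is only an illustration: the class, 
the enums and the method name are hypothetical and are not part of the Artemis 
code base or of the PR, and the policy shown is just the one proposed in this 
comment, which is still open for debate.

```java
public class FailoverFailurePolicy {

    /** Hypothetical roles, named after the terms used in this comment. */
    enum Role { PRIMARY, NATURAL_BORN_BACKUP }

    /** Hypothetical reactions to a failure during a failover. */
    enum Action { STOP_BROKER, SEARCH_FOR_ANOTHER_LIVE }

    /**
     * Proposed (debatable) policy: a primary stops instead of restarting,
     * while a natural-born backup keeps looking for a live to pair with.
     */
    static Action onFailoverFailure(Role role) {
        switch (role) {
            case PRIMARY:
                // Restarting as primary risks going live with a misaligned journal.
                return Action.STOP_BROKER;
            case NATURAL_BORN_BACKUP:
                // A backup is supposed to help other brokers.
                return Action.SEARCH_FOR_ANOTHER_LIVE;
            default:
                throw new IllegalArgumentException("Unknown role: " + role);
        }
    }

    public static void main(String[] args) {
        for (Role role : Role.values()) {
            System.out.println(role + " -> " + onFailoverFailure(role));
        }
    }
}
```

Keeping the decision in a single method like this would make it easy to swap 
the primary's reaction if the PR discussion settles on a different behaviour.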


was (Author: nigrofranz):
I'm going to:
 # *remove the initial loop on primary start*: a primary start should either 
succeed or fail (with errors), which is key for admin purposes. Admins are 
supposed to check the broker/machine state before restarting, so it's not just 
an automated operation but one that needs to be supervised
 # *deprecate/document allow-failback*: allow-failback == false turns a 
failing-back primary into a backup that can just error out on failover errors.
 In classic replication, a failing-back master forgets its Node ID if any error 
happens on failover and restarts as an empty backup. On broker restart, it gets 
a different NodeID and becomes live. This process is explained in 
https://issues.apache.org/jira/browse/ARTEMIS-3345

 

The latter decision has been made to enforce what the primary role is meant to 
be: mostly a live candidate and an occasional/temporary backup, ready to 
fail back ASAP.
 It's perfectly fine to fail fast on a failure during the failback process, 
given that it should be an all-or-nothing admin operation.

For a failure during a proper failover (because the backup acting as live has 
rejected the initial failback request), it is still uncertain which behaviour 
should follow:
 * a natural-born backup would just search for other lives to pair/sync with, 
because as a backup, it's supposed to help other brokers
 * a primary is probably fine to just stop, because there is no point in 
restarting as primary (risking becoming live with a misaligned journal) or 
behaving like a natural-born backup, i.e. the behaviour mentioned above

 This change is debatable and we can open a discussion on the PR about it.

> Implements pluggable Quorum Vote
> --------------------------------
>
>                 Key: ARTEMIS-2716
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2716
>             Project: ActiveMQ Artemis
>          Issue Type: New Feature
>            Reporter: Francesco Nigro
>            Assignee: Francesco Nigro
>            Priority: Major
>         Attachments: backup.png, primary.png
>
>          Time Spent: 16h
>  Remaining Estimate: 0h
>
> This task aims to deliver a new Quorum Vote mechanism for Artemis with the 
> following objectives:
> # to make it pluggable
> # to cleanly separate the election phase and the cluster member states
> # to simplify the most common setups in both amount of configuration and 
> requirements (e.g. "witness" nodes could be implemented to support single 
> master-slave pairs)
> Post-actions to help people adopt it, which need to be thought through 
> upfront:
> # a clean upgrade path for current HA replication users
> # deprecate or integrate the current HA replication into the new version



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
