[jira] [Closed] (ARTEMIS-2713) Master failback can trigger a useless quorum vote on slave failover

Clebert Suconic (Jira) Thu, 16 Apr 2020 15:48:13 -0700


     [ https://issues.apache.org/jira/browse/ARTEMIS-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Clebert Suconic closed ARTEMIS-2713.
------------------------------------
    Fix Version/s: 2.12.0
       Resolution: Fixed

> Master failback can trigger a useless quorum vote on slave failover
> -------------------------------------------------------------------
>
>                 Key: ARTEMIS-2713
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2713
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.11.0
>            Reporter: Francesco Nigro
>            Priority: Major
>             Fix For: 2.12.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> A shared-nothing replicated master/slave pair using check-for-live-server on
> the master and allow-failback on the slave can trigger one or more useless
> quorum votes while the master is restarting.
> Whether the issue happens depends on the timing with which some messages are
> exchanged between the pair. When the slave restarts as a backup, it performs
> these operations:
> # asynchronously send STOP_CALLED on the connection with the master used to
> send the replica files (let's call it the replication connection)
> # close all the connections with the master except the replication
> connection (sending a DISCONNECT on each one being closed)
> # asynchronously send FAIL_OVER on the replication connection (waiting 5
> seconds before giving up and moving on)
> # close the replication connection
> The master could receive the DISCONNECT before STOP_CALLED (because they
> travel on different connections!) and conclude that the slave isn't going
> down intentionally: this makes it fire up to vote-retries quorum votes.
> In the happy path, such a quorum vote should complete positively and
> "quickly", allowing the master to fail over anyway, because the slave has
> already moved on and (ideally) the other brokers have had "enough time" to
> update their topologies too.
> Although performing an additional quorum vote isn't a bad thing per se, it
> could open an unnecessarily long time window while waiting for the observing
> cluster to update its topology, slowing down an operation that is supposed
> to complete quickly (on the happy path).
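The ordering problem described in the issue can be illustrated with a small, self-contained Java model. This is a hypothetical sketch, not Artemis code: the class, enum and method names are invented for illustration, and the master's reaction is reduced to the rule stated in the issue (a DISCONNECT processed before STOP_CALLED is read as the slave going down unintentionally, which leads to a quorum vote). Because STOP_CALLED and DISCONNECT travel on different connections, nothing guarantees the master sees them in the order the slave sent them.

```java
import java.util.List;

// Hypothetical model of the ARTEMIS-2713 race; names are illustrative, not Artemis API.
public class FailbackRaceSketch {

    // Events the master may observe from the restarting slave (hypothetical names).
    enum Event {
        STOP_CALLED, // sent asynchronously on the replication connection
        DISCONNECT,  // sent on each non-replication connection being closed
        FAIL_OVER    // sent asynchronously on the replication connection
    }

    enum Decision { WAIT_FOR_FAILOVER, START_QUORUM_VOTE }

    // Simplified rule taken from the issue text: a DISCONNECT seen before
    // STOP_CALLED looks like an unintentional loss of the slave and triggers a vote.
    static Decision masterDecision(List<Event> arrivalOrder) {
        boolean stopCalled = false;
        for (Event e : arrivalOrder) {
            switch (e) {
                case STOP_CALLED:
                    stopCalled = true; // slave announced an intentional shutdown
                    break;
                case DISCONNECT:
                    if (!stopCalled) {
                        return Decision.START_QUORUM_VOTE; // looks like a crash
                    }
                    break;
                case FAIL_OVER:
                    break; // not modelled further: normal failback path
            }
        }
        return Decision.WAIT_FOR_FAILOVER;
    }

    public static void main(String[] args) {
        // Order in which the slave sends the messages.
        List<Event> sent = List.of(Event.STOP_CALLED, Event.DISCONNECT, Event.FAIL_OVER);
        // A possible arrival order at the master: the DISCONNECT, travelling on a
        // different connection, overtakes STOP_CALLED.
        List<Event> overtaken = List.of(Event.DISCONNECT, Event.STOP_CALLED, Event.FAIL_OVER);

        System.out.println(masterDecision(sent));
        System.out.println(masterDecision(overtaken));
    }
}
```

Running main prints WAIT_FOR_FAILOVER for the send order and START_QUORUM_VOTE for the overtaken order: the same three messages lead to different decisions purely because of cross-connection reordering, which is the timing dependence the issue describes.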



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
