[ 
https://issues.apache.org/jira/browse/ARTEMIS-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francesco Nigro updated ARTEMIS-2713:
-------------------------------------
          Component/s: Broker
    Affects Version/s: 2.11.0

> Master failback can trigger a useless quorum vote on slave failover
> -------------------------------------------------------------------
>
>                 Key: ARTEMIS-2713
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2713
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.11.0
>            Reporter: Francesco Nigro
>            Priority: Major
>
> A shared-nothing replicated master-slave pair that uses check-for-live-server 
> on the master and allow-failback on the slave can trigger one or more useless 
> quorum votes during master restart.
> Whether the issue occurs depends on the timing of some messages exchanged 
> between the pair. Specifically, the slave, while restarting as a backup, 
> performs these operations:
> # async send STOP_CALLED on the connection with the master that is used to 
> send the replica files (i.e. the replication connection)
> # close all the connections with the master except the replication 
> connection (sending a DISCONNECT on each closing one)
> # async send FAIL_OVER on the replication connection (waiting 5 seconds 
> before giving up and moving on)
> # close the replication connection
> The master, while restarting as live, could receive the DISCONNECT before 
> STOP_CALLED and conclude that the slave isn't going down intentionally: this 
> causes it to start a quorum vote, repeated up to vote-retries times.
> Such a quorum vote (in the happy path) will be positive and the master will 
> fail over anyway, because the slave has already moved on and (ideally) the 
> other brokers have had "enough time" to update their topologies too.
> Although performing an additional quorum vote isn't a bad thing per se, it 
> could create an unnecessarily long time window spent waiting for the 
> observing cluster to update its topologies, slowing down an operation that 
> is supposed to complete quickly (in the happy path).
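
The ordering problem described above can be illustrated with a small, self-contained Java sketch. This is not Artemis code: the class FailbackRaceSketch, the Packet enum, and the voteTriggered helper are hypothetical names made up for this illustration, and the model assumes the master decides to vote solely on whether STOP_CALLED was seen before the first DISCONNECT. Only the packet names come from the description.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the race; NOT the Artemis implementation.
final class FailbackRaceSketch {

    // Packet names are from the issue description; the enum is hypothetical.
    enum Packet { STOP_CALLED, DISCONNECT, FAIL_OVER }

    // Returns true if the master would start a (useless) quorum vote
    // for the given arrival order of packets from the slave.
    static boolean voteTriggered(List<Packet> arrivalOrder) {
        boolean slaveStopAnnounced = false;
        for (Packet p : arrivalOrder) {
            if (p == Packet.STOP_CALLED) {
                // The slave announced an intentional shutdown.
                slaveStopAnnounced = true;
            } else if (p == Packet.DISCONNECT && !slaveStopAnnounced) {
                // Assumption: an unannounced disconnect looks like a failure,
                // so the master asks the quorum before going live.
                return true;
            }
            // FAIL_OVER does not affect the decision in this simplified model.
        }
        return false;
    }

    public static void main(String[] args) {
        // Expected order: STOP_CALLED is seen first, no vote is needed.
        System.out.println(voteTriggered(Arrays.asList(
            Packet.STOP_CALLED, Packet.DISCONNECT, Packet.FAIL_OVER)));
        // Racy order: DISCONNECT overtakes STOP_CALLED, a vote is started.
        System.out.println(voteTriggered(Arrays.asList(
            Packet.DISCONNECT, Packet.STOP_CALLED, Packet.FAIL_OVER)));
    }
}
```

Because STOP_CALLED is sent asynchronously on the replication connection while DISCONNECT travels on the other connections, nothing in this model forces the first ordering, which is the timing dependency the description points at.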



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
