[ 
https://issues.apache.org/jira/browse/ARTEMIS-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076251#comment-17076251
 ] 

Francesco Nigro commented on ARTEMIS-2690:
------------------------------------------

By reading

{quote}We dont see this happen on our less busy broker pairs.{quote}

I will focus instead on something more important: the capacity of the brokers
to stay responsive, which affects many things (including PINGs, connectivity,
and the ability to keep the topology updated in order to make correct
decisions).
So, please, be sure that GC and safepoint pauses are under control and that
enough CPU time is available on your cluster (yes: including the witness nodes
too).
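
For reference, a minimal way to keep an eye on this (a sketch, assuming a
JDK 9+ JVM; the log path and rotation settings are placeholders to adapt to
your installation) is to enable unified GC and safepoint logging through the
{{JAVA_ARGS}} in {{artemis.profile}}:

{code}
# sketch: append to JAVA_ARGS in artemis.profile (JDK 9+ unified logging)
# records every GC and safepoint pause with timestamps, rotating 5 x 10m files
-Xlog:gc*,safepoint:file=/var/log/artemis/gc.log:time,uptime:filecount=5,filesize=10m
{code}

Long pauses showing up in that log would explain missed PINGs and stale
topology updates.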

{quote}the problem is that the master is returning to live before the slave 
completely loads{quote}

Quoting myself 
{quote}As you have noticed, just making the master wait before restarting is
enough to give the slave enough time to be reachable (again).{quote}

So, in the script that restarts the broker, you should leave enough time for
the slave to become available (use curl to check whether the slave has become
live, and make the master wait before restarting until it has).
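
A minimal sketch of such a check (the management URL, port, credentials,
broker name and the {{Backup}} attribute are assumptions to verify against
your deployment; Jolokia exposes the broker MBean over HTTP):

{code:python}
# Hypothetical pre-restart check: block until the slave has activated as live.
# URL, port, broker name and (missing) auth are assumptions; adapt as needed.
import json
import time
import urllib.request

SLAVE_JOLOKIA = ('http://replica1:8161/console/jolokia/read/'
                 'org.apache.activemq.artemis:broker=%22replica1%22/Backup')


def slave_is_live():
    try:
        with urllib.request.urlopen(SLAVE_JOLOKIA, timeout=5) as resp:
            body = json.load(resp)
    except OSError:
        return False  # management endpoint not reachable yet: keep waiting
    # Backup == false means the slave has fully loaded and activated as live
    return body.get('value') is False


while not slave_is_live():
    time.sleep(5)
# only now restart the master: "check for live server" will find the slave in
# the topology, the master will start as a backup and failback can proceed
{code}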

{quote}the problem is that the master is returning to live before the slave 
completely loads. this is not a voting issue, at least what we see proves it is 
not a voting issue{quote}

Currently "check for live server" is not requesting any quorum vote , but it 
just attempts to connect to another broker with the same node in the topology 
(if any) and if it isn't found, will make the master to start as a live: 
ideally a quorum vote would be a better approach here and is something that 
could be improved.

{quote}
what if the slave goes live due to bad voting
{quote}

Let's focus on who is deciding without requesting any quorum vote, i.e. the
master on restart.




> Intermittent network failure caused live and replica to both be live
> --------------------------------------------------------------------
>
>                 Key: ARTEMIS-2690
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2690
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.11.0
>         Environment: Artemis 2.11.0, Ubuntu 18.04
>            Reporter: Sebastian Lövdahl
>            Priority: Major
>         Attachments: live1-artemis.log, live1-broker.xml, live2-artemis.log, 
> live2-broker.xml, live3-artemis.log, live3-broker.xml, replica1-artemis.log, 
> replica1-broker.xml
>
>
> An intermittent network failure caused both the live and replica to be live. 
> Both happily accepted incoming connections until the node that was supposed 
> to be the replica was manually shut down. Log files from all 4 nodes are 
> attached. The {{replica1}} node happened to have some TRACE logging enabled 
> as well.
>  
> As far as I have understood the documentation, the setup should be safe from 
> a split brain point of view. The live2 and live3 nodes intentionally don't 
> have any replicas at the moment. Complete {{broker.xml}} files are attached, 
> but for reference, this is the {{ha-policy}}:
> live1:
> {code:xml}
> <ha-policy>
>   <replication>
>     <master>
>       <cluster-name>my-cluster</cluster-name>
>       <group-name>group1</group-name>
>       <check-for-live-server>true</check-for-live-server>
>       <vote-on-replication-failure>true</vote-on-replication-failure>
>     </master>
>   </replication>
> </ha-policy>
> {code}
> replica1:
> {code:xml}
> <ha-policy>
>   <replication>
>     <slave>
>        <cluster-name>my-cluster</cluster-name>
>        <group-name>group1</group-name>
>        <allow-failback>true</allow-failback>
>        <vote-on-replication-failure>true</vote-on-replication-failure>
>     </slave>
>   </replication>
> </ha-policy>
> {code}
> live2:
> {code:xml}
> <ha-policy>
>   <replication>
>     <master>
>        <cluster-name>my-cluster</cluster-name>
>        <group-name>group2</group-name>
>        <check-for-live-server>true</check-for-live-server>
>        <vote-on-replication-failure>true</vote-on-replication-failure>
>     </master>
>   </replication>
> </ha-policy>
> {code}
> live3:
> {code:xml}
> <ha-policy>
>   <replication>
>     <master>
>        <cluster-name>my-cluster</cluster-name>
>        <group-name>group2</group-name>
>        <check-for-live-server>true</check-for-live-server>
>        <vote-on-replication-failure>true</vote-on-replication-failure>
>     </master>
>   </replication>
> </ha-policy>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
