[ 
https://issues.apache.org/jira/browse/ARTEMIS-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074454#comment-17074454
 ] 

Thomas Wood commented on ARTEMIS-2690:
--------------------------------------

I can reproduce this on our bigger, busier brokers at any time. We have 4 
pairs and quorum voting is working as expected. 
A failure can be triggered by restarting the master service or by disconnecting 
and reconnecting the network. The slave gets permission to go live and does, 
but at the same time the master has also returned to being live. We don't see 
this happen on our less busy broker pairs.
It seems like the "check for live server" mechanism is not working correctly, 
or the slave needs to check its connection to the master while it is loading, 
maybe in a separate thread? I've been trying to get a debug session set up to 
trace this but can't find the time.
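
In case it helps anyone debugging this: one thing we have been considering is tuning the quorum vote itself on the master policy. This is only a sketch based on my reading of the Artemis HA documentation, not something we have verified against this bug, and the values below are illustrative assumptions:

{code:xml}
<ha-policy>
  <replication>
    <master>
      <cluster-name>my-cluster</cluster-name>
      <group-name>group1</group-name>
      <check-for-live-server>true</check-for-live-server>
      <vote-on-replication-failure>true</vote-on-replication-failure>
      <!-- illustrative assumptions: require a larger explicit quorum so a
           brief network blip is less likely to let both sides stay live -->
      <quorum-size>3</quorum-size>
    </master>
  </replication>
</ha-policy>
{code}

By default the quorum size is derived from the cluster size; pinning it explicitly changes how many nodes must agree before a node decides it is live, which may or may not affect this race.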

> Intermittent network failure caused live and replica to both be live
> --------------------------------------------------------------------
>
>                 Key: ARTEMIS-2690
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2690
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.11.0
>         Environment: Artemis 2.11.0, Ubuntu 18.04
>            Reporter: Sebastian Lövdahl
>            Priority: Major
>         Attachments: live1-artemis.log, live1-broker.xml, live2-artemis.log, 
> live2-broker.xml, live3-artemis.log, live3-broker.xml, replica1-artemis.log, 
> replica1-broker.xml
>
>
> An intermittent network failure caused both the live and replica to be live. 
> Both happily accepted incoming connections until the node that was supposed 
> to be the replica was manually shut down. Log files from all 4 nodes are 
> attached. The {{replica1}} node happened to have some TRACE logging enabled 
> as well.
>  
> As far as I have understood the documentation, the setup should be safe from 
> a split brain point of view. The live2 and live3 nodes intentionally don't 
> have any replicas at the moment. Complete {{broker.xml}} files are attached, 
> but for reference, this is the {{ha-policy}}:
> live1:
> {code:xml}
> <ha-policy>
>   <replication>
>     <master>
>       <cluster-name>my-cluster</cluster-name>
>       <group-name>group1</group-name>
>       <check-for-live-server>true</check-for-live-server>
>       <vote-on-replication-failure>true</vote-on-replication-failure>
>     </master>
>   </replication>
> </ha-policy>
> {code}
> replica1:
> {code:xml}
> <ha-policy>
>   <replication>
>     <slave>
>        <cluster-name>my-cluster</cluster-name>
>        <group-name>group1</group-name>
>        <allow-failback>true</allow-failback>
>        <vote-on-replication-failure>true</vote-on-replication-failure>
>     </slave>
>   </replication>
> </ha-policy>
> {code}
> live2:
> {code:xml}
> <ha-policy>
>   <replication>
>     <master>
>        <cluster-name>my-cluster</cluster-name>
>        <group-name>group2</group-name>
>        <check-for-live-server>true</check-for-live-server>
>        <vote-on-replication-failure>true</vote-on-replication-failure>
>     </master>
>   </replication>
> </ha-policy>
> {code}
> live3:
> {code:xml}
> <ha-policy>
>   <replication>
>     <master>
>        <cluster-name>my-cluster</cluster-name>
>        <group-name>group2</group-name>
>        <check-for-live-server>true</check-for-live-server>
>        <vote-on-replication-failure>true</vote-on-replication-failure>
>     </master>
>   </replication>
> </ha-policy>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
