As Justin pointed out, look at the Network Health Check, or use better network infrastructure to avoid split brain.
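For reference, the network check Justin links below ([1]) is driven by a handful of broker.xml elements. A minimal sketch, assuming the element names from the 1.5.2 network-isolation docs; the ping target 10.0.0.1 is a placeholder — use an address (e.g. your gateway) that is only reachable while this node's network is actually healthy:

```xml
<core xmlns="urn:activemq:core">
   <!-- run the check every 10s, with a 1s timeout per ping -->
   <network-check-period>10000</network-check-period>
   <network-check-timeout>1000</network-check-timeout>
   <!-- placeholder address: a host that is unreachable only when this node is isolated -->
   <network-check-list>10.0.0.1</network-check-list>
</core>
```

While the ping fails, a live broker shuts itself down and a backup refuses to activate, which prevents the isolated node from going live and splitting the brain.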
On Mon, Jan 30, 2017 at 11:48 AM, Justin Bertram <jbert...@apache.com> wrote:

>> It does what I think it does, now my slave and my master are active. This however is acceptable, no problems yet.
>
> Actually, this is a problem. This is the classic split-brain scenario. Since both your master and slave are active with the same messages you will lose data integrity. Once the network connection between the live and (now active) backup is restored there is nothing which can be done to re-integrate the data since there is no way of knowing which broker has the right data. This is the risk you run with a single live and backup. To mitigate the risk of split-brain you have a couple of options:
>
> 1) Invest in redundant network infrastructure (e.g. multiple NICs on each machine, redundant network switches, etc.). Obviously you'll need to perform a cost/risk analysis here to determine how much your data is actually worth.
> 2) Configure a larger cluster of live/backup pairs so that if a connection between nodes is lost a quorum vote can (hopefully) prevent the illegitimate activation of a backup.
> 3) Similar to #2 you can use the recently added "network check" functionality [1].
>
> Justin
>
> [1] http://activemq.apache.org/artemis/docs/1.5.2/network-isolation.html
>
> ----- Original Message -----
> From: "Gerrit Tamboer" <gerrit.tamb...@crv4all.com>
> To: users@activemq.apache.org
> Sent: Monday, January 30, 2017 10:03:42 AM
> Subject: Re: Problems setting up replicated ha-policy.
>
> Hi Clebert,
>
> Thanks for pointing me in the right direction, I was able to set up replication with active/passive failover.
>
> I am able to stop the master or kill the master and the slave is responding to it. If I start up the master again the slave replicates back to master and the master becomes active. So far so good.
>
> So what I simulated now is a network outage.
> I did this by simply making sure that the master cannot connect to the slave and vice versa (VirtualBox, setting the network adapter to disabled).
>
> It does what I think it does, now my slave and my master are active. This however is acceptable, no problems yet. But when I enable the network adapter again, making sure the master and slave can connect, it does not do a failback. The slave stays active, as well as the master, and they don’t seem to communicate. Is this some sort of split-brain situation?
>
> Regards,
> Gerrit
>
> On 27/01/17 21:25, "Clebert Suconic" <clebert.suco...@gmail.com> wrote:
>
> The only issue I found is how you are defining this:
>
> <connector name="localhost">tcp://localhost:61616</connector>
>
> On the cluster connection you are passing localhost as the node. That is sent to the backup, and the backup will try to connect to localhost, which is itself, so it won't actually connect to the other node.
>
> You should pass in an IP that will be valid on the second node.
>
> Hope this helps...
>
> Look at the examples/features/ha/replicated-failback-static example
>
> On Fri, Jan 27, 2017 at 9:28 AM, Clebert Suconic <clebert.suco...@gmail.com> wrote:
>> I won't be able to get to a computer today. Only on Monday.
>>
>> Meanwhile can you compare your config with the replicated examples from the release? That's what I would do anyways.
>>
>> Try with a single live/backup. Make sure the IDs match on the backup so it can pull the data.
>>
>> Let me know how it goes. I may find a time to open a computer this afternoon.
>>
>> On Fri, Jan 27, 2017 at 5:32 AM Gerrit Tamboer <gerrit.tamb...@crv4all.com> wrote:
>>>
>>> Hi Clebert,
>>>
>>> Thanks for pointing this out.
>>>
>>> I just tested 1.5.2 but unfortunately the results are exactly the same. No failover situation although the slave sees the master going down. The slave does not even notice a master being gone after a kill -9.
>>>
>>> This leads me to believe I have a misconfiguration, because if this is designed to work like this, it’s not really HA.
>>>
>>> I have added the broker.xml’s of all nodes to this mail again, hopefully somebody has a similar setup and can verify the configuration.
>>>
>>> Thanks a bunch!
>>>
>>> Regards,
>>> Gerrit Tamboer
>>>
>>> On 27/01/17 04:33, "Clebert Suconic" <clebert.suco...@gmail.com> wrote:
>>>
>>> Until recently (1.5.0) you would only have the TTL to decide when to activate the backup.
>>>
>>> Recently, connection failures will also play into the decision to activate it.
>>>
>>> So on 1.3.0 you will be bound to the TTL of the cluster connection.
>>>
>>> On 1.5.2 it should work with kill, but you would still be bound to the TTL in case of a cable cut or a switch going off — but that's the deal with TCP/IP.
>>>
>>> On Thu, Jan 26, 2017 at 7:24 AM Gerrit Tamboer <gerrit.tamb...@crv4all.com> wrote:
>>>
>>> > Forgot to send the attachments!
>>> >
>>> > *From: *Gerrit Tamboer <gerrit.tamb...@crv4all.com>
>>> > *Date: *Thursday 26 January 2017 at 13:23
>>> > *To: *"users@activemq.apache.org" <users@activemq.apache.org>
>>> > *Subject: *Problems setting up replicated ha-policy.
>>> >
>>> > Hi community,
>>> >
>>> > We are attempting to set up a 3-node Artemis (1.3.0) cluster with an active-passive failover situation.
>>> > We see that the master node is actively accepting connections:
>>> >
>>> > 09:52:30,167 INFO [org.apache.activemq.artemis.core.server] AMQ221000: live Message Broker is starting with configuration Broker Configuration (clustered=true,journalDirectory=./data/journal,bindingsDirectory=./data/bindings,largeMessagesDirectory=./data/large-messages,pagingDirectory=/opt/jamq_paging_data/data)
>>> >
>>> > 09:52:33,176 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61616 for protocols [CORE,MQTT,AMQP,HORNETQ,STOMP,OPENWIRE]
>>> >
>>> > The slaves are able to connect to the master and are reporting that they are in standby mode:
>>> >
>>> > 08:16:57,426 INFO [org.apache.activemq.artemis.core.server] AMQ221000: backup Message Broker is starting with configuration Broker Configuration (clustered=true,journalDirectory=./data/journal,bindingsDirectory=./data/bindings,largeMessagesDirectory=./data/large-messages,pagingDirectory=/opt/jamq_paging_data/data)
>>> >
>>> > 08:18:38,529 INFO [org.apache.activemq.artemis.core.server] AMQ221109: Apache ActiveMQ Artemis Backup Server version 1.3.0 [null] started, waiting live to fail before it gets active
>>> >
>>> > However, when I kill the master node now, it reports that the master is gone, but does not become active itself:
>>> >
>>> > 08:20:14,987 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure has been detected: AMQ119015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
>>> >
>>> > When I do a kill -9 on the PID of the master java process, it does not even report that the master has gone away.
>>> >
>>> > I also tested this in Artemis 1.5.1, with the same results.
>>> > Also, removing one of the slaves (to have a simple master-slave setup) does not work.
>>> >
>>> > My expectation is that if the master dies, one of the slaves becomes active.
>>> >
>>> > Attached you will find the broker.xml of all 3 nodes.
>>> >
>>> > Thanks in advance for the help!
>>> >
>>> > Kind regards,
>>> >
>>> > Gerrit Tamboer
>>> >
>>> > This message is subject to the following E-mail Disclaimer. (http://www.crv4all.com/disclaimer-email/) CRV Holding B.V. seats according to the articles of association in Arnhem, Dutch trade number 09125050.
>>>
>>> --
>>> Clebert Suconic
>>
>> --
>> Clebert Suconic
>
> --
> Clebert Suconic

--
Clebert Suconic
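For anyone landing on this thread later: the two configuration points raised above — advertising a connector address the other node can actually reach (not localhost), and enabling failback on the replicated pair — can be sketched in broker.xml roughly as follows. This is a minimal sketch, not the poster's actual config; 192.168.56.10 and the connector name are placeholders for the live node's real address:

```xml
<!-- live node: advertise an address the backup can reach, never localhost -->
<connectors>
   <connector name="netty-connector">tcp://192.168.56.10:61616</connector>
</connectors>

<ha-policy>
   <replication>
      <master>
         <!-- on restart, check for a live server so the old master can fail back -->
         <check-for-live-server>true</check-for-live-server>
      </master>
   </replication>
</ha-policy>
```

On the backup node, the counterpart would be a <slave> element with <allow-failback>true</allow-failback>, so the backup steps down again once the original live returns; compare with the shipped examples/features/ha/replicated-failback-static example mentioned above.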