The setup:
- We are setting up a cluster of 6 brokers using Artemis 2.4.0.
- The cluster has 3 groups.
- Each group has one master and one slave broker pair.
- HA uses replication.
- Each master broker configuration has the flag ‘check-for-live-server’ set to true.
- Each slave broker configuration has the flag ‘allow-failback’ set to true.
- We use static connectors to allow cluster topology discovery.
- Each broker’s static connector list includes the connectors to the other 5 servers in the cluster.
- Each broker declares its acceptor.
- Each broker exports its own connector information via the ‘connector-ref’ configuration element.
- The acceptor and the connector URLs for each broker are identical with respect to host and port information.
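For reference, the relevant parts of each pair's broker.xml look roughly like the sketch below. The host names, ports, and connector/cluster/group names here are only placeholders to illustrate the shape of the configuration; the exact files are in the attached zip. Master broker of the first group:

   <connectors>
      <!-- this broker's own connector; URL matches its acceptor -->
      <connector name="group1-master">tcp://group1-master-host:61616</connector>
      <!-- connectors to the other five brokers in the cluster -->
      <connector name="group1-slave">tcp://group1-slave-host:61616</connector>
      <!-- ... four more connectors, for the group 2 and group 3 pairs ... -->
   </connectors>

   <acceptors>
      <!-- same host and port as this broker's own connector -->
      <acceptor name="artemis">tcp://group1-master-host:61616</acceptor>
   </acceptors>

   <cluster-connections>
      <cluster-connection name="my-cluster">
         <!-- export this broker's own connector to the rest of the cluster -->
         <connector-ref>group1-master</connector-ref>
         <static-connectors>
            <!-- static list of the other five brokers -->
            <connector-ref>group1-slave</connector-ref>
            <!-- ... four more connector-refs ... -->
         </static-connectors>
      </cluster-connection>
   </cluster-connections>

   <ha-policy>
      <replication>
         <master>
            <!-- group-name shown here only to indicate the per-group pairing -->
            <group-name>group-1</group-name>
            <check-for-live-server>true</check-for-live-server>
         </master>
      </replication>
   </ha-policy>

The slave of each group differs only in the ha-policy section:

   <ha-policy>
      <replication>
         <slave>
            <group-name>group-1</group-name>
            <allow-failback>true</allow-failback>
         </slave>
      </replication>
   </ha-policy>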
Our team ran some clustering experiments on the topology described above, and here are the reported observations:

Issue A:
======
0) Initial state: all the masters and slaves in all the groups are running in their expected initial roles.
1) When the master in the first group failed, its slave initiated quorum voting, took over the master's responsibility, and became the new master.
2) A failure was then triggered in the second group's master, but the second group's slave did not take over its responsibility, apparently because this time the slave did not get the quorum. Note that in the first group, the old slave is acting as the current master at this point.

a) To us, this meant that the current master in the first group (the original slave), although acting as the master, does not vote when quorum voting is initiated in the cluster. Is that correct?
b) Does this imply that the cluster always has to be back in the initial state (as in step 0 above) for failover to take place for any master/slave pair?

Issue B:
======
After a failover happened successfully in a group (as in step 1 above), the old master was brought back up. At this point we were expecting the old master to take back the master role from the current master (the old slave), since the slave allows failback. But apparently no failback happened.

I do not have the logs to analyze, so I am attaching the broker.xml configurations of the 6 brokers that I got from the people who ran the actual experiments.

Any insights regarding these issues will be highly appreciated.

Thanks,
Anindya Haldar
Oracle Marketing Cloud
<<attachment: artemis-cluster-setup.zip>>