The setup:
- We are setting up a cluster of 6 brokers using Artemis 2.4.0.
- The cluster has 3 groups.
- Each group has one master and one slave broker pair.
- HA uses replication.
- Each master broker configuration has the flag ‘check-for-live-server’ set to true.
- Each slave broker configuration has the flag ‘allow-failback’ set to true.
- We use static connectors to allow cluster topology discovery.
- Each broker’s static connector list includes the connectors to the other 5 servers in the cluster.
- Each broker declares its acceptor.
- Each broker exports its own connector information via the ‘connector-ref’ configuration element.
- The acceptor and the connector URLs for each broker are identical with respect to host and port information.
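For reference, the relevant parts of each pair's broker.xml look roughly like the sketch below. The host names, ports, and connector/cluster/group names here are only placeholders to illustrate the shape of the configuration; the exact files are in the attached zip. Master broker of the first group:

   <connectors>
      <!-- this broker's own connector; URL matches its acceptor -->
      <connector name="group1-master">tcp://group1-master-host:61616</connector>
      <!-- connectors to the other five brokers in the cluster -->
      <connector name="group1-slave">tcp://group1-slave-host:61616</connector>
      <!-- ... four more connectors, for the group 2 and group 3 pairs ... -->
   </connectors>

   <acceptors>
      <!-- same host and port as this broker's own connector -->
      <acceptor name="artemis">tcp://group1-master-host:61616</acceptor>
   </acceptors>

   <cluster-connections>
      <cluster-connection name="my-cluster">
         <!-- export this broker's own connector to the rest of the cluster -->
         <connector-ref>group1-master</connector-ref>
         <static-connectors>
            <!-- static list of the other five brokers -->
            <connector-ref>group1-slave</connector-ref>
            <!-- ... four more connector-refs ... -->
         </static-connectors>
      </cluster-connection>
   </cluster-connections>

   <ha-policy>
      <replication>
         <master>
            <!-- group-name shown here only to indicate the per-group pairing -->
            <group-name>group-1</group-name>
            <check-for-live-server>true</check-for-live-server>
         </master>
      </replication>
   </ha-policy>

The slave of each group differs only in the ha-policy section:

   <ha-policy>
      <replication>
         <slave>
            <group-name>group-1</group-name>
            <allow-failback>true</allow-failback>
         </slave>
      </replication>
   </ha-policy>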
Our team ran some clustering experiments on the topology described above, and here are the reported observations:

Issue A:
======
0) Initial state: all the masters and slaves in all the groups are running in their expected initial roles.
1) When the master in the first group failed, its slave initiated quorum voting, took over the master's responsibility, and became the new master.
2) A failure was then triggered in the second group's master, but the second group's slave did not take over its responsibility, apparently because this time the slave did not get the quorum. Note that in the first group, the old slave is acting as the current master at this point.

a) To us, this meant that the current master in the first group (the original slave), although acting as the master, does not vote when quorum voting is initiated in the cluster. Is that correct?
b) Does this imply that the cluster always has to be back in the initial state (as in step 0 above) for failover to take place for any master/slave pair?

Issue B:
======
After a failover happened successfully in a group (as in step 1 above), the old master was brought back up. At this point we were expecting the old master to take back the master role from the current master (the old slave), since the slave allows failback. But apparently no failback happened.

I do not have the logs to analyze, so I am attaching the broker.xml configurations of the 6 brokers that I got from the people who ran the actual experiments.

Any insights regarding these issues will be highly appreciated.

Thanks,
Anindya Haldar
Oracle Marketing Cloud
<<attachment: artemis-cluster-setup.zip>>