When using network replication between a live broker and a backup, it is extremely important that the network connection between the two brokers is reliable. If that connection dies and those two nodes are the only ones in the cluster, you get a "split brain" where both the live and the backup are active simultaneously.
To mitigate this risk you should configure multiple live/backup pairs in the cluster so that the backup can perform a legitimate quorum vote when the live dies (or the network connection between the two dies). You can also use the network checker [1] to mitigate this, as mentioned on another thread about this issue. In general, I don't recommend running a single live/backup pair, as the risk of split brain is typically just too high.


Justin

[1] https://activemq.apache.org/artemis/docs/latest/network-isolation.html

On Fri, Sep 22, 2017 at 10:03 AM, boris_snp <boris.godu...@spglobal.com> wrote:
> I have to restart my 2-broker cluster on a daily basis due to the following
> sequence of events:
> ----------------------------------------------------------------------------------
> master
> 04:51:14,501 AMQ212037: Connection failure has been detected: AMQ119014: Did not receive data from /10.202.147.99:58739 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 04:51:14,510 AMQ222092: Connection to the backup node failed, removing replication now: ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT message=AMQ119014: Did not receive data from /10.202.147.99:58739 within the 60,000ms connection TTL. The connection will now be closed.]
> 04:51:24,517 AMQ212041: Timed out waiting for netty channel to close
> 04:51:24,517 AMQ212037: Connection failure has been detected: AMQ119014: Did not receive data from /10.202.147.99:58738 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> ----------------------------------------------------------------------------------
> slave
> 04:51:42,306 AMQ212037: Connection failure has been detected: AMQ119011: Did not receive data from server for org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection@1c54a4bc[local=/10.202.147.99:58738, remote=nj09mhf0681/10.202.147.99:41410] [code=CONNECTION_TIMEDOUT]
> 04:51:42,316 AMQ212037: Connection failure has been detected: AMQ119011: Did not receive data from server for org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection@65ace922[local=/10.202.147.99:58739, remote=nj09mhf0681/10.202.147.99:41410] [code=CONNECTION_TIMEDOUT]
> 04:51:46,955 AMQ221037: ActiveMQServerImpl::serverUUID=7ffa29a0-7c48-11e7-9784-e83935127b09 to become 'live'
> 04:51:59,360 AMQ221014: 40% loaded
> 04:52:01,854 AMQ221014: 81% loaded
> 04:52:03,037 AMQ222028: Could not find page cache for page PagePositionImpl [pageNr=8, messageNr=-1, recordID=8662153341] removing it from the journal
> 04:52:03,051 AMQ222028: Could not find page cache for page PagePositionImpl [pageNr=13, messageNr=-1, recordID=8662204094] removing it from the journal
> 04:52:03,208 AMQ221003: Deploying queue jms.queue.DLQ
> 04:52:03,281 AMQ221003: Deploying queue jms.queue.ExpiryQueue
> 04:52:03,827 AMQ212034: There are more than one servers on the network broadcasting the same node id.
> ----------------------------------------------------------------------------------
> master
> 04:52:03,827 AMQ212034: There are more than one servers on the network broadcasting the same node id.
> ----------------------------------------------------------------------------------
> slave
> 04:52:03,910 AMQ221007: Server is now live
> 04:52:04,003 AMQ221020: Started Acceptor at nj09mhf0681:41411 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
> 04:52:11,949 AMQ212034: There are more than one servers on the network broadcasting the same node id.
> ----------------------------------------------------------------------------------
> I understand that at some point the master (live now) loses the slave and closes its connection to it. The slave (backup now) in turn detects that no master is available and becomes live itself. Now both brokers are live and never recover from that state.
> How can I avoid restarts and have the brokers recover to a usable state by themselves?
> Thank you.
>
> --
> Sent from: http://activemq.2283324.n4.nabble.com/ActiveMQ-User-f2341805.html
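For reference, the network checker from the network-isolation documentation is configured in broker.xml. A minimal sketch is below; the address and timings are placeholder assumptions, not recommendations. The target should be a stable host (e.g. your default gateway), not the other broker in the pair, since the point is to detect that the broker itself has been cut off from the network:

```xml
<core xmlns="urn:activemq:core">

   <!-- Ping 10.0.0.1 (placeholder: use e.g. your gateway) every 10s.
        If the address does not answer within 1s, the broker assumes it
        is the isolated node and shuts itself down instead of staying
        (or becoming) live, which avoids the split brain shown above. -->
   <network-check-period>10000</network-check-period>
   <network-check-timeout>1000</network-check-timeout>
   <network-check-list>10.0.0.1</network-check-list>

</core>
```

Note this only mitigates the problem; with a single live/backup pair there is still no quorum, so multiple pairs remain the safer option.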