Hi Igor,

Each node in the group monitors the state of its counterparts by sending ping commands using "com.sleepycat.je.rep.util.DbPing". Older versions of the BDB JE environment did not react promptly to majority loss. As a result, monitoring functionality was added to the broker to detect when other nodes in the group become unreachable and to generate special operational logs. Newer BDB JE versions are much better at detecting node disconnects, but BDB JE still does not provide a good API for listening to the state of other nodes in the group, so the monitoring functionality was kept in the broker.
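To make the mechanics concrete, here is an illustrative sketch (not Broker-J source) of that monitoring cycle in Python; the node name and the 1000 ms default are taken from this thread, and ping_node is a hypothetical stand-in for invoking DbPing:

```python
# Illustrative sketch of the broker's remote-node monitoring cycle
# (not Broker-J source). The 1000 ms default mirrors the context
# variable qpid.bdb.ha.remote_node_monitor_timeout.
MONITOR_TIMEOUT_MS = 1000

def ping_node(node, simulated_latency_ms):
    # Stand-in for invoking com.sleepycat.je.rep.util.DbPing; here we
    # simply report the simulated round-trip time in milliseconds.
    return simulated_latency_ms

def poll_group(latencies_ms, timeout_ms=MONITOR_TIMEOUT_MS):
    """Classify each remote node by whether its DbPing response
    arrives within the monitor timeout."""
    states = {}
    for node, latency in latencies_ms.items():
        if ping_node(node, latency) > timeout_ms:
            # In the broker this is when the HA-1006 'Left' and
            # HA-1010 'UNREACHABLE' operational logs are emitted.
            states[node] = "UNREACHABLE"
        else:
            states[node] = "REACHABLE"
    return states

# A ~1.2 s latency spike trips the default 1000 ms timeout, while a
# 5000 ms timeout tolerates it:
print(poll_group({"STO2_TEST_HA": 1200}))
print(poll_group({"STO2_TEST_HA": 1200}, timeout_ms=5000))
```

This also illustrates why the logs recover about one second later: the next monitoring cycle succeeds as soon as a response arrives in time.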
The DbPing is issued periodically with an interval defined by the context variable "qpid.bdb.ha.remote_node_monitor_interval". By default, the value is 1000 milliseconds. The monitoring functionality waits for the timeout specified in the context variable "qpid.bdb.ha.remote_node_monitor_timeout" for the DbPing to receive the status details of the other node. The default value of this timeout is also 1000 milliseconds. If the timeout expires and no response to the DbPing is received, the node is considered UNREACHABLE and operational logs like the ones below are generated:

HA-1006 : Role change reported: Node : 'TEST_HA' (testnode1:5050)
HA-1010 : Role change reported: Node : 'TEST_TEST_HA' (testnode1:5050) : from 'REPLICA' to 'UNREACHABLE'

There is a third timeout for the DbPing to open a socket to the remote node. The value for this timeout is set with the context variable "qpid.bdb.ha.db_ping_socket_timeout". Its default value is also 1000 milliseconds.

It seems that in your environment the default timeout values can be increased in order to reduce the amount of misleading logs reported when DbPing responses do not arrive before the timeouts expire. I think the default timeout values are a bit optimistic. It would be more realistic to set qpid.bdb.ha.remote_node_monitor_timeout to 5 seconds and qpid.bdb.ha.db_ping_socket_timeout to 3 seconds, or even greater values.

The timeouts can be overridden in multiple ways:
* using JVM system properties, for example
  export QPID_OPTS="-Dqpid.bdb.ha.remote_node_monitor_timeout=5000 -Dqpid.bdb.ha.db_ping_socket_timeout=3000"
* by setting context variables at the broker level (globally for all BDB HA virtual host nodes) or directly on the BDB HA virtual host node level

If required, the context variable overrides can be added into a customized initial broker configuration which can be used to create new broker instances.
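For the configuration-based approach, a broker-level context override could look something like the fragment below (a sketch only; the surrounding attributes in your actual broker configuration file will differ):

```json
{
  "name": "Broker",
  "context": {
    "qpid.bdb.ha.remote_node_monitor_timeout": "5000",
    "qpid.bdb.ha.db_ping_socket_timeout": "3000"
  }
}
```

Context variables set here apply globally; setting the same keys in the context of an individual BDB HA virtual host node scopes the override to that node only.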
Kind Regards,
Alex

On Thu, 11 Apr 2019 at 17:34, Igor Natanzon <igor.natan...@gmail.com> wrote:
> Broker-J 7.1.1
> I have a following HA node setup:
> Datacenter 1: node 1 (master)
> Datacenter 2: node 2 (replica), node 3 (replica)
>
> All nodes are set to sync because data preservation and consistency is
> critical for this use-case (latency not high priority).
> Ping between the two datacenters (which are geographically nearly each
> other) is on average 2ms.
>
> What I see on a periodic basis is the constant timeouts between the nodes,
> which always recover approximately one second later.
>
> Log on node 1:
>
> 2019-04-10 20:35:24,898 WARN [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Timeout whilst determining
> state for node 'STO2_TEST_HA' from group fix_test
> 2019-04-10 20:35:24,899 INFO [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> (q.m.h.left) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)] [grp(/fix_test)] HA-1006
> : Left : Node : 'STO2_TEST_HA' (node2:5060)
> 2019-04-10 20:35:24,899 INFO [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> (q.m.h.role_changed) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)] [grp(/fix_test)]
> HA-1010 : Role change reported: Node : 'STO2_TEST_HA' (node2:5060) : from
> 'MASTER' to 'UNREACHABLE'
> 2019-04-10 20:35:25,907 WARN [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Node 'STO2_TEST_HA' from
> group fix_test is responding again.
> 2019-04-10 20:35:25,907 INFO [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> (q.m.h.joined) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)] [grp(/fix_test)]
> HA-1005 : Joined : Node : 'STO2_TEST_HA' (node2:5060)
> 2019-04-10 20:35:25,907 INFO [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> (q.m.h.role_changed) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)] [grp(/fix_test)]
> HA-1010 : Role change reported: Node : 'STO2_TEST_HA' (node2:5060) : from
> 'UNREACHABLE' to 'MASTER'
>
>
> Sample log entry on node 2
>
> 2019-04-11 06:15:20,769 WARN [Group-Change-Learner:fix_test:STO2_TEST_HA]
> (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Timeout whilst determining
> state for node 'RCO_TEST_HA1' from group fix_test
> 2019-04-11 06:15:20,775 INFO [Group-Change-Learner:fix_test:STO2_TEST_HA]
> (q.m.h.left) - [grp(/fix_test)/vhn(/STO2_TEST_HA)] [grp(/fix_test)] HA-1006
> : Left : Node : 'RCO_TEST_HA1' (node1:5000)
> 2019-04-11 06:15:20,775 INFO [Group-Change-Learner:fix_test:STO2_TEST_HA]
> (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO2_TEST_HA)] [grp(/fix_test)]
> HA-1010 : Role change reported: Node : 'RCO_TEST_HA1' (node1:5000) : from
> 'REPLICA' to 'UNREACHABLE'
> 2019-04-11 06:15:21,782 WARN [Group-Change-Learner:fix_test:STO2_TEST_HA]
> (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Node 'RCO_TEST_HA1' from
> group fix_test is responding again.
> 2019-04-11 06:15:21,784 INFO [Group-Change-Learner:fix_test:STO2_TEST_HA]
> (q.m.h.joined) - [grp(/fix_test)/vhn(/STO2_TEST_HA)] [grp(/fix_test)]
> HA-1005 : Joined : Node : 'RCO_TEST_HA1' (node1:5000)
> 2019-04-11 06:15:21,784 INFO [Group-Change-Learner:fix_test:STO2_TEST_HA]
> (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO2_TEST_HA)] [grp(/fix_test)]
> HA-1010 : Role change reported: Node : 'RCO_TEST_HA1' (node1:5000) : from
> 'UNREACHABLE' to 'REPLICA'
>
>
>
> Log on node 3 is the same.
>
> This warning repeats pretty much every 30 minutes or so (not consistent).
> It doesn't really cause any harm, as I am assuming that there are latency
> spikes between the datacenters causing this issue, so is there a way to
> tune replication to be a bit less sensitive to it? Specifically its obvious
> that it successfully recovers replication 1 second after it detects a
> problem (that pattern is very consistent, going all the way back to
> January). Here's an example of 1, also recovering 1 second later:
>
>
> 2019-01-23 18:00:38,495 WARN [Group-Change-Learner:fix_test:STO_TEST_HA]
> (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Timeout whilst determining
> state for node 'TEST_HA' from group fix_test
> 2019-01-23 18:00:38,495 INFO [Group-Change-Learner:fix_test:STO_TEST_HA]
> (q.m.h.left) - [grp(/fix_test)/vhn(/STO_TEST_HA)] [grp(/fix_test)] HA-1006
> : Left : Node : 'TEST_HA' (testnode1:5050)
> 2019-01-23 18:00:38,496 INFO [Group-Change-Learner:fix_test:STO_TEST_HA]
> (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO_TEST_HA)] [grp(/fix_test)]
> HA-1010 : Role change reported: Node : 'TEST_TEST_HA' (testnode1:5050) :
> from 'REPLICA' to 'UNREACHABLE'
> 2019-01-23 18:00:39,505 WARN [Group-Change-Learner:fix_test:STO_TEST_HA]
> (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Node 'TEST_HA' from group
> fix_test is responding again.
> 2019-01-23 18:00:39,506 INFO [Group-Change-Learner:fix_test:STO_TEST_HA]
> (q.m.h.joined) - [grp(/fix_test)/vhn(/STO_TEST_HA)] [grp(/fix_test)]
> HA-1005 : Joined : Node : 'TEST_HA' (testnode1:5050)
> 2019-01-23 18:00:39,506 INFO [Group-Change-Learner:fix_test:STO_TEST_HA]
> (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO_TEST_HA)] [grp(/fix_test)]
> HA-1010 : Role change reported: Node : 'TEST_HA' (testnode1:5050) : from
> 'UNREACHABLE' to 'REPLICA'
>
>
>
> It might be that HA is a bit over-sensitive to latency spikes, so I wonder
> if I could make it more tolerant (maybe wait an extra second?).
>
> Thanks!
>