Hi Igor,

Each node in the group monitors the state of its counterparts by sending
ping commands using "com.sleepycat.je.rep.util.DbPing".
Older versions of the BDB JE environment did not react promptly to majority
loss. As a result, the monitoring functionality was added to detect when
other nodes in the group become unreachable and to generate special
operational logs. Newer BDB JE versions are much better at detecting node
disconnects, but BDB JE still does not provide a good API for listening to
the state of other nodes in the group. Thus, the monitoring functionality
was kept in the broker.
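
For reference, the same state query can be issued by hand with the DbPing
command line tool; a sketch using the names from your logs (assuming
je.jar is on the classpath; flag names as per the JE DbPing usage message):

   java -cp je.jar com.sleepycat.je.rep.util.DbPing \
        -groupName fix_test -nodeName STO2_TEST_HA \
        -nodeHost node2:5060 -socketTimeout 1000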

The DbPing is issued periodically at an interval defined by the context
variable "qpid.bdb.ha.remote_node_monitor_interval". By default, the value
is set to 1000 milliseconds.
The monitoring functionality waits for a timeout (specified by the context
variable "qpid.bdb.ha.remote_node_monitor_timeout") for the DbPing to
receive the status details of the other node. The timeout default value is
1000 milliseconds as well. If the timeout expires and no response to the
DbPing has been received, the node is considered UNREACHABLE and
operational logs like the ones below are generated:
HA-1006 : Left : Node : 'TEST_HA' (testnode1:5050)
HA-1010 : Role change reported: Node : 'TEST_TEST_HA' (testnode1:5050) :
from 'REPLICA' to 'UNREACHABLE'

There is a third timeout for the DbPing to open a socket to the remote
node. The value for this timeout is set with the context variable
"qpid.bdb.ha.db_ping_socket_timeout". The default value is also set to
1000 milliseconds.
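
To summarise, the three settings and their defaults (all values in
milliseconds; the comments restate the behaviour described above):

   qpid.bdb.ha.remote_node_monitor_interval=1000  # how often DbPing is issued
   qpid.bdb.ha.remote_node_monitor_timeout=1000   # wait for node state details
   qpid.bdb.ha.db_ping_socket_timeout=1000        # open socket to remote node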

It seems that in your environment the default timeout values could be
increased in order to reduce the number of misleading logs caused by
timeouts expiring before DbPing responses are received.

I think the default timeout values are a bit optimistic. It would be more
realistic to set qpid.bdb.ha.remote_node_monitor_timeout to 5 seconds and
qpid.bdb.ha.db_ping_socket_timeout to 3 seconds, or even greater values.

The timeouts can be overridden in multiple ways:
* using JVM system properties, for example
   export QPID_OPTS="-Dqpid.bdb.ha.remote_node_monitor_timeout=5000
-Dqpid.bdb.ha.db_ping_socket_timeout=3000"
* by setting context variables at the broker level (globally for all BDB HA
Virtual host nodes) or directly at the BDB HA Virtual host node level, for
example via the HTTP management interface as sketched below
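
A sketch of a broker-level override via the REST API (host, port and
credentials are placeholders for your setup; note that this updates the
broker's "context" attribute, so check any existing context entries first):

   curl -u <user>:<password> -X POST -H "Content-Type: application/json" \
        -d '{"context":{"qpid.bdb.ha.remote_node_monitor_timeout":"5000",
                        "qpid.bdb.ha.db_ping_socket_timeout":"3000"}}' \
        http://localhost:8080/api/latest/broker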

If required, the context variable overrides can be added to a customised
initial broker configuration, which can then be used to create new broker
instances.
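
For instance, the initial configuration JSON could carry the overrides in
the broker's context (a minimal fragment; the other attributes are whatever
your initial configuration already contains):

   {
     "name": "${broker.name}",
     "modelVersion": "7.1",
     "context": {
       "qpid.bdb.ha.remote_node_monitor_timeout": "5000",
       "qpid.bdb.ha.db_ping_socket_timeout": "3000"
     },
     ...
   }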

Kind Regards,
Alex


On Thu, 11 Apr 2019 at 17:34, Igor Natanzon <igor.natan...@gmail.com> wrote:

> Broker-J 7.1.1
> I have a following HA node setup:
> Datacenter 1: node 1 (master)
> Datacenter 2: node 2 (replica), node 3 (replica)
>
> All nodes are set to sync because data preservation and consistency are
> critical for this use-case (latency is not a high priority).
> Ping between the two datacenters (which are geographically near each
> other) is on average 2ms.
>
> What I see on a periodic basis is the constant timeouts between the nodes,
> which always recover approximately one second later.
>
> Log on node 1:
>
> 2019-04-10 20:35:24,898 WARN  [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Timeout whilst determining
> state for node 'STO2_TEST_HA' from group fix_test
> 2019-04-10 20:35:24,899 INFO  [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> (q.m.h.left) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)] [grp(/fix_test)] HA-1006
> : Left : Node : 'STO2_TEST_HA' (node2:5060)
> 2019-04-10 20:35:24,899 INFO  [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> (q.m.h.role_changed) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)] [grp(/fix_test)]
> HA-1010 : Role change reported: Node : 'STO2_TEST_HA' (node2:5060) : from
> 'MASTER' to 'UNREACHABLE'
> 2019-04-10 20:35:25,907 WARN  [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Node 'STO2_TEST_HA' from
> group fix_test is responding again.
> 2019-04-10 20:35:25,907 INFO  [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> (q.m.h.joined) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)] [grp(/fix_test)]
> HA-1005 : Joined : Node : 'STO2_TEST_HA' (node2:5060)
> 2019-04-10 20:35:25,907 INFO  [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> (q.m.h.role_changed) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)] [grp(/fix_test)]
> HA-1010 : Role change reported: Node : 'STO2_TEST_HA' (node2:5060) : from
> 'UNREACHABLE' to 'MASTER'
>
>
> Sample log entry on node 2
>
> 2019-04-11 06:15:20,769 WARN  [Group-Change-Learner:fix_test:STO2_TEST_HA]
> (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Timeout whilst determining
> state for node 'RCO_TEST_HA1' from group fix_test
> 2019-04-11 06:15:20,775 INFO  [Group-Change-Learner:fix_test:STO2_TEST_HA]
> (q.m.h.left) - [grp(/fix_test)/vhn(/STO2_TEST_HA)] [grp(/fix_test)] HA-1006
> : Left : Node : 'RCO_TEST_HA1' (node1:5000)
> 2019-04-11 06:15:20,775 INFO  [Group-Change-Learner:fix_test:STO2_TEST_HA]
> (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO2_TEST_HA)] [grp(/fix_test)]
> HA-1010 : Role change reported: Node : 'RCO_TEST_HA1' (node1:5000) : from
> 'REPLICA' to 'UNREACHABLE'
> 2019-04-11 06:15:21,782 WARN  [Group-Change-Learner:fix_test:STO2_TEST_HA]
> (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Node 'RCO_TEST_HA1' from
> group fix_test is responding again.
> 2019-04-11 06:15:21,784 INFO  [Group-Change-Learner:fix_test:STO2_TEST_HA]
> (q.m.h.joined) - [grp(/fix_test)/vhn(/STO2_TEST_HA)] [grp(/fix_test)]
> HA-1005 : Joined : Node : 'RCO_TEST_HA1' (node1:5000)
> 2019-04-11 06:15:21,784 INFO  [Group-Change-Learner:fix_test:STO2_TEST_HA]
> (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO2_TEST_HA)] [grp(/fix_test)]
> HA-1010 : Role change reported: Node : 'RCO_TEST_HA1' (node1:5000) : from
> 'UNREACHABLE' to 'REPLICA'
>
>
>
> Log on node 3 is the same.
>
> This warning repeats pretty much every 30 minutes or so (not consistent).
> It doesn't really cause any harm, as I am assuming that there are latency
> spikes between the datacenters causing this issue, so is there a way to
> tune replication to be a bit less sensitive to it? Specifically, it's
> obvious that it successfully recovers replication one second after it
> detects a problem (that pattern is very consistent, going all the way
> back to January). Here's an example of one, also recovering one second
> later:
>
>
> 2019-01-23 18:00:38,495 WARN  [Group-Change-Learner:fix_test:STO_TEST_HA]
> (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Timeout whilst determining
> state for node 'TEST_HA' from group fix_test
> 2019-01-23 18:00:38,495 INFO  [Group-Change-Learner:fix_test:STO_TEST_HA]
> (q.m.h.left) - [grp(/fix_test)/vhn(/STO_TEST_HA)] [grp(/fix_test)] HA-1006
> : Left : Node : 'TEST_HA' (testnode1:5050)
> 2019-01-23 18:00:38,496 INFO  [Group-Change-Learner:fix_test:STO_TEST_HA]
> (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO_TEST_HA)] [grp(/fix_test)]
> HA-1010 : Role change reported: Node : 'TEST_TEST_HA' (testnode1:5050) :
> from 'REPLICA' to 'UNREACHABLE'
> 2019-01-23 18:00:39,505 WARN  [Group-Change-Learner:fix_test:STO_TEST_HA]
> (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Node 'TEST_HA' from group
> fix_test is responding again.
> 2019-01-23 18:00:39,506 INFO  [Group-Change-Learner:fix_test:STO_TEST_HA]
> (q.m.h.joined) - [grp(/fix_test)/vhn(/STO_TEST_HA)] [grp(/fix_test)]
> HA-1005 : Joined : Node : 'TEST_HA' (testnode1:5050)
> 2019-01-23 18:00:39,506 INFO  [Group-Change-Learner:fix_test:STO_TEST_HA]
> (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO_TEST_HA)] [grp(/fix_test)]
> HA-1010 : Role change reported: Node : 'TEST_HA' (testnode1:5050) : from
> 'UNREACHABLE' to 'REPLICA'
>
>
>
> It might be that HA is a bit over-sensitive to latency spikes, so I
> wonder if I could make it more tolerant (maybe wait an extra second?).
>
> Thanks!
>
