--On Tuesday, June 29, 2021 6:50 PM +0000 thomaswilliampritch...@gmail.com wrote:

Hi,

I'm experiencing an issue with delta-syncrepl in a setup of 3 providers
and multiple consumers. We designate one provider as primary (active)
and send all writes to it as long as it's healthy; the other two
replicate and act as standby providers, ready to take over in the event
of a failure. All consumers replicate from all providers, and all
providers replicate from each other. After the system had been running
healthily for over a week, a standby provider was restarted. This caused
my consumers to re-establish the persistent sync connection. Upon
re-establishing the connection, some consumers began a sync refresh with
the following messages.

Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210610231714.990702Z#000000#066#000000;20210614191744.122968Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000)
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 LDAP_RES_SEARCH_RESULT
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 delta-sync lost sync, switching to REFRESH
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 (4096) Content Sync Refresh Required

This was re-establishing the connection for rid=003, whose provider is
SID 835 ("20210412175600.595586Z#000000#835#000000", a standby system);
however, we have only been sending writes to server #44d# (the primary
provider). We see that the 44d CSN is over 7 days old, beyond our
providers' access log period. On the consumer that did not trigger a
sync refresh we see:

Jun 28 18:32:55 openldap-hdb-consumer slapd[24439]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210621212459.620195Z#000000#066#000000;20210621214400.407867Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000)

Here we see that 20210621214400.407867Z#000000#44d#000000 is much more
recent and did not trigger a full resync, although it is close to the
7-day threshold at this point. The 835 CSN for rid=003 is the same as on
the consumer experiencing the problem, which makes me believe the stale
#44d# CSN is what causes this sync refresh.

I am concerned that when the standby provider is restarted, the
connection is re-established with old provider CSNs; when I search for
the CSNs on the consumers, they look newer than the ones used to
re-establish the connection. If we restart slapd on the providers after
the consumers have been running for 7 days, it seems it will trigger a
sync refresh. How can we make the consumers re-establish the connection
with the most recent CSN? Replication is working as expected; only the
CSNs in this connection message seem to remain old. The sync refresh
causes a large load on the consumers and providers, spiking bind times
and degrading service, which makes this concerning for our production
environment.

The actual age of the CSN is generally immaterial, as long as it matches the current CSN on the provider. I.e., if the CSN on the 835 provider *for itself* matches what was on the consumer, that's fine. The real issue seems to be that the consumer stopped receiving updates for CSN 44d, so when the session was bounced for any given provider, the consumer was going to go into REFRESH. What you need to determine is why that consumer stopped receiving updates, as this would trigger a refresh no matter which provider got bounced, since none of the providers would have the data available in their accesslog.
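
To make that comparison concrete, here is a minimal Python sketch that splits the two cookies quoted above into per-server-ID CSNs and lists the SIDs on which the refreshing consumer is behind. The helper names (parse_cookie, lagging_sids, make_cookie) are invented for this illustration, and the healthy consumer's cookie is used only as a stand-in reference, since the providers' own contextCSN values are not shown in the thread.

```python
import re

# CSN layout: <timestamp>#<change count>#<server ID>#<modifier>.
CSN_RE = re.compile(r"^\d{14}\.\d{6}Z#[0-9a-f]{6}#([0-9a-f]{3})#[0-9a-f]{6}$")


def parse_cookie(cookie):
    """Map each server ID (SID) in a syncrepl cookie to its CSN string."""
    fields = dict(part.split("=", 1) for part in cookie.split(","))
    by_sid = {}
    for csn in fields["csn"].split(";"):
        match = CSN_RE.match(csn)
        if match is None:
            raise ValueError(f"unrecognised CSN: {csn!r}")
        by_sid[match.group(1)] = csn
    return by_sid


def lagging_sids(consumer, reference):
    """SIDs whose CSN on the consumer is older than (or missing from) the reference."""
    # CSNs start with a fixed-width timestamp, so string order is time order.
    return sorted(sid for sid, csn in reference.items() if consumer.get(sid, "") < csn)


def make_cookie(csns):
    """Rebuild a cookie string in the form seen in the log lines."""
    return "rid=003,csn=" + ";".join(csns)


# CSNs that are identical in both cookies from the log excerpts.
SHARED = [
    "20210331192036.214412Z#000000#000#000000",
    "20210119225955.133811Z#000000#001#000000",
    "20210128213906.596429Z#000000#002#000000",
    "20210226190704.219043Z#000000#005#000000",
    "20210412181659.152626Z#000000#065#000000",
    "20210412175600.595586Z#000000#835#000000",
    "20210423182110.684843Z#000000#836#000000",
    "20210331193249.570935Z#000000#ce5#000000",
]
# Consumer that fell back to REFRESH (slapd[15746]).
STALE = SHARED + [
    "20210610231714.990702Z#000000#066#000000",
    "20210614191744.122968Z#000000#44d#000000",
]
# Consumer that did not refresh (slapd[24439]).
HEALTHY = SHARED + [
    "20210621212459.620195Z#000000#066#000000",
    "20210621214400.407867Z#000000#44d#000000",
]

behind = lagging_sids(parse_cookie(make_cookie(STALE)), parse_cookie(make_cookie(HEALTHY)))
print(behind)  # ['066', '44d']
```

Run against the two quoted cookies, this shows the refreshing consumer behind on SID 066 as well as 44d, while its 835 value matches. That fits the reading that the bounced provider's own CSN was not the problem.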

I generally advise using some type of monitoring on the CSNs for each server so you can be notified quickly when such an issue arises. Note also that your syncrepl configurations do not specify any keepalive settings. Setting them is generally recommended so that if some type of network device (load balancers and other traffic management systems do this) closes the syncrepl connection, slapd can detect this and re-establish it.
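
As a rough illustration of such monitoring, the Python sketch below takes the contextCSN values of a provider and a consumer (for example, as read from the suffix entry of each server with a base-scope search) and reports the SIDs on which the consumer trails by more than a threshold. The function names and the one-hour threshold are assumptions made for this example, not anything from the thread, and the "provider" values are stand-ins copied from the healthy consumer's cookie above.

```python
from datetime import datetime, timedelta, timezone


def csn_time(csn):
    """Return the UTC timestamp carried in the first field of a CSN."""
    stamp = csn.split("#", 1)[0]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ").replace(tzinfo=timezone.utc)


def sid_of(csn):
    """Return the server ID, the third '#'-separated field of a CSN."""
    return csn.split("#")[2]


def replication_lag(provider_csns, consumer_csns):
    """Per-SID lag of the consumer behind the provider; None if the SID is missing."""
    consumer = {sid_of(c): csn_time(c) for c in consumer_csns}
    lag = {}
    for csn in provider_csns:
        seen = consumer.get(sid_of(csn))
        if seen is None:
            lag[sid_of(csn)] = None
        else:
            lag[sid_of(csn)] = max(csn_time(csn) - seen, timedelta(0))
    return lag


def sids_to_alert(lag, threshold):
    """SIDs that are missing on the consumer or lag by more than the threshold."""
    return sorted(sid for sid, delta in lag.items() if delta is None or delta > threshold)


# Stand-in provider values (taken from the healthy consumer's cookie).
provider = [
    "20210621214400.407867Z#000000#44d#000000",
    "20210412175600.595586Z#000000#835#000000",
]
# Values held by the consumer that went into REFRESH.
consumer = [
    "20210614191744.122968Z#000000#44d#000000",
    "20210412175600.595586Z#000000#835#000000",
]

lag = replication_lag(provider, consumer)
print(sids_to_alert(lag, threshold=timedelta(hours=1)))  # ['44d']
```

A check like this, run periodically, would have flagged the 44d gap long before it aged past the accesslog window. In practice the threshold should be well below the accesslog purge age.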

Regards,
Quanah

--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>
