Re: Consumer Delta Sync Lost After Provider Restarted

Quanah Gibson-Mount Tue, 06 Jul 2021 14:30:19 -0700

--On Tuesday, June 29, 2021 6:50 PM +0000 thomaswilliampritch...@gmail.com wrote:

Hi,

I'm experiencing an issue with delta-syncrepl in my setup of 3 providers
and multiple consumers. We designate one provider as the primary (active)
and send all writes to it as long as it is healthy; the other two
replicate and act as standby providers, ready to take over in the event
of a failure. All consumers replicate from all providers, and all
providers replicate from all providers. After the system had been running
healthily for over a week, a standby provider was restarted. This caused
my consumers to re-establish the persistent sync connection. Upon
re-establishing the connection, some consumers began a sync refresh with
the following messages.

Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210610231714.990702Z#000000#066#000000;20210614191744.122968Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000)
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 LDAP_RES_SEARCH_RESULT
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 delta-sync lost sync, switching to REFRESH
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 (4096) Content Sync Refresh Required

This was re-establishing a connection with rid=003, which is
"20210412175600.595586Z#000000#835#000000" (a standby system); however, we
have only been sending writes to server #44d# (the primary provider). We
see that the 44d CSN is over 7 days old, beyond our providers' accesslog
retention period. On the consumer that did not trigger a sync refresh we
see:

Jun 28 18:32:55 openldap-hdb-consumer slapd[24439]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210621212459.620195Z#000000#066#000000;20210621214400.407867Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000)

Here we see that 20210621214400.407867Z#000000#44d#000000 is much more
recent and did not trigger a full resync, although it is close to the 7
day threshold at this point. We notice that the rid=003 835 CSN is the
same as on the consumer experiencing the problem, which makes me believe
the old #44d# CSN is what causes this sync refresh.

I am concerned that when the standby provider is restarted, the
connection is re-established with old provider CSNs. When I search for
the CSNs on the consumers, they look newer than the ones used to
re-establish the connection. If we restart slapd on the providers after
the consumers have been running for 7 days, it seems it will trigger a
sync refresh. How can we make the consumers re-establish the connection
with the most recent CSN? Replication is working as expected; only the
CSNs in this connection message seem to remain old. The sync refresh
causes a large load on the consumers and providers, spiking bind times
and degrading service, which makes this concerning for our production
environment.

The actual age of the CSN is generally immaterial, as long as that is what
the current CSN on the provider is. I.e., if the CSN on the 835 provider
*for itself* matches what was on the consumer, that's fine. The real issue
seems to be that the consumer stopped receiving updates for CSN 44d, so
when the session was bounced for any given provider, the consumer was going
to go into REFRESH. What you need to determine is why that consumer
stopped receiving updates, as this would trigger a refresh no matter which
provider got bounced, since none of the providers would have the data
available in their accesslog.
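
As a rough illustration of the comparison described above, the following Python sketch splits a syncrepl cookie into its per-SID CSN timestamps and reports which SIDs are behind on one consumer relative to another. The function names (parse_cookie_csns, lagging_sids) are hypothetical, and the two cookies are trimmed to three SIDs taken from the log excerpts quoted earlier; this is a diagnostic aid under those assumptions, not part of OpenLDAP.

```python
from datetime import datetime, timezone


def parse_cookie_csns(cookie):
    """Map each server ID (SID) in a syncrepl cookie to its CSN timestamp."""
    csn_field = next(p for p in cookie.split(",") if p.startswith("csn="))
    csns = {}
    for csn in csn_field[len("csn="):].split(";"):
        # CSN layout: <timestamp>#<change count>#<SID>#<modifier>
        stamp, _count, sid, _mod = csn.split("#")
        csns[sid] = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ").replace(
            tzinfo=timezone.utc
        )
    return csns


def lagging_sids(suspect, reference):
    """Return {SID: lag} for every SID where `suspect` is behind `reference`."""
    return {
        sid: reference[sid] - ts
        for sid, ts in suspect.items()
        if sid in reference and reference[sid] > ts
    }


# Cookies trimmed to SIDs 066, 44d and 835 from the two log excerpts.
refreshing = parse_cookie_csns(
    "rid=003,csn=20210610231714.990702Z#000000#066#000000;"
    "20210614191744.122968Z#000000#44d#000000;"
    "20210412175600.595586Z#000000#835#000000"
)
healthy = parse_cookie_csns(
    "rid=003,csn=20210621212459.620195Z#000000#066#000000;"
    "20210621214400.407867Z#000000#44d#000000;"
    "20210412175600.595586Z#000000#835#000000"
)

for sid, lag in sorted(lagging_sids(refreshing, healthy).items()):
    print(f"SID {sid}: behind by {lag}")
```

On these two cookies the sketch reports SIDs 066 and 44d as behind on the consumer that went into REFRESH, while the 835 CSN is identical on both.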

I generally advise using some type of monitoring on the CSNs for each
server so you can quickly be notified when such an issue has arisen. I
would note that your syncrepl configurations do not specify any keepalive
settings, which is generally recommended so that if some type of network
device (load balancers and other traffic management systems do this) closes
the syncrepl connection, slapd can detect this and re-establish it.
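
For reference, a minimal sketch of a consumer syncrepl stanza with a keepalive setting might look like the fragment below. The provider URI, search base, bind DN, credentials, and the specific keepalive numbers are placeholders chosen for illustration, not values from this thread; the keepalive parameter takes the form idle:probes:interval (seconds idle before probing, number of probes, seconds between probes), as described in slapd.conf(5), and is only honoured on systems that allow these values to be customized.

```
# Illustrative slapd.conf consumer fragment; every value is a placeholder.
# Comments are kept above the directive because continuation lines are
# unwrapped before comment processing is applied.
syncrepl rid=003
  provider=ldap://standby1.example.com
  type=refreshAndPersist
  retry="5 +"
  searchbase="dc=example,dc=com"
  bindmethod=simple
  binddn="cn=replicator,dc=example,dc=com"
  credentials=secret
  syncdata=accesslog
  logbase="cn=accesslog"
  logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
  keepalive=240:10:30
```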


Regards,
Quanah

--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>
