Hello everyone,

I am looking for advice on how to monitor replication in an environment with 
approximately 30 FreeIPA servers, all in multi-master replication, using 
389-ds (389-ds-base-1.4.3.28-6). I am currently looking for ways to reduce the 
number of replicas (because there are more to come) and need to justify this to 
the architecture department with evidence based on experimental observations.

The problem we are facing is that our installation has started experiencing 
lag in some operations, such as adding user groups, HBAC rules, and SUDO rules. 
The heaviest operation by impact is automember-rebuild.
The number of entries being added is not large: at most 10 groups and several 
sudo and HBAC rules. For automember-rebuild I cannot say for certain, because I 
have not yet figured out which operations it performs internally. The "lag" 
manifests as latency in LDAP operations, leading to timeouts, which in turn 
cause some services that rely on Kerberos or DNS (because FreeIPA uses the LDAP 
directory for everything) to go down. Our monitoring system also shows that the 
outage propagates through the replicas as replication progresses.

The classic approach of monitoring replication agreements through the 
nsds5replicaLastUpdateStatus attribute is not sufficient. We need a more 
dynamic approach that can show the "waves" or replication sessions throughout 
the environment, which can help in further tuning replication parameters.
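
To illustrate the kind of view meant here, below is a minimal Python sketch 
that turns periodically polled agreement attributes into a time-ordered list of 
replication sessions. It assumes the agreements expose 
nsds5replicaLastUpdateStart and nsds5replicaLastUpdateEnd as generalized-time 
strings, and that an end time earlier than the start time means the session was 
still running when polled. The sample records, the host names, and the 
sessions() helper are hypothetical; the LDAP polling itself is left out.

```python
from datetime import datetime, timezone

def parse_gt(value):
    """Parse a generalized-time string such as '20230421100503Z' (UTC)."""
    return datetime.strptime(value, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)

def sessions(samples):
    """Return (start, supplier, consumer, seconds) rows sorted by start time.

    `samples` is a list of dicts collected by polling each agreement; the
    same session may be seen by several polls, so duplicates are dropped.
    """
    seen = set()
    rows = []
    for s in samples:
        key = (s["supplier"], s["consumer"], s["start"])
        if key in seen:
            continue
        seen.add(key)
        start, end = parse_gt(s["start"]), parse_gt(s["end"])
        if end < start:
            # Assumption: the end value predates the start, so the session
            # had not finished at poll time; skip it until a later poll.
            continue
        rows.append((start, s["supplier"], s["consumer"],
                     (end - start).total_seconds()))
    rows.sort()
    return rows

# Hypothetical polled data, not taken from a real deployment.
SAMPLES = [
    {"supplier": "ipa01", "consumer": "ipa02",
     "start": "20230421100500Z", "end": "20230421100503Z"},
    {"supplier": "ipa01", "consumer": "ipa02",   # same session, polled twice
     "start": "20230421100500Z", "end": "20230421100503Z"},
    {"supplier": "ipa02", "consumer": "ipa03",
     "start": "20230421100504Z", "end": "20230421100512Z"},
    {"supplier": "ipa03", "consumer": "ipa04",   # still in progress
     "start": "20230421100513Z", "end": "19700101000000Z"},
]

for start, supplier, consumer, seconds in sessions(SAMPLES):
    print(start.isoformat(), supplier, "->", consumer, f"{seconds:.0f}s")
```

Polling only sees the last session per agreement, so short sessions between 
two polls are missed; this is one reason the attribute-based approach is not 
enough on its own.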

I am facing the following problems:

1) The only way to get full replication information currently is to turn on 
full replication debugging in the error log. While this can be done in test 
environments, I cannot rely on it in production. I thought that BPF could be 
the answer, but I am not sure whether dirsrv has internal support (predefined 
probe points) for it. Have any of the developers tried to use BPF to monitor 
features in 389-ds?

2) Regardless of BPF support, I can still try to implement monitoring with it, 
in conjunction with debug symbols. However, another problem is that I do not 
know the exact algorithm of the replication process. I have read this article 
(https://www.port389.org/docs/389ds/design/replication_troubleshooting.html), 
but it does not go into enough detail for my purposes. Can you shed some light 
on the approach I should take here? In my mind, the first step should be very 
basic: attach to a set of consumer-level functions responsible for receiving 
replica updates, and monitor the latency, the number of incoming connections at 
a given point in time, and so on. If you could point me in the right direction 
(beyond just pointing to the repository and telling me to search the source 
code), I would greatly appreciate it.

3) This feature 
(https://directory.fedoraproject.org/docs/389ds/design/log-operation-stats.html) 
is not supported in my version of 389-ds, is it? Is there a way to patch my 
version to support it?
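
As a fallback that needs neither debug logging nor BPF, here is a rough Python 
sketch of measuring per-update replication lag on a consumer from its access 
log. It relies on two assumptions that should be checked against the running 
version: that RESULT lines for updates carry etime= and csn= fields, and that 
a CSN is 20 hex digits laid out as an 8-digit timestamp in seconds, a 4-digit 
sequence number, a 4-digit replica ID, and a 4-digit sub-sequence number. The 
helper names and the sample log lines are made up for illustration.

```python
import re
from datetime import datetime

# Leading access-log timestamp; the fractional seconds are ignored because
# the CSN timestamp only has one-second resolution anyway.
TS_RE = re.compile(r"^\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})(?:\.\d+)? ([+-]\d{4})\]")
ETIME_RE = re.compile(r"\betime=([\d.]+)")
CSN_RE = re.compile(r"\bcsn=([0-9a-fA-F]{20})\b")

def decode_csn(csn):
    """Split a CSN string into (unix_timestamp, sequence, replica_id)."""
    return int(csn[0:8], 16), int(csn[8:12], 16), int(csn[12:16], 16)

def replication_lag(lines):
    """Yield one dict per RESULT line that carries a CSN.

    lag_s is the time the result was logged minus the time encoded in the
    CSN; it is only meaningful if the clocks of all servers are in sync.
    """
    for line in lines:
        ts, etime, csn = TS_RE.search(line), ETIME_RE.search(line), CSN_RE.search(line)
        if " RESULT " not in line or not (ts and etime and csn):
            continue
        logged = datetime.strptime(f"{ts.group(1)} {ts.group(2)}",
                                   "%d/%b/%Y:%H:%M:%S %z")
        created, seq, rid = decode_csn(csn.group(1))
        yield {
            "rid": rid,                      # replica where the change originated
            "csn": csn.group(1),
            "lag_s": logged.timestamp() - created,
            "etime_s": float(etime.group(1)),
        }

# Hypothetical log lines for illustration only.
SAMPLE_LOG = [
    "[21/Apr/2023:00:00:10.123456789 +0000] conn=12 op=5 RESULT err=0 tag=103 "
    "nentries=0 wtime=0.000100 optime=0.004000 etime=0.004100 "
    "csn=6441d204000100040000",
    "[21/Apr/2023:00:00:11.000000000 +0000] conn=13 op=1 RESULT err=0 tag=101 "
    "nentries=1 wtime=0.000090 optime=0.000300 etime=0.000390",
]

for rec in replication_lag(SAMPLE_LOG):
    print(rec)
```

Run on every consumer and grouped by CSN, the lag values would show how long 
each change takes to reach each server, which is the "wave" I am trying to 
observe. The filter does not distinguish replicated updates from updates made 
directly by clients on that server; comparing the replica ID in the CSN with 
the local replica ID would be one way to separate them.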

Thank you in advance for your help.
