Hi,

we are seeing a rather strange behaviour in which a slapd server stops processing connections for some tens of seconds while a single thread runs at 100% on one CPU and all the other CPUs are almost idle. When the problem arises there is no significant iowait or disk I/O (and no swapping, which is disabled). Context switches drop to near zero (from some tens of thousands to a few hundred). The load average is almost always under 2.
The server has 32G of RAM and 4 HT processors, and is running openldap-2.4.54 in mirror mode (but with no delta replication) using the mdb backend. The same behaviour was also seen with 2.4.53. OpenLDAP is the only service running on it, apart from SSH and some monitoring tools. The database maxsize is 25G, of which around 17G are used. I'm attaching a redacted configuration of the main server (the secondary one is the same, with the IDs swapped for mirror mode use).

Most of the time it works just fine, processing up to a few thousand read queries per second along with some tens of writes per second. Connections are managed by HAProxy, which sends them to this server by default (it is used as the main node). Most of these stops are short (around 10 seconds) and we don't lose connections, but when the problem lasts long enough, HAProxy switches to the second node and we get downtime. When we stay on the secondary node we see the same behaviour. The problem shows no periodicity, and looking at the number of connections just before it we could not see any usage peak.

We tried to strace the slapd threads during the problem, and they seem to be blocked on a mutex, waiting for the one thread that is running at 100% (on a single CPU, in user time). I'm attaching the top output captured during one of these events.

From this behaviour I suspected (just a wild and uninformed guess) some indexing issue blocking all access. We tried changing tool-threads to 4, because I found it cited in some examples as related to the threads used for indexing, but the change had no effect. Re-reading the latest version of the man page, if I understand it correctly, it only affects slapadd and the other tools.

So a first question is: is there any other configuration parameter related to indexing that I can try? In any case I'm not sure there really is an indexing issue (the indexes are quite basic). I suspected it because there are a lot of writes, and there is no strace activity during the stop. Should I look somewhere else?
Any suggestion on further checks or configuration changes will be more than appreciated.

Regards
Simone
#
# See slapd.conf(5) for details on configuration options.
# This file should NOT be world readable.
#
include /usr/local/openldap/etc/openldap/schema/corba.schema
include /usr/local/openldap/etc/openldap/schema/core.schema
include /usr/local/openldap/etc/openldap/schema/cosine.schema
include /usr/local/openldap/etc/openldap/schema/duaconf.schema
include /usr/local/openldap/etc/openldap/schema/dyngroup.schema
include /usr/local/openldap/etc/openldap/schema/inetorgperson.schema
include /usr/local/openldap/etc/openldap/schema/java.schema
include /usr/local/openldap/etc/openldap/schema/misc.schema
include /usr/local/openldap/etc/openldap/schema/nis.schema
include /usr/local/openldap/etc/openldap/schema/openldap.schema
include /usr/local/openldap/etc/openldap/schema/ppolicy.schema
include /usr/local/openldap/etc/openldap/schema/collective.schema

#add OurOrganization schema
include /usr/local/openldap/etc/openldap/schema/OurOrganization.schema

# Allow LDAPv2 client connections. This is NOT the default.
allow bind_v2

# This is for mirrormode replication
serverID 11

# Global ACLs
include /usr/local/openldap/etc/openldap/acls/global.acl

# Do not enable referrals until AFTER you have a working directory
# service AND an understanding of referrals.
#referral ldap://root.openldap.org

pidfile /usr/local/openldap/var/run/slapd.pid
argsfile /usr/local/openldap/var/run/slapd.args

# options: none sync parse shell stats2 stats ACL config filter BER conns args packets trace any
# https://www.openldap.org/doc/admin24/slapdconfig.html
#loglevel none
#loglevel stats sync
loglevel stats
#loglevel none
#loglevel any

# The next three lines allow use of TLS for encrypting connections using a
# dummy test certificate which you can generate by running
# /usr/libexec/openldap/generate-server-cert.sh. Your client software may balk
# at self-signed certificates, however.
TLSCACertificatePath /usr/local/openldap/etc/openldap/certs
TLSCACertificateFile /usr/local/openldap/etc/openldap/certs/rootCA.pem
TLSCertificateFile /usr/local/openldap/etc/openldap/certs/server.crt
TLSCertificateKeyFile /usr/local/openldap/etc/openldap/certs/server.key
#TLSCertificateFile /etc/pki/tls/certs/ldap1_pubkey.pem
#TLSCertificateKeyFile /etc/pki/tls/certs/ldap1_privkey.pem

sizelimit 250000

# Setup the idle timeout to prevent app servers from taking down ldap.
# logout idle clients after 30 seconds
idletimeout 10

#######################################################################
# database definitions
#######################################################################

#######################################################################
# Monitor
#######################################################################
database monitor
include /usr/local/openldap/etc/openldap/acls/monitor.acl
rootdn "uid=monitor,cn=Monitor"
rootpw ZZZ

#######################################################################
# Database specific directives apply to this databasse until another
# 'database' directive occurs
#######################################################################
database mdb
suffix "o=ourorg"

# Where the database file are physically stored for database
#directory /usr/local/openldap/var/openldap-data
directory /data/openldap-data

rootdn "uid=root,cn=special,o=ourorg"
rootpw {SSHA}XXX

monitoring on

maxsize 25769803776
envflags writemap nometasync

# Ourorg settings: we want uid,cn, and uniqueMember indexed
# Indexing options for database
index uid eq
index cn eq
index objectClass eq
index uniqueMember eq
index entryCSN,entryUUID eq

tool-threads 4

#########################################################################
# FST db specific ACLs
#########################################################################
include /usr/local/openldap/etc/openldap/acls/fst.acl

# Give unlimited access to search this database for syncrepl
limits dn.exact="uid=syncuser,cn=special,o=ourorg" size.hard=unlimited size.soft=unlimited time.hard=unlimited time.soft=unlimited
limits dn.exact="uid=slaveuser,cn=special,o=ourorg" size.hard=unlimited size.soft=unlimited time.hard=unlimited time.soft=unlimited

# Syncrepl Provider for ourorg db
overlay syncprov
# update the contextCSN in the database after either
# 100 successful write operations OR
# more than 10 minutes have elapsed
# since the last time the contextCSN was written to the database
syncprov-checkpoint 100 10

# Syncrepl provider maintains a record of last 100 successful write operations
# The current design of the session log store is memory based
syncprov-sessionlog 100

############################################################################
# Syncrepl consumer directives
############################################################################
syncrepl rid=12
  provider=ldaps://ldp-12.ourorg.org
  tls_reqcert=never
  bindmethod=simple
  binddn="uid=syncuser,cn=special,o=ourorg"
  credentials=YYY
  searchbase="o=ourorg"
  schemachecking=on
  type=refreshAndPersist
  retry="60 +"

#############################################################################
# MirrorMode setup
#############################################################################
mirrormode on

# The lastmod overlay dynamically generates an entry with RDN "cn=Lastmod", rooted
# at the underlying database suffix, that contains the relevant info about the last
# modification that occurred in the underlying database.
lastmod on
top - 09:25:26 up 14 days,  9:39,  1 user,  load average: 0.63, 0.59, 0.57
Tasks: 155 total,   2 running,  99 sleeping,   0 stopped,   1 zombie
Cpu0  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.3%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32466708k total, 17732364k used, 14734344k free,   438012k buffers
Swap:        0k total,        0k used,        0k free, 15743896k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
21439 ldap      20   0 25.6g  12g  12g S 99.8 41.8   5606:40 slapd
24518 root      39  19  7732 5260  884 S  0.7  0.0   1:53.74 apps.plugin
 2325 zabbix    20   0 99.2m 3444 2496 R  0.3  0.0  39:01.31 zabbix_agentd
24294 netdata   39  19  154m  82m 2580 S  0.3  0.3   0:58.63 netdata
24512 netdata   39  19  152m  25m 7196 S  0.3  0.1   0:12.71 python
29208 spiccard  20   0 15368 2308 1956 R  0.3  0.0   0:00.02 top
    1 root      20   0 19696 2580 2256 S  0.0  0.0   0:01.61 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.09 kthreadd
    4 root       0 -20     0    0    0 I  0.0  0.0   0:00.00 kworker/0:0H
    6 root       0 -20     0    0    0 I  0.0  0.0   0:00.00 mm_percpu_wq
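
The top output above is per process, so it shows slapd at 99.8% but not which of its threads is the busy one. A minimal sketch of how that thread could be identified during the next stall is below. It assumes Linux with /proc mounted and Python 3; the function names are our own, nothing here is part of OpenLDAP. It samples the per-thread CPU counters in /proc/<pid>/task/<tid>/stat twice and prints the threads that used the most CPU in between.

```python
import os
import sys
import time


def parse_cpu_ticks(stat_line):
    """Return (utime, stime) in clock ticks from one /proc/<pid>/task/<tid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before splitting on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]), int(rest[12])


def sample_threads(pid):
    """Map each thread ID of the process to its (utime, stime) counters."""
    ticks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as fh:
                ticks[int(tid)] = parse_cpu_ticks(fh.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited between listdir() and open()
    return ticks


def busiest_threads(before, after, top=3):
    """Return [(tid, user_delta, sys_delta)], largest total delta first."""
    deltas = []
    for tid, (u1, s1) in after.items():
        # A thread that is new in the second sample counts as zero usage.
        u0, s0 = before.get(tid, (u1, s1))
        deltas.append((tid, u1 - u0, s1 - s0))
    deltas.sort(key=lambda d: d[1] + d[2], reverse=True)
    return deltas[:top]


if __name__ == "__main__" and len(sys.argv) > 1:
    pid = int(sys.argv[1])       # e.g. the slapd PID shown by top
    interval = 5.0               # seconds between the two samples (arbitrary choice)
    first = sample_threads(pid)
    time.sleep(interval)
    second = sample_threads(pid)
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    for tid, user, system in busiest_threads(first, second):
        print(f"tid={tid} user={user / hz:.2f}s sys={system / hz:.2f}s in {interval:.0f}s")
```

Run with the slapd PID as the only argument while the stall is in progress. The thread ID it reports is the same number strace and a debugger show for that thread, so it can be used to pick the right one when looking at what the thread is actually doing.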