Hello,

Lately we have been having serious problems with sssd caching. On our ssh
servers (each with ~100-200 users), login may take several minutes because
the sssd_be process uses 100% CPU time, and it can stay in this state for
days. Clearing the cache and restarting sssd during the day usually helps,
and then everything works for a few days, sometimes only hours. It is not
clear what triggers this behaviour; maybe some combination of lots of users
and a cache update at the same time.

The culprit seems to be the recent addition of a few big groups to LDAP
for our access policy, which has worsened sssd performance.

On a test server with an empty cache and the same settings as in
production, a simple id command takes:
[root@testsk tmp]# time id testusr
uid=1143(testusr) gid=100(users) groups=100(users),3318(roam),3102(nixe),1000(staff1),3785(wl-staff1),3119(system),3402(fileaccess),3377(vpn1),120(grp2),3123(devel),1001(devel3),3378(vpn2),3266(usr),3386(access3)

real    0m28.689s
user    0m0.006s
sys    0m0.007s
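
For anyone who wants to repeat this measurement from a script, here is a
minimal Python sketch of roughly what id does through NSS: look up the
user, fetch the group list, then resolve each gid to a name. The function
name time_id_lookup is made up for this example, and the sketch only
approximates id; it is not a replacement for it.

```python
import grp
import os
import pwd
import sys
import time


def time_id_lookup(username):
    """Resolve a user's groups via NSS and return (group_names, seconds)."""
    start = time.perf_counter()
    entry = pwd.getpwnam(username)  # raises KeyError for unknown users
    gids = os.getgrouplist(username, entry.pw_gid)  # initgroups-style lookup
    names = []
    for gid in gids:
        try:
            names.append(grp.getgrgid(gid).gr_name)  # per-group lookup
        except KeyError:
            names.append(str(gid))  # gid without a resolvable name
    elapsed = time.perf_counter() - start
    return names, elapsed


if __name__ == "__main__" and len(sys.argv) > 1:
    group_names, seconds = time_id_lookup(sys.argv[1])
    print("%d groups resolved in %.3fs" % (len(group_names), seconds))
```

Run it with the username as the only argument, on an empty cache, to get a
number comparable to the "real" time above.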

We currently have several groups with around 17 000 and 3000 users, so this
id query adds over 100k ghost users to the cache:

[root@testsk tmp]# ldbsearch -H /var/lib/sss/db/cache_TESTAUTH.ldb | grep ghost | wc -l
asq: Unable to register control with rootdse!
105196
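
To see which groups contribute most of those entries, the same ldbsearch
output can be broken down per record. The following is a rough Python
sketch, assuming the usual LDIF layout (a "dn:" line starting each record,
one "ghost:" line per ghost member, a blank line between records). The
function name count_ghosts is made up for this example, and folded or
base64-encoded DNs are not handled.

```python
import sys
from collections import Counter


def count_ghosts(ldif_lines):
    """Return a Counter mapping each record's DN to its number of ghost values."""
    counts = Counter()
    current_dn = None
    for raw in ldif_lines:
        line = raw.rstrip("\n")
        if line.startswith("dn:"):
            current_dn = line[3:].strip()  # start of a new record
        elif not line.strip():
            current_dn = None  # blank line ends the record
        elif current_dn is not None and line.startswith("ghost:"):
            counts[current_dn] += 1  # one ghost member of this record
    return counts


if __name__ == "__main__":
    # Print the ten records holding the most ghost entries.
    for dn, n in count_ghosts(sys.stdin).most_common(10):
        print("%8d  %s" % (n, dn))
```

Usage would be to pipe the ldbsearch output from the command above into the
script in place of the grep and wc steps.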

Indeed, with full debug enabled (the id command then takes over 1 minute),
the logs show the LDAP backend mostly adding ghost users to the cache, as
it stores information from _all_ groups related to that uid. As the backend
does not respond to monitor pings fast enough, the monitor tries to kill
and restart it. The same also happens on the production servers. I have
already extended the timeout to 60, but that does not seem to be enough.

This latter case became especially relevant when we started to receive
complaints from some people that httpd authentication was not working.
The Apache error log shows:
[Tue Oct 29 12:21:36 2013] [error] [client xxx.xx.xx.xx] GROUP: testuser not in required group(s).
when in fact the user is in the required group; it seems that sssd just
fails to respond fast enough. This is a (PAM, AuthType Basic, Require group
testgroup) kind of authentication.

This is on RHEL 6.4 with sssd-1.9.2-82.10.el6_4.x86_64. Configured
services are nss and ldap.
Sanitized config:
------------------------
[sssd]
config_file_version = 2
debug_level = 1
reconnection_retries = 3
timeout = 60
services = nss
domains = TESTAUTH
[nss]
filter_groups = root
filter_users = root
reconnection_retries = 3
debug_level = 1
[domain/TESTAUTH]
debug_level = 1
ldap_purge_cache_timeout = 3600
id_provider = ldap
auth_provider = ldap
ldap_uri = ldap://authserv.test
ldap_search_base = dc=test
ldap_user_search_base = ou=People,dc=test
ldap_group_search_base = ou=Group,dc=test

So in the end, any ideas or suggestions on how to improve the situation? Of
course I'm willing to debug/test this more if needed, as the current
situation is almost disastrous.

Cheers,
 - Sami

PS. A quick test on Fedora 19 with sssd-1.11.1-4.fc19 ran the same queries
in 7 seconds or less, so apparently some progress in performance has been
made. Any idea when the RHEL6 sssd will be rebased? I tried to compile the
latest git version on RHEL6 but couldn't find all the required components
(for example: configure: error: you must have the cifsidmap header
installed to build the idmap plugin).