Hi, I wrote up the proposed design of the performance enhancement I'm working on for 1.14 here: https://fedorahosted.org/sssd/wiki/DesignDocs/OneFourteenPerformanceImprovements
For your convenience, the text of the page is also copied below. Please note that there is one open question.

= Feature Name =
SSSD Performance enhancements for the 1.14 release

Related ticket(s):
 * https://fedorahosted.org/sssd/ticket/2602
 * https://fedorahosted.org/sssd/ticket/2062

=== Problem statement ===
At the moment SSSD doesn't perform well in large environments. Most of the use-cases reported to us revolved around logins of users who are members of large groups or of a large number of groups. Another reported use-case was the time it takes to resolve a large group. While workarounds are available for some of the issues (such as using `ignore_group_members` for resolution of large groups), our goal is to perform well without these workarounds.

=== Use cases ===
 * A user who is a member of a large number of AD groups logs in to a Linux server that is a member of the AD domain.
 * A user who is a member of a large number of AD or IPA groups logs in to a Linux server that is a member of an IPA domain with a trust relationship to an AD domain.
 * An administrator of a Linux server runs "ls -l" in a directory where files are owned by a large group. An example would be a group called "students" in a university setup.

=== Overview of the solution ===
During performance analysis with systemtap, we found out that the biggest delay happens when SSSD writes an entry to the cache, especially for large group entries. This is also confirmed by empirical evidence from our users, where most deployments were OK with SSSD performance once the cache was moved to tmpfs or when the `ignore_group_members` option was enabled.

We can't skip cache writes completely, even if no attributes changed, because we also store the expiration timestamps in the cache. Also, even if only a single attribute (like the timestamp) changes, ldb needs to unpack the whole entry, change the record, pack it back and then write the whole blob.
In order to mitigate the costly cache writes, we should stop writing the whole cache entry on every cache update and only write an entry if something actually changed. To achieve this, we will split the monolithic ldb file representing the sysdb cache into two ldb files. One would contain the entry itself and would be fully synchronous. The other (new) one would only contain the timestamps and would be opened with the `LDB_FLG_NOSYNC` flag to avoid synchronous cache writes. This would have two advantages:
 1. If we detect that the entry hasn't changed on the LDAP server at all, we could avoid writing into the main ldb cache, which would still be costly. We would use the value of the `modifyTimestamp` attribute of the LDAP entry to see if the entry had changed or not.
 2. The writes to the new async ldb cache would be much faster, because the entry is smaller and because the writes wouldn't call `fsync()` due to the async flag, but would rather rely on the underlying filesystem to sync the data to the disk.

On SSSD shutdown, we would write a canary to both the timestamp cache and the main sysdb cache, denoting a graceful shutdown. On SSSD startup, if the canary wasn't found or if the values differ, we would just ditch the timestamp cache, which would result in a refresh and a write of the entry on the next lookup.

The basic idea is to use a combination of the operational `modifyTimestamp` attribute and a check of the entry itself to see if the entry changed at all and, if not, avoid writing to the main cache. Checking the value of `modifyTimestamp` would be enough for group entries, which should be the first iteration of this work. To check whether other entries (mostly users) have changed, we need to compare the values of the attributes in the cache with what we are about to store in the cache.
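
The decision described above can be sketched in a few lines. This is a minimal illustration in Python, not SSSD's C code: `Entry`, `WriteKind`, `decide_write` and `timestamp_cache_usable` are hypothetical names, and the order-insensitive attribute comparison is an assumption about what "no differences" would mean.

```python
from dataclasses import dataclass
from enum import Enum


class WriteKind(Enum):
    TIMESTAMPS_ONLY = "timestamps_only"  # touch only the nosync timestamp cache
    FULL_ENTRY = "full_entry"            # rewrite the entry in the main cache


@dataclass
class Entry:
    """Hypothetical in-memory view of a cached or freshly fetched entry."""
    modify_timestamp: str  # value of the LDAP modifyTimestamp attribute
    attrs: dict            # remaining attributes, name -> list of values


def normalized(attrs):
    """Make the comparison independent of value order (an assumption)."""
    return {name: sorted(values) for name, values in attrs.items()}


def decide_write(cached, fetched):
    """Pick the cheapest write that still keeps the cache correct."""
    if cached is None:
        # Nothing cached yet, so the whole entry has to be written.
        return WriteKind.FULL_ENTRY
    if cached.modify_timestamp == fetched.modify_timestamp:
        # The server says the entry was not modified.
        return WriteKind.TIMESTAMPS_ONLY
    # modifyTimestamp moved: fall back to comparing the attributes.
    if normalized(cached.attrs) == normalized(fetched.attrs):
        return WriteKind.TIMESTAMPS_ONLY
    return WriteKind.FULL_ENTRY


def timestamp_cache_usable(main_canary, ts_canary):
    """Startup check: both caches must hold the same shutdown canary."""
    return main_canary is not None and main_canary == ts_canary
```

If `timestamp_cache_usable()` returned false, the timestamp cache would be discarded, so the next lookup would find no cached timestamps and go through the full write path.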
Therefore, these enhancements are proposed for the 1.14 versions, sorted by importance as observed with systemtap testing:
 * only write the cache entry if the `modifyTimestamp` of the original entry has changed. If it hasn't changed, only the timestamps would be written to the timestamp cache
 * if the `modifyTimestamp` has changed, compare the attributes of the cache entry with the attributes we are about to write. If there are no differences, only write to the timestamp cache
 * refactor the nested group processing to make sure expensive lookups (such as lookups of all members of a group, of which there can potentially be thousands) are only performed once and intermediate results are stored in-memory
 * attempt to shortcut the parsing of the attributes of an entry returned from LDAP earlier. The idea behind this is that if the `modifyTimestamp` did not change, we can reuse the entry we already cached

Minor enhancements in later versions might include:
 * using syncrepl in refreshAndPersist mode in the server mode for HBAC rules and external groups. This would provide a performance benefit for legacy clients that rely on the server's HBAC rules for access control.
 * using syncrepl in refreshAndPersist mode in the server mode for external groups. This would mainly simplify the external groups handling, rather than improve performance.
 * A lot of time is spent looking up attributes in the `sysdb_attrs` array. This is something we might want to optimize after we're done with the cache writes.
 * We might even consider offering syncrepl in refreshOnly mode as a client-side option for enumeration. However, this would have to be opt-in, because every refresh causes the server to walk the changelog since the last refresh operation. Enabling this option on all clients would trash the server performance.

=== Implementation details ===
The `sysdb_ctx` already contains a handle to the main sysdb cache.
We would add another ldb file that only contains the timestamps and the DN of an entry. This ldb file would be opened in the nosync mode. Attributes used for lookups, like `dataExpireTimestamp`, must be indexed in this database as well.

When storing a user or a group to sysdb using functions like `sysdb_store_user`, we first compare the `modifyTimestamp` attribute of the cached entry with that of the entry we are about to store. If they are the same, only the timestamp attributes, such as `lastUpdate` or `dataExpireTimestamp`, would be updated in the timestamp cache. We need to do this check in the lower-level sysdb calls to make sure this enhancement also works for users and groups retrieved through the extop plugin.

If the value of `modifyTimestamp` differs, we proceed to checking the diff between the values in the cache and the values read from LDAP.

Details about the shortcut of attribute parsing will be added to this design page later.

=== Open questions ===
When SSSD switches to another server (a replica), this replica might be out-of-sync. To be on the safe side, we should erase the timestamp cache to make sure we update the entries from the server. However, if the servers are really out-of-sync, then the `modifyTimestamp` would differ and we would write the entries anyway. Therefore, I think we can always rely on the `modifyTimestamp` value.

=== Configuration changes ===
Currently no configuration changes are expected. We might add some if we decide to implement on-demand syncrepl.

=== How To Test ===
If the entries on the server did not change (except for timestamps), then actions like user and group lookups and logins should be considerably faster. SSSD should also correctly detect when the entries did in fact change on the server. In this case, a full cache write will be performed.
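
The expected behaviour can be illustrated with a toy simulation. This is only a sketch in Python under stated assumptions, not the real sysdb or ldb API: `ToySysdb`, its `store()` method, the `full_writes` counter, the 300 second timeout and the `cn=students` DN are all made up for illustration. Only the attribute names `lastUpdate` and `dataExpireTimestamp` come from the design.

```python
import time


class ToySysdb:
    """Toy stand-in for the two caches; NOT the real sysdb/ldb API."""

    def __init__(self, timeout=300):
        self.main = {}        # dn -> (modifyTimestamp, attrs), the synchronous cache
        self.timestamps = {}  # dn -> timestamps, the nosync timestamp cache
        self.timeout = timeout
        self.full_writes = 0  # counts expensive writes to the main cache

    def store(self, dn, modify_timestamp, attrs, now=None):
        now = int(time.time()) if now is None else now
        cached = self.main.get(dn)
        # A full write is needed only for a new entry, or when both the
        # modifyTimestamp and the attribute values differ from the cache.
        changed = cached is None or (
            cached[0] != modify_timestamp and cached[1] != attrs
        )
        if changed:
            self.main[dn] = (modify_timestamp, dict(attrs))
            self.full_writes += 1
        # The timestamps are refreshed on every store, changed or not.
        self.timestamps[dn] = {
            "lastUpdate": now,
            "dataExpireTimestamp": now + self.timeout,
        }


db = ToySysdb()
group = {"member": ["uid=a", "uid=b"]}
db.store("cn=students", "20160101000000Z", group, now=1000)  # first lookup
db.store("cn=students", "20160101000000Z", group, now=2000)  # unchanged on the server
db.store("cn=students", "20160301000000Z", {"member": ["uid=a"]}, now=3000)  # changed
print(db.full_writes)  # prints 2
```

A real test would measure lookup and login times against an unchanged server, but the same invariant applies: repeated lookups of unchanged entries should not increase the number of writes to the main cache.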
=== Authors ===
 * Jakub Hrozek <jhro...@redhat.com>

with the kind help of
 * Sumit Bose <sb...@redhat.com>
 * Ludwig Krispenz <lkris...@redhat.com>
 * Simo Sorce <s...@redhat.com>