Hi,

I wrote up the proposed design of the performance enhancement I'm
working on for 1.14 here:
    
https://fedorahosted.org/sssd/wiki/DesignDocs/OneFourteenPerformanceImprovements

For your convenience, the text of the page is also copied below. Please
note that there is one open question.

= Feature Name =
SSSD Performance enhancements for the 1.14 release

Related ticket(s):
 * https://fedorahosted.org/sssd/ticket/2602
 * https://fedorahosted.org/sssd/ticket/2062

=== Problem statement ===
At the moment SSSD doesn't perform well in large environments. Most of
the reported use-cases involve logins of users who are members of large
groups or of a large number of groups. Another reported use-case was the
time it takes to resolve a large group.

While workarounds are available for some of the issues (such as using
`ignore_group_members` for resolution of large groups), our goal is to be
able to perform well without these workarounds.

=== Use cases ===
 * A user who is a member of a large number of AD groups logs in to a Linux
server that is a member of the AD domain.
 * A user who is a member of a large number of AD or IPA groups logs in to a
Linux server that is a member of an IPA domain with a trust relationship to an
AD domain.
 * An administrator of a Linux server runs "ls -l" in a directory where files
are owned by a large group. An example would be a group called "students" in a
university setup.

=== Overview of the solution ===
During performance analysis with systemtap, we found out that the biggest
delay happens when SSSD writes an entry to the cache, especially for large 
group entries.
This is also confirmed by empirical evidence from our users: most
deployments were OK with SSSD performance once the cache was moved to tmpfs or
once the `ignore_group_members` option was enabled.

We can't skip cache writes completely, even if no attributes changed,
because we also store the expiration timestamps in the cache. Moreover, even if
only a single attribute (like the timestamp) changes, ldb needs to unpack the
whole entry, change the record, pack it back and then write the whole blob.

In order to mitigate the costly cache writes, we should avoid writing the
whole cache entry on every cache update and instead write the entry only if
something actually changed.

To achieve this, we will split the monolithic ldb cache representing the
sysdb cache into two ldb files. One would contain the entry itself and would
be fully synchronous. The other (new one) would only contain the timestamps
and would be opened with the `LDB_FLG_NOSYNC` flag to avoid synchronous cache
writes.

This would have two advantages:
 1. If we detect that the entry hasn't changed on the LDAP server at all, we
can avoid writing into the main ldb cache, which would still be costly. We
would use the value of the `modifyTimestamp` attribute of the LDAP entry to see
whether the entry has changed.
 1. The writes to the new async ldb cache would be much faster, because the 
entry is smaller and because the writes wouldn't call `fsync()` due to using 
the async flag, but rather rely on the underlying filesystem to sync the data 
to the disk.
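
The difference between the two write paths can be sketched as follows. This is
only an illustration in Python using plain files, not ldb: the file names and
the `write_record` helper are hypothetical, and `sync=False` merely stands in
for what opening the timestamp cache with `LDB_FLG_NOSYNC` is expected to do.

```python
import os
import tempfile


def write_record(path, payload, sync):
    """Append payload to path; call fsync() only when sync is True."""
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()                 # hand the data over to the kernel
        if sync:
            os.fsync(f.fileno())  # block until the data is on stable storage


def demo():
    """Write a large entry synchronously and a small timestamp asynchronously."""
    with tempfile.TemporaryDirectory() as d:
        main_cache = os.path.join(d, "cache.ldb")       # hypothetical file name
        ts_cache = os.path.join(d, "timestamps.ldb")    # hypothetical file name
        # A large group entry goes to the synchronous main cache.
        write_record(main_cache, b"x" * 100_000, sync=True)
        # Only a small timestamp record goes to the async cache; the
        # filesystem decides when it actually reaches the disk.
        write_record(ts_cache, b"dataExpireTimestamp=1700000000\n", sync=False)
        return os.path.getsize(main_cache), os.path.getsize(ts_cache)
```

The trade-off is that a crash can lose recent writes to the async file, which
is why that file should only hold data that is cheap to regenerate.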

On SSSD shutdown, we would write a canary to both the timestamp cache
and the main sysdb cache, denoting graceful shutdown. On SSSD startup,
if the canary wasn't found or if the values differ, we would just ditch
the timestamp cache, which would result in refresh and write of the entry
on the next lookup.
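
A minimal sketch of the canary check, assuming the two caches can be modelled
as Python dicts. The key name `canary`, the use of a random value and the
choice to remove the canary again on startup (so that a later crash is
detected) are assumptions made for illustration; none of them is fixed by this
design.

```python
import uuid


def write_canary(main_cache, ts_cache):
    """On graceful shutdown, store the same random value in both caches."""
    value = uuid.uuid4().hex
    main_cache["canary"] = value
    ts_cache["canary"] = value


def check_canary(main_cache, ts_cache):
    """On startup, keep the timestamp cache only if both canaries match.

    The canary is removed in either case (an assumption of this sketch), so
    an unclean shutdown later on leaves no canary behind.
    Returns True if the timestamp cache was kept, False if it was ditched.
    """
    main_val = main_cache.pop("canary", None)
    ts_val = ts_cache.pop("canary", None)
    if main_val is None or main_val != ts_val:
        ts_cache.clear()  # stale or untrusted: force a refresh on next lookup
        return False
    return True
```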

The basic idea is to use a combination of the operational `modifyTimestamp`
attribute and checking the entry itself to see if the entry changed at
all and if not, avoid writing to the cache.

Checking the value of `modifyTimestamp` would be enough for group entries,
which should be the first iteration of this work. For checking if other
entries (mostly users) have changed, we need to compare the value of
the attributes in the cache with what we are about to store in the cache.

Therefore, these enhancements are proposed for the 1.14 release, sorted by
importance as observed with systemtap testing:
    * only write the cache entry if the `modifyTimestamp` of the original entry
has changed. If it hasn't changed, only the timestamps would be written to the
timestamp cache.
    * if the `modifyTimestamp` has changed, compare the attributes of the cache
entry with the attributes we are about to write. If there are no differences,
only write to the timestamp cache.
    * refactor the nested group processing to make sure expensive lookups (such
as lookups of all members of a group, of which there can be thousands) are
performed only once and intermediate results are stored in memory.
    * attempt to short-circuit parsing of the attributes of the entry returned
from LDAP as early as possible. The idea behind this is that if the
`modifyTimestamp` did not change, we can reuse the entry we already cached.

Minor enhancements in later versions might include:
    * using syncrepl in the server mode for HBAC rules in refreshAndPersist
mode. This would provide a performance benefit for legacy clients that rely on
the server's HBAC rules for access control.
    * using syncrepl in the server mode for external groups in
refreshAndPersist mode. This would mainly simplify the external groups
handling, rather than improve performance.
    * A lot of time is spent looking up attributes in the `sysdb_attrs` array. 
This is something we might want to optimize after we're done with the cache 
writes.
    * We might even consider offering syncrepl in refreshOnly mode as a
client-side option for enumeration. However, this would have to be opt-in
because every refresh causes the server to walk the changelog since the last
refresh operation. Enabling this option on all clients would trash the server
performance.

=== Implementation details ===
The `sysdb_ctx` already contains a handle to the main sysdb cache. We
would add another ldb file that only contains the timestamps and the DN of
an entry. This ldb file would be opened in the nosync mode. Attributes used
for lookups, like `dataExpireTimestamp`, must be indexed in this database
as well.

When storing a user or a group to sysdb using functions like
`sysdb_store_user`, we first compare the `modifyTimestamp` attribute of the
cached entry with that of the entry read from LDAP. If the values are equal,
only the timestamp attributes, such as `lastUpdate` or `dataExpireTimestamp`,
would be updated in the timestamp cache. We need to do this check in the
lower-level sysdb calls to make sure this enhancement also works for users and
groups retrieved through the extop plugin.

If the value of `modifyTimestamp` differs, we proceed to checking the diff
between values in the cache and the values read from LDAP.
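
The store logic described above could look roughly like this. It is a sketch
in Python with both caches modelled as dicts keyed by DN; the function name
`store_entry`, its parameters and its return values are hypothetical and do
not correspond to the actual sysdb API.

```python
def store_entry(main_cache, ts_cache, dn, new_attrs, now, timeout):
    """Store an entry, writing to the main cache only if something changed.

    Returns "timestamps_only" or "full_write" so callers can see which
    path was taken.
    """
    cached = main_cache.get(dn)
    unchanged = False
    if cached is not None:
        new_mts = new_attrs.get("modifyTimestamp")
        if new_mts is not None and cached.get("modifyTimestamp") == new_mts:
            # Same modifyTimestamp: the server-side entry was not modified.
            unchanged = True
        else:
            # modifyTimestamp differs or is missing: compare the remaining
            # attributes with what we are about to store.
            old = {k: v for k, v in cached.items() if k != "modifyTimestamp"}
            new = {k: v for k, v in new_attrs.items() if k != "modifyTimestamp"}
            unchanged = old == new

    if not unchanged:
        main_cache[dn] = dict(new_attrs)  # the expensive, synchronous write

    # The timestamp cache is always updated; this is the cheap write.
    ts_cache[dn] = {"lastUpdate": now, "dataExpireTimestamp": now + timeout}
    return "timestamps_only" if unchanged else "full_write"
```

Note that in this sketch an entry whose `modifyTimestamp` changed but whose
attributes did not keeps the old `modifyTimestamp` in the main cache, so the
attribute comparison runs again on the next refresh. Whether that is
acceptable is a design choice this sketch does not settle.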

Details about the attribute-parsing shortcut will be added to this design
page later.

=== Open questions ===
When SSSD switches to another server (a replica), this replica might be
out-of-sync. To be on the safe side, we could erase the timestamp cache
to make sure we update the entries from the server. However, if the servers
are really out-of-sync, then the `modifyTimestamp` would differ and we would
write the entries anyway. Therefore, I think we can always rely on the
`modifyTimestamp` value.

=== Configuration changes ===
Currently no configuration changes are expected. We might add some if we decide 
to implement on-demand syncrepl.

=== How To Test ===
If the entries on the server did not change (except timestamps), then actions 
like user and group lookups and logins should be considerably faster.

SSSD should also correctly detect when the entries did in fact change on
the server. In this case, a full cache write will be performed.

=== Authors ===
 * Jakub Hrozek <jhro...@redhat.com> with the kind help of
 * Sumit Bose <sb...@redhat.com>
 * Ludwig Krispenz <lkris...@redhat.com>
 * Simo Sorce <s...@redhat.com>