> On 27 Feb 2019, at 10:21, Rich Megginson <rmegg...@redhat.com> wrote:
>
> On 2/26/19 4:26 PM, William Brown wrote:
>>
>>> On 26 Feb 2019, at 18:32, Ludwig Krispenz <lkris...@redhat.com> wrote:
>>>
>>> Hi, I need a bit of time to read the docs and clear my thoughts, but one
>>> comment below
>>> On 02/25/2019 01:49 AM, William Brown wrote:
>>>>> On 23 Feb 2019, at 02:46, Mark Reynolds <mreyno...@redhat.com> wrote:
>>>>>
>>>>> I want to start a brief discussion about a major problem we have with
>>>>> backend transaction plugins and the entry caches.  I'm finding that when
>>>>> we get into a nested state of be txn plugins and one of the later plugins
>>>>> that is called fails, then while we don't commit the disk changes (they
>>>>> are aborted/rolled back) we DO keep the entry cache changes!
>>>>>
>>>>> For example, a modrdn operation triggers the referential integrity plugin
>>>>> which renames the member attribute in some group and changes that group's
>>>>> entry cache entry, but then later on the memberOf plugin fails for some
>>>>> reason.  The database transaction is aborted, but the entry cache changes
>>>>> that RI plugin did are still present :-(  I have also found other entry
>>>>> cache issues with modrdn and BE TXN plugins, and we know of other
>>>>> currently non-reproducible entry cache crashes as well related to
>>>>> mishandling of cache entries after failed operations.
>>>>>
>>>>> It's time to rework how we use the entry cache.  We basically need a
>>>>> transaction style caching mechanism - we should not commit any entry
>>>>> cache changes until the original operation is fully successful.
>>>>> Unfortunately, the way the entry cache is currently designed and used, it
>>>>> will be a major change to try to change it.
>>>>>
>>>>> William wrote up this doc:
>>>>> http://www.port389.org/docs/389ds/design/cache_redesign.html
>>>>>
>>>>> But this also does not currently cover the nested plugin scenario either
>>>>> (not yet).
>>>>> I do not know how difficult it would be to implement
>>>>> William's proposal, or how difficult it would be to incorporate the txn
>>>>> style caching into his design.  What kind of time frame could this even
>>>>> be implemented in?  William, what are your thoughts?
>>>> I like coffee?  How cool are planes?  My thoughts are simple :)
>>>>
>>>> I think there is a pretty simple mental simplification we can make here
>>>> though. Nested transactions “don’t really exist”. We just have *recursive*
>>>> operations inside of one transaction.
>>>>
>>>> Once reframed like that, the entire situation becomes simpler. We have one
>>>> thread in a write transaction that can have recursive/batched operations
>>>> as required, which means that either “all operations succeed” or “none
>>>> do”. Really, this is the behaviour we want anyway, and it’s the
>>>> transaction model of LMDB and other kv stores that we could consider
>>>> (WiredTiger, sled in the future).
>>> I think the recursive/nested transactions on the database level are not the
>>> problem, we do this correctly already, either all or no change becomes
>>> persistent.
>>> What we do not manage is modifications we do in parallel on the in memory
>>> structure like the entry cache, changes to the EC are not managed by any
>>> txn and I do not see how any of the database txn models would help, they do
>>> not know about ec and can abort changes.
>>> We would need to incorporate the EC into a generic txn model, or have a way
>>> to flag ec entries as garbage if a txn is aborted
>> The issue is we allow parallel writes, which breaks the consistency
>> guarantees of the EC anyway. LMDB won’t allow parallel writes (it’s single
>> writer, concurrent parallel readers), and most other modern kv stores take
>> this approach too, so we should be remodelling our transactions to match
>> this IMO. It will make the process of how we reason about the EC much much
>> simpler I think.
>
>
> Some sort of in-memory data structure with fast lookup and transactional
> semantics (modify operations are stored as mvcc/cow so each read of the
> database with a given txn handle sees its own view of the ec, a txn commit
> updates the parent txn ec view, or the global ec view if no parent, from the
> copy, a txn abort deletes the txn's copy of the ec) is needed.  A quick
> google search turns up several hits.  I'm not sure if the B+Tree proposed at
> http://www.port389.org/docs/389ds/design/cache_redesign.html has
> transactional semantics, or if such code could be added to its implementation.
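To make the semantics described above concrete, here is a minimal sketch of a layered, copy-on-write view of the entry cache. It is not the design from the cache_redesign document and not existing 389 DS code: the names EntryCache, CacheTxn, put, delete, begin, commit and abort are hypothetical, entries are plain dicts keyed by DN, and a single writer is assumed. It only shows the rule "commit folds into the parent view, abort discards the copy".

```python
# Hypothetical sketch: all class and method names are illustrative only.
_DELETED = object()  # tombstone for an entry removed inside a txn


class EntryCache:
    """Global view of the cache: what readers outside any txn see."""

    def __init__(self):
        self._entries = {}

    def get(self, dn):
        return self._entries.get(dn)

    def begin(self):
        return CacheTxn(self)

    def _apply(self, changes):
        # Called only when a top-level txn commits.
        for dn, entry in changes.items():
            if entry is _DELETED:
                self._entries.pop(dn, None)
            else:
                self._entries[dn] = entry


class CacheTxn:
    """Private view layered over a parent txn or over the global cache."""

    def __init__(self, parent):
        self._parent = parent
        self._changes = {}  # dn -> entry or _DELETED

    def get(self, dn):
        # Own changes win; otherwise fall through to the parent view.
        if dn in self._changes:
            entry = self._changes[dn]
            return None if entry is _DELETED else entry
        return self._parent.get(dn)

    def put(self, dn, entry):
        self._changes[dn] = dict(entry)  # shallow copy, parent untouched

    def delete(self, dn):
        self._changes[dn] = _DELETED

    def begin(self):
        return CacheTxn(self)  # nested (recursive) operation

    def commit(self):
        # Fold changes into the parent view (parent txn or global cache).
        self._parent._apply(self._changes)
        self._changes = {}

    def abort(self):
        # Nothing reached the parent, so dropping the layer is enough.
        self._changes = {}

    def _apply(self, changes):
        self._changes.update(changes)


# The modrdn scenario from the start of the thread.
cache = EntryCache()
setup = cache.begin()
setup.put("cn=group", {"member": ["uid=old"]})
setup.commit()

op = cache.begin()   # the modrdn operation
ri = op.begin()      # referential integrity plugin, nested
ri.put("cn=group", {"member": ["uid=new"]})
ri.commit()          # now visible to op, not to the global cache
seen_in_op = op.get("cn=group")["member"]
op.abort()           # memberOf failed: the RI change goes away too
after_abort = cache.get("cn=group")["member"]
```

In this sketch the RI plugin's change only ever lives in the operation's layer, so aborting the outer operation leaves the global view untouched. A real implementation would also need concurrent readers, eviction and memory accounting, none of which are modelled here.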
It does: the B+Tree in that design is an MVCC B+Tree implementation.

>
> With LMDB, if we could make the on-disk entry representation the same as the
> in-memory entry representation, then we could use LMDB as the entry cache too
> - the database would be the entry cache as well.

Yes, Ludwig has suggested this because it would remove the need for an Entry Cache at all.

>
>
>>
>>>>> If William's design is too huge of a change that will take too long to
>>>>> safely implement then perhaps we need to look into revising the existing
>>>>> cache design where we use "cache_add_tentative" style functions and only
>>>>> apply them at the end of the op.  This is also not a trivial change.
>>>> It’s pretty massive as a change - if we want to do it right. I’d say we
>>>> need:
>>>>
>>>> * development and testing of an MVCC/COW cache implementation (proof that
>>>> it really really works transactionally)
>>>> * allow “disable/disconnect” of the entry cache, but with the higher level
>>>> txns so that we can prove the txn semantics are correct
>>>> * re-architect our transaction calls so that they are “higher” up. An
>>>> example is that internal_modify shouldn’t start a txn, it should be given
>>>> the current txn state as an arg. Combined with the above, we can prove we
>>>> haven’t corrupted our server transaction guarantees.
>>>> * integrate the transactional cache.
>>>>
>>>> I don’t know if I would still write a transactional cache the same way as
>>>> I proposed in that design, but I think the ideas are on the right path.
>>>>
>>>>> And what impact would changing the entry cache have on Ludwig's pluggable
>>>>> backend work?
>>>> Should be none, it’s separate layers. If anything this change is going to
>>>> make Ludwig’s work better because our current model won’t really take good
>>>> advantage of the MVCC nature of modern kv stores.
>>>>
>>>>> Anyway we need to start thinking about redesigning the entry cache - no
>>>>> matter what approach we want to take.
>>>>> If anyone has any ideas or
>>>>> comments please share them, but I think due to the severity of this flaw
>>>>> redesigning the entry cache should be one of our next major goals in DS
>>>>> (1.4.1?).
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mark
>>>>> _______________________________________________
>>>>> 389-devel mailing list -- 389-devel@lists.fedoraproject.org
>>>>> To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org
>>>>> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
>>>>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>>>>> List Archives:
>>>>> https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org
>>> --
>>> Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn,
>>> Commercial register: Amtsgericht Muenchen, HRB 153243,
>>> Managing Directors: Charles Cachera, Michael Cunningham, Michael O'Neill,
>>> Eric Shander

--
Sincerely,

William Brown
Software Engineer, 389 Directory Server
SUSE Labs