[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

Ilias Stamatis Mon, 21 Aug 2017 05:13:08 -0700

2017-08-17 12:27 GMT+03:00 Ludwig Krispenz <lkris...@redhat.com>:

> Hi,
> Ilias' proposal follows the db2ldif approach and I think it will work,
> even if it might need some tweaks to handle multiple out of order oparents.
>
> An other option would be to follow the total update approach using the
> parentid index and direct get from id2entry.
> You start with the suffix entry and get the entryid. id_suf
>


How do we know which one is the suffix entry in order to directly start
with it?


> Then lookup the parentid index for id_suf and get the direct children ids
> and access them in the id2entry. For each entry sent you check if it has
> children and ...
> The advantage would be to touch each entry once, the entries will be in
> parent first order and you always have the dn of the parent when writing
> the children, so the dn of the children can easily be constructed
>

I like this, it's nice and simple.

I have a question / concern though. I thought that we want dbscan 2 ldif
for emergency recovery scenarios when all else has gone bad and assuming
that id2entry is still readable. In the approach you described we make the
assumption that the parentid index is readable as well. So we depend on two
files instead of one for exporting the database. Does this matter or we
don't care at all?

Thanks.


> Ludwig
>
>
> On 08/17/2017 02:55 AM, William Brown wrote:
>
> On Tue, 2017-08-15 at 22:03 +0300, Ilias Stamatis wrote:
>
> 2017-08-15 9:56 GMT+03:00 William Brown <wibr...@redhat.com> 
> <wibr...@redhat.com>:
>
>
> On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:
>
> Hi everybody,
>
> Following Ludwig's and Mark's suggestions on how to perform a database
>
> dump
>
> in LDIF format from dbscan, I have come up with a strategy. I'm talking
> about ticket #47567: https://pagure.io/389-ds-base/issue/47567
>
> I'd like to discuss this strategy and get some feedback.
>
> The general idea is:
>
> - We are cursing though id2entry.db printing entries in order.
> - Parents should be printed before children.
> - Hence if for some entry parenitd > entryid, we have to print its parent
> first (out of order) and track that we did so.
> - In order to do the above, we don't need to move the db cursor. We can
> just go fetch something from a random place in the db and the cursor will
> remain in its place, so we can continue from where we left off after
>
> we're
>
> done with printing a parent.
> - We also need to construct and print the DN of each entry using its RDN
> and the DN of its father.
> - Let's use something like a hash map to pair entryids with dns (only for
> entries that has children), e.g. map[1] = "dc=example,dc=com", map[4] =
> "ou=People,dc=example,dc=com", etc.
>
> I'll present the algorithm that I came up with in python-like
>
> pseudo-code.
>
> First, the following function constructs the entry's DN and updates the
> hash map if needed. We can know whether an entry is a parent or not, by
>
> the
>
> presence of the numSubordinates attribute.
>
> # assumes that parentid already exists in the map
> function display_entry(e, map):
>     if not e.parentid:
>         e.dn = e.rdn
>     else:
>         e.dn = e.rdn + map[e.parentid]
>     if isparent(e):
>         map[e.entryid] = e.dn
>     print_to_ldif_format(e)
>
> Then, the main loop:
>
> map = new(hashmap)
> printed_in_advance = []
>
> for e in entries:
>     if e.entryid in printed_in_advance:
>         continue # skip this, we have already displayed it
>
>     if e.parentid < e.entryid:
>         display_entry(e, map)
>
>     else:
>         # we need to display parents before children
>
>         list_of_parents = []
>
>         p = e
>         while p.parentid > p.entryid:
>             p = get_from_db(key = p.parentid)
>             list_of_parents.append(p) # see note below (*)
>
>         for p in reversed(list_of_parents):
>             display_entry(p, map)
>             printed_in_advance.append(p.entryid)
>
>
> * here we store the whole entry in the list (aka its data) and not just
> its id, in order to avoid fetching it from the database again
>
> As a map, we can use a B+ tree implementation from libsds.
>
> I would like to know how the above idea sounds to you before going ahead
> and implementing it.
>
>
> Looks like a pretty good plan to me.
>
> What happens if you have this situation?
>
> rdn: a
> entryid: 1
> parentid: 2
>
> rdn: b
> entryid: 2
> parentid: 3
>
> rdn: c
> entryid: 3
> parentid: 4
>
> etc.
>
> Imagine we have to continue to work up ids to get to the parent. Can
> your algo handle this case?
>
>
> Unless I'm mistaken, yes. Provided of course that there is a root entry
> which has no parentid.
>
> This is the part that handles this:
>
>          p = e
>          while p.parentid > p.entryid:
>              p = get_from_db(key = p.parentid)
>              list_of_parents.append(p)
>
> We may jump at a few random points in the db for this and we will have to
> keep just a few entries in memory. But I think this number of entries is
> always going to be very small e.g. 1-2 or maybe even 7-8 in more "extreme"
> cases, but never more than let's say 15. I might as well be completely
> wrong about this, so, please correct me if that's the case.
>
>
>
> It seems like the way you could approach this is to sort the id order
> you need to display them in *before* you start parsing entries. We can
> file allids in memory anyway, because it's just a set of 32bit ints, so
> even at 50,000,000 entries, this only takes 190MB of ram to fit allids.
> So I would approach this as an exercise for a comparison function to the
> set.
>
> universal_entry_set = [....] # entryrdn / id2entry
> # The list of entries to print
> print_order = []
>
> for e in universal_entry_set:
>     printer_order_insert_sorted(e)
>
> So we can imagine that this would invoke a sorting function. Something
> that would work is:
>
> compare(e1, e2):
>     # if e1 parent is 0, return -1
>     # if e2 parent is 0, return 1
>     # If e1 is parent to e2, return -1
>     # If e2 is parent to e1, return 1
>     # return compare of eid.
>
> Then this would create a list in order of parent relationships, and you
> can just do:
>
> for e in print_order:
>
> Which despite jumping about in the cursor, will print in order.
>
> So you'll need to look at entryrdn once, and then id2entry once for
> this.
>
> If you want to do this without entryrdn, you'll need to pass id2entry
> twice. But You'll only need 1 entry in memory at a time to achieve it I
> think. It also doesn't matter about order of ids at all here.
>
>
> Hmm, I just didn't understand what's wrong with what I proposed previously.
> As you said with the technique you just described we either have to pass
> both entryrdn and id2entry once, or pass id2entry twice in any case. With
> what I described previously we have to pass id2entry only once (and
> probably for a few entries we will have to skip them if we come across them
> a second time).
>
> Could you explain why do you think the first one would be better / more
> efficient?
>
>
> I think your algo doesn't handle the case where you have *multiple* outo
> of order parents. Even though you have to do two passes, the way I
> proposed will guaranteed correct display regardless of id number because
> they are solely sorted on parent relationships.
>
> I think your one will get to
>
>
>         while p.parentid > p.entryid:
>
>             p = get_from_db(key = p.parentid)
>             list_of_parents.append(p) # see note below (*)
>
>         for p in reversed(list_of_parents):
>             display_entry(p, map)
>             printed_in_advance.append(p.entryid)
>
> ^ Here, what if the parent's parent is not yet printed? display_entry
> would need to be recursive to handle this (and it appears to be
> iterative).
>
> Does that help?
>
>
>
>
> _______________________________________________
> 389-devel mailing list -- 389-devel@lists.fedoraproject.org
> To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org
>
>
> --
> Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn,
> Commercial register: Amtsgericht Muenchen, HRB 153243,
> Managing Directors: Charles Cachera, Michael Cunningham, Michael O'Neill, 
> Eric Shander
>
>
> _______________________________________________
> 389-devel mailing list -- 389-devel@lists.fedoraproject.org
> To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org
>
>

_______________________________________________
389-devel mailing list -- 389-devel@lists.fedoraproject.org
To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org

[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

Reply via email to