[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

2017-08-22 Thread William Brown
On Tue, 2017-08-22 at 09:03 +0200, Ludwig Krispenz wrote:
> On 08/22/2017 01:31 AM, William Brown wrote:
> >>> I have a question / concern though. I thought that we want dbscan 2
> >>> ldif for emergency recovery scenarios when all else has gone bad and
> >>> assuming that id2entry is still readable. In the approach you
> >>> described we make the assumption that the parentid index is readable
> >>> as well. So we depend on two files instead of one for exporting the
> >>> database. Does this matter, or do we not care at all?
> >> There are two scenarios here in my opinion.  Backup, and emergency
> >> backup :-)  As I've previously stated: performance is important.  It
> >> should not take forever to process a 100 million entry database.  I
> >> think the tool should use multiple index files (id2entry + friends) if
> >> we can generate the LDIF faster.  But, if some of those indexes are
> >> corrupted, then we need an alternate algorithm to generate it just from
> >> id2entry.  Also, if we are dealing with a corrupted db, then performance
> >> is not important, recovery is.  So if we can do it fast, do it,
> >> otherwise grind it out.
> >>
> >> All that being said there is something we need to consider, which I
> >> don't have an answer for, and that is when databases do get corrupted
> >> which files typically get corrupted?  Is it indexes, or is it id2entry?
> >> To be honest database corruption doesn't happen very often, but the tool
> >> should be smart enough to realize that the data could be inaccurate.
> >> Perhaps a parent could be missing, etc.  So the tool should be robust
> >> enough to use multiple techniques to complete an entry, and if it can't
> >> it should log something, or better yet create a rejects file that an
> >> Admin can take and repair manually.
> >>
> >> I know this is getting more complicated, but we need to keep these
> >> things in mind.
> >>
> >> Regards,
> >> Mark
> > With the current design of id2entry and friends, we can't automatically
> > detect this so easily. I think we should really just have a flag on
> > dbscan that says "ignore everything BUT id2entry" and recover all you
> > can. We should leave this to a human to make that call.
> >
> > If our database had proper checksumming of content and pages, we could
> > detect this, but today that's not the case :(
> well, BDB has db_verify to ensure that a db file is internally consistent
> and can be processed; this should be good enough to decide if it is usable.
> But, as Mark mentioned backup: if the backup is an online backup, we cannot
> be sure that id2entry alone is sane, because backup/restore relies on
> backing up the txn logs and running recovery on restore. That said, after a
> crash we can also have the situation that pages were not flushed from the
> dbcache. Generating an LDIF from id2entry can only ever be a best effort and
> might fail in exactly the situations where it is most needed.
> 
> About the general strategy, maybe we can use another approach (yes,
> always new ideas).
> In generating the LDIF we have to solve the problem that, when cursoring
> through id2entry, child entries can come before parent entries. And we solve
> it in different places: total replication init, db2ldif, and now in a utility.
> Wouldn't it be better to make LDIF import smarter, and stack entries
> without parents until the parent is read? This would simplify the
> export and init, and the problem would be solved in one place.


It sounds like we could add a mechanism around id2entry to solve this, with
flags to use entryrdn or other related DBs to speed up the sort. That would
satisfy the request that we do this "in one place, and one
place well".


Perhaps Ilias should investigate the other places where we try to sort
by parent to get an idea of what we might need to change? 


-- 
Sincerely,

William Brown
Software Engineer
Red Hat, Australia/Brisbane





[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

2017-08-22 Thread Ludwig Krispenz


On 08/22/2017 01:31 AM, William Brown wrote:

I have a question / concern though. I thought that we want dbscan 2
ldif for emergency recovery scenarios when all else has gone bad and
assuming that id2entry is still readable. In the approach you
described we make the assumption that the parentid index is readable
as well. So we depend on two files instead of one for exporting the
database. Does this matter, or do we not care at all?

There are two scenarios here in my opinion.  Backup, and emergency
backup :-)  As I've previously stated: performance is important.  It
should not take forever to process a 100 million entry database.  I
think the tool should use multiple index files (id2entry + friends) if
we can generate the LDIF faster.  But, if some of those indexes are
corrupted, then we need an alternate algorithm to generate it just from
id2entry.  Also, if we are dealing with a corrupted db, then performance
is not important, recovery is.  So if we can do it fast, do it,
otherwise grind it out.

All that being said there is something we need to consider, which I
don't have an answer for, and that is when databases do get corrupted
which files typically get corrupted?  Is it indexes, or is it id2entry?
To be honest database corruption doesn't happen very often, but the tool
should be smart enough to realize that the data could be inaccurate.
Perhaps a parent could be missing, etc.  So the tool should be robust
enough to use multiple techniques to complete an entry, and if it can't
it should log something, or better yet create a rejects file that an
Admin can take and repair manually.

I know this is getting more complicated, but we need to keep these
things in mind.

Regards,
Mark

With the current design of id2entry and friends, we can't automatically
detect this so easily. I think we should really just have a flag on
dbscan that says "ignore everything BUT id2entry" and recover all you
can. We should leave this to a human to make that call.

If our database had proper checksumming of content and pages, we could
detect this, but today that's not the case :(
well, BDB has db_verify to ensure that a db file is internally consistent
and can be processed; this should be good enough to decide if it is usable.
But, as Mark mentioned backup: if the backup is an online backup, we cannot
be sure that id2entry alone is sane, because backup/restore relies on
backing up the txn logs and running recovery on restore. That said, after a
crash we can also have the situation that pages were not flushed from the
dbcache. Generating an LDIF from id2entry can only ever be a best effort and
might fail in exactly the situations where it is most needed.


About the general strategy, maybe we can use another approach (yes,
always new ideas).
In generating the LDIF we have to solve the problem that, when cursoring
through id2entry, child entries can come before parent entries. And we solve
it in different places: total replication init, db2ldif, and now in a utility.
Wouldn't it be better to make LDIF import smarter, and stack entries
without parents until the parent is read? This would simplify the
export and init, and the problem would be solved in one place.
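
A minimal runnable sketch of that import-side stacking, in the python-like
pseudo-code style used elsewhere in this thread. Everything named here is
hypothetical: emit() stands in for the real per-entry import step, and
parent_dn() is a naive DN split that ignores escaped commas.

from collections import defaultdict

def parent_dn(dn):
    # naive parent dn: everything after the first comma
    tail = dn.partition(",")[2]
    return tail or None

def import_parent_first(entries, suffix_dn, emit):
    # entries: iterable of dicts with a "dn" key, in whatever order
    # the LDIF has them; emit() imports a single entry
    seen = set()                  # dns already imported
    pending = defaultdict(list)   # parent dn -> children waiting for it

    def add(entry):
        emit(entry)
        seen.add(entry["dn"])
        # importing this entry may unblock children stacked behind it
        for child in pending.pop(entry["dn"], []):
            add(child)

    for entry in entries:
        p = parent_dn(entry["dn"])
        if entry["dn"] == suffix_dn or p in seen:
            add(entry)
        else:
            pending[p].append(entry)   # stack until the parent is read
    # whatever is left never found a parent; per Mark's point above,
    # this is what a rejects file would capture
    return dict(pending)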






--
Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Michael Cunningham, Michael O'Neill, Eric 
Shander



[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

2017-08-21 Thread William Brown

> >
> > I have a question / concern though. I thought that we want dbscan 2
> > ldif for emergency recovery scenarios when all else has gone bad and
> > assuming that id2entry is still readable. In the approach you
> > described we make the assumption that the parentid index is readable
> > as well. So we depend on two files instead of one for exporting the
> > database. Does this matter, or do we not care at all?
> There are two scenarios here in my opinion.  Backup, and emergency
> backup :-)  As I've previously stated: performance is important.  It
> should not take forever to process a 100 million entry database.  I
> think the tool should use multiple index files (id2entry + friends) if
> we can generate the LDIF faster.  But, if some of those indexes are
> corrupted, then we need an alternate algorithm to generate it just from
> id2entry.  Also, if we are dealing with a corrupted db, then performance
> is not important, recovery is.  So if we can do it fast, do it,
> otherwise grind it out.
> 
> All that being said there is something we need to consider, which I
> don't have an answer for, and that is when databases do get corrupted
> which files typically get corrupted?  Is it indexes, or is it id2entry? 
> To be honest database corruption doesn't happen very often, but the tool
> should be smart enough to realize that the data could be inaccurate. 
> Perhaps a parent could be missing, etc.  So the tool should be robust
> enough to use multiple techniques to complete an entry, and if it can't
> it should log something, or better yet create a rejects file that an
> Admin can take and repair manually.
> 
> I know this is getting more complicated, but we need to keep these
> things in mind.
> 
> Regards,
> Mark
> >

With the current design of id2entry and friends, we can't automatically
detect this so easily. I think we should really just have a flag on
dbscan that says "ignore everything BUT id2entry" and recover all you
can. We should leave this to a human to make that call.

If our database had proper checksumming of content and pages, we could
detect this, but today that's not the case :( 

-- 
Sincerely,

William Brown
Software Engineer
Red Hat, Australia/Brisbane





[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

2017-08-21 Thread Mark Reynolds


On 08/21/2017 08:11 AM, Ilias Stamatis wrote:
>
>
> 2017-08-17 12:27 GMT+03:00 Ludwig Krispenz:
>
> Hi,
> Ilias' proposal follows the db2ldif approach and I think it will
> work, even if it might need some tweaks to handle multiple out-of-order
> parents.
>
> Another option would be to follow the total update approach, using
> the parentid index and direct gets from id2entry.
> You start with the suffix entry and get its entryid, id_suf.
>
>
> How do we know which one is the suffix entry in order to directly
> start with it?
>  
>
> Then look up the parentid index for id_suf, get the direct
> children ids, and access them in id2entry. For each entry sent
> you check if it has children and ...
> The advantage would be to touch each entry only once; the entries will
> be in parent-first order, and you always have the dn of the parent
> when writing the children, so the dn of each child can easily be
> constructed.
>
>
> I like this, it's nice and simple.
>
> I have a question / concern though. I thought that we want dbscan 2
> ldif for emergency recovery scenarios when all else has gone bad and
> assuming that id2entry is still readable. In the approach you
> described we make the assumption that the parentid index is readable
> as well. So we depend on two files instead of one for exporting the
> database. Does this matter, or do we not care at all?
There are two scenarios here in my opinion.  Backup, and emergency
backup :-)  As I've previously stated: performance is important.  It
should not take forever to process a 100 million entry database.  I
think the tool should use multiple index files (id2entry + friends) if
we can generate the LDIF faster.  But, if some of those indexes are
corrupted, then we need an alternate algorithm to generate it just from
id2entry.  Also, if we are dealing with a corrupted db, then performance
is not important, recovery is.  So if we can do it fast, do it,
otherwise grind it out.

All that being said there is something we need to consider, which I
don't have an answer for, and that is when databases do get corrupted
which files typically get corrupted?  Is it indexes, or is it id2entry? 
To be honest database corruption doesn't happen very often, but the tool
should be smart enough to realize that the data could be inaccurate. 
Perhaps a parent could be missing, etc.  So the tool should be robust
enough to use multiple techniques to complete an entry, and if it can't
it should log something, or better yet create a rejects file that an
Admin can take and repair manually.

I know this is getting more complicated, but we need to keep these
things in mind.

Regards,
Mark
>
> Thanks.
>
>
> Ludwig
>
>
> On 08/17/2017 02:55 AM, William Brown wrote:
>> On Tue, 2017-08-15 at 22:03 +0300, Ilias Stamatis wrote:
>>> 2017-08-15 9:56 GMT+03:00 William Brown:
>>>
 On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:
> Hi everybody,
>
> Following Ludwig's and Mark's suggestions on how to perform a database
 dump
> in LDIF format from dbscan, I have come up with a strategy. I'm 
> talking
> about ticket #47567: https://pagure.io/389-ds-base/issue/47567
> 
>
> I'd like to discuss this strategy and get some feedback.
>
> The general idea is:
>
> - We are cursoring through id2entry.db printing entries in order.
> - Parents should be printed before children.
> - Hence if for some entry parentid > entryid, we have to print its 
> parent
> first (out of order) and track that we did so.
> - In order to do the above, we don't need to move the db cursor. We 
> can
> just go fetch something from a random place in the db and the cursor 
> will
> remain in its place, so we can continue from where we left off after
 we're
> done with printing a parent.
> - We also need to construct and print the DN of each entry using its 
> RDN
> and the DN of its father.
> - Let's use something like a hash map to pair entryids with dns (only 
> for
> entries that have children), e.g. map[1] = "dc=example,dc=com", map[4] 
> =
> "ou=People,dc=example,dc=com", etc.
>
> I'll present the algorithm that I came up with in python-like
 pseudo-code.
> First, the following function constructs the entry's DN and updates 
> the
> hash map if needed. We can know whether an entry is a parent or not, 
> by
 the
> presence of the numSubordinates attribute.
>
> # assumes that parentid already exists in the map
> function display_entry(e, map):
> if not e.parentid:
> e.dn = e.rdn
> else:
> e.dn = e.rdn + map[e.parentid]

[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

2017-08-21 Thread Ilias Stamatis
2017-08-17 12:27 GMT+03:00 Ludwig Krispenz:

> Hi,
> Ilias' proposal follows the db2ldif approach and I think it will work,
> even if it might need some tweaks to handle multiple out-of-order parents.
>
> Another option would be to follow the total update approach, using the
> parentid index and direct get from id2entry.
> You start with the suffix entry and get its entryid, id_suf.
>

How do we know which one is the suffix entry in order to directly start
with it?


> Then look up the parentid index for id_suf, get the direct children ids,
> and access them in id2entry. For each entry sent you check if it has
> children and ...
> The advantage would be to touch each entry only once; the entries will be in
> parent-first order, and you always have the dn of the parent when writing
> the children, so the dn of each child can easily be constructed.
>

I like this, it's nice and simple.

I have a question / concern though. I thought that we want dbscan 2 ldif
for emergency recovery scenarios when all else has gone bad and assuming
that id2entry is still readable. In the approach you described we make the
assumption that the parentid index is readable as well. So we depend on two
files instead of one for exporting the database. Does this matter, or do we
not care at all?

Thanks.


> Ludwig
>
>
> On 08/17/2017 02:55 AM, William Brown wrote:
>
> On Tue, 2017-08-15 at 22:03 +0300, Ilias Stamatis wrote:
>
> 2017-08-15 9:56 GMT+03:00 William Brown:
>
>
> On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:
>
> Hi everybody,
>
> Following Ludwig's and Mark's suggestions on how to perform a database
>
> dump
>
> in LDIF format from dbscan, I have come up with a strategy. I'm talking
> about ticket #47567: https://pagure.io/389-ds-base/issue/47567
>
> I'd like to discuss this strategy and get some feedback.
>
> The general idea is:
>
> - We are cursoring through id2entry.db printing entries in order.
> - Parents should be printed before children.
> - Hence if for some entry parentid > entryid, we have to print its parent
> first (out of order) and track that we did so.
> - In order to do the above, we don't need to move the db cursor. We can
> just go fetch something from a random place in the db and the cursor will
> remain in its place, so we can continue from where we left off after
>
> we're
>
> done with printing a parent.
> - We also need to construct and print the DN of each entry using its RDN
> and the DN of its father.
> - Let's use something like a hash map to pair entryids with dns (only for
> entries that have children), e.g. map[1] = "dc=example,dc=com", map[4] =
> "ou=People,dc=example,dc=com", etc.
>
> I'll present the algorithm that I came up with in python-like
>
> pseudo-code.
>
> First, the following function constructs the entry's DN and updates the
> hash map if needed. We can know whether an entry is a parent or not, by
>
> the
>
> presence of the numSubordinates attribute.
>
> # assumes that parentid already exists in the map
> function display_entry(e, map):
> if not e.parentid:
> e.dn = e.rdn
> else:
> e.dn = e.rdn + map[e.parentid]
> if isparent(e):
> map[e.entryid] = e.dn
> print_to_ldif_format(e)
>
> Then, the main loop:
>
> map = new(hashmap)
> printed_in_advance = []
>
> for e in entries:
> if e.entryid in printed_in_advance:
> continue # skip this, we have already displayed it
>
> if e.parentid < e.entryid:
> display_entry(e, map)
>
> else:
> # we need to display parents before children
>
> list_of_parents = []
>
> p = e
> while p.parentid > p.entryid:
> p = get_from_db(key = p.parentid)
> list_of_parents.append(p) # see note below (*)
>
> for p in reversed(list_of_parents):
> display_entry(p, map)
> printed_in_advance.append(p.entryid)
>
>
> * here we store the whole entry in the list (aka its data) and not just
> its id, in order to avoid fetching it from the database again
>
> As a map, we can use a B+ tree implementation from libsds.
>
> I would like to know how the above idea sounds to you before going ahead
> and implementing it.
>
>
> Looks like a pretty good plan to me.
>
> What happens if you have this situation?
>
> rdn: a
> entryid: 1
> parentid: 2
>
> rdn: b
> entryid: 2
> parentid: 3
>
> rdn: c
> entryid: 3
> parentid: 4
>
> etc.
>
> Imagine we have to continue to work up ids to get to the parent. Can
> your algo handle this case?
>
>
> Unless I'm mistaken, yes. Provided of course that there is a root entry
> which has no parentid.
>
> This is the part that handles this:
>
>  p = e
>  while p.parentid > p.entryid:
>  p = get_from_db(key = p.parentid)
>  list_of_parents.append(p)
>
> We may jump to a few random points in the db for this and we will have to
> keep just a few entries in memory. But I think this number of entries is
> always going to be very small, e.g. 1-2 or maybe even 7-8 in more "extreme"
> cases, but never more than let's say 15.

[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

2017-08-21 Thread Ilias Stamatis
2017-08-17 3:55 GMT+03:00 William Brown:

> On Tue, 2017-08-15 at 22:03 +0300, Ilias Stamatis wrote:
> > 2017-08-15 9:56 GMT+03:00 William Brown:
> >
> > > On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:
> > > > Hi everybody,
> > > >
> > > > Following Ludwig's and Mark's suggestions on how to perform a
> database
> > > dump
> > > > in LDIF format from dbscan, I have come up with a strategy. I'm
> talking
> > > > about ticket #47567: https://pagure.io/389-ds-base/issue/47567
> > > >
> > > > I'd like to discuss this strategy and get some feedback.
> > > >
> > > > The general idea is:
> > > >
> > > > - We are cursoring through id2entry.db printing entries in order.
> > > > - Parents should be printed before children.
> > > > - Hence if for some entry parentid > entryid, we have to print its
> parent
> > > > first (out of order) and track that we did so.
> > > > - In order to do the above, we don't need to move the db cursor. We
> can
> > > > just go fetch something from a random place in the db and the cursor
> will
> > > > remain in its place, so we can continue from where we left off after
> > > we're
> > > > done with printing a parent.
> > > > - We also need to construct and print the DN of each entry using its
> RDN
> > > > and the DN of its father.
> > > > - Let's use something like a hash map to pair entryids with dns
> (only for
> > > > entries that have children), e.g. map[1] = "dc=example,dc=com",
> map[4] =
> > > > "ou=People,dc=example,dc=com", etc.
> > > >
> > > > I'll present the algorithm that I came up with in python-like
> > > pseudo-code.
> > > >
> > > > First, the following function constructs the entry's DN and updates
> the
> > > > hash map if needed. We can know whether an entry is a parent or not,
> by
> > > the
> > > > presence of the numSubordinates attribute.
> > > >
> > > > # assumes that parentid already exists in the map
> > > > function display_entry(e, map):
> > > > if not e.parentid:
> > > > e.dn = e.rdn
> > > > else:
> > > > e.dn = e.rdn + map[e.parentid]
> > > > if isparent(e):
> > > > map[e.entryid] = e.dn
> > > > print_to_ldif_format(e)
> > > >
> > > > Then, the main loop:
> > > >
> > > > map = new(hashmap)
> > > > printed_in_advance = []
> > > >
> > > > for e in entries:
> > > > if e.entryid in printed_in_advance:
> > > > continue # skip this, we have already displayed it
> > > >
> > > > if e.parentid < e.entryid:
> > > > display_entry(e, map)
> > > >
> > > > else:
> > > > # we need to display parents before children
> > > >
> > > > list_of_parents = []
> > > >
> > > > p = e
> > > > while p.parentid > p.entryid:
> > > > p = get_from_db(key = p.parentid)
> > > > list_of_parents.append(p) # see note below (*)
> > > >
> > > > for p in reversed(list_of_parents):
> > > > display_entry(p, map)
> > > > printed_in_advance.append(p.entryid)
> > > >
> > > >
> > > > * here we store the whole entry in the list (aka its data) and not
> just
> > > > its id, in order to avoid fetching it from the database again
> > > >
> > > > As a map, we can use a B+ tree implementation from libsds.
> > > >
> > > > I would like to know how the above idea sounds to you before going
> ahead
> > > > and implementing it.
> > > >
> > >
> > > Looks like a pretty good plan to me.
> > >
> > > What happens if you have this situation?
> > >
> > > rdn: a
> > > entryid: 1
> > > parentid: 2
> > >
> > > rdn: b
> > > entryid: 2
> > > parentid: 3
> > >
> > > rdn: c
> > > entryid: 3
> > > parentid: 4
> > >
> > > etc.
> > >
> > > Imagine we have to continue to work up ids to get to the parent. Can
> > > your algo handle this case?
> > >
> >
> > Unless I'm mistaken, yes. Provided of course that there is a root entry
> > which has no parentid.
> >
> > This is the part that handles this:
> >
> >  p = e
> >  while p.parentid > p.entryid:
> >  p = get_from_db(key = p.parentid)
> >  list_of_parents.append(p)
> >
> > We may jump to a few random points in the db for this and we will have to
> > keep just a few entries in memory. But I think this number of entries is
> > always going to be very small, e.g. 1-2 or maybe even 7-8 in more
> > "extreme" cases, but never more than let's say 15. I may well be
> > completely wrong about this, so please correct me if that's the case.
> >
> >
> > > It seems like the way you could approach this is to sort the id order
> > > you need to display them in *before* you start parsing entries. We can
> > > fit allids in memory anyway, because it's just a set of 32-bit ints, so
> > > even at 50,000,000 entries, this only takes 190MB of RAM to fit allids.
> > > So I would approach this as an exercise for a comparison function to
> the
> > > set.
> > >
> > > universal_entry_set = [] # entryrdn / id2entry
> > > # The list of entries to print
> > > print_order = []
> > >
> > > for e in universal_entry_set:
> > > printer_order_insert_sorted(e)

[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

2017-08-17 Thread Ludwig Krispenz

Hi,
Ilias' proposal follows the db2ldif approach and I think it will work, 
even if it might need some tweaks to handle multiple out-of-order parents.


Another option would be to follow the total update approach, using the 
parentid index and direct get from id2entry.

You start with the suffix entry and get its entryid, id_suf.
Then look up the parentid index for id_suf, get the direct children 
ids, and access them in id2entry. For each entry sent you check if it 
has children and ...
The advantage would be to touch each entry only once; the entries will be in 
parent-first order, and you always have the dn of the parent when writing 
the children, so the dn of each child can easily be constructed.
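
A minimal sketch of that traversal in the python-like pseudo-code style of
this thread. Both helpers are hypothetical stand-ins: get_entry() is a direct
id2entry fetch by entryid, and child_ids() is a parentid index lookup.

from collections import deque

def dump_parent_first(suffix_id, get_entry, child_ids, out):
    dn_of = {}                          # entryid -> dn, kept for parents only
    queue = deque([(suffix_id, None)])  # (entryid, parent entryid)
    while queue:
        eid, pid = queue.popleft()
        entry = get_entry(eid)          # each entry is touched exactly once
        # the parent was written before us, so its dn is already known
        dn = entry["rdn"] if pid is None else entry["rdn"] + "," + dn_of[pid]
        out.write("dn: %s\n" % dn)
        # ... write the entry's remaining attributes here ...
        out.write("\n")
        kids = child_ids(eid)
        if kids:
            dn_of[eid] = dn             # only parents need to stay in the map
            for k in kids:
                queue.append((k, eid))

Every child is queued only after its parent has been written, so the output
is in parent-first order and each dn is just the rdn plus the parent's dn.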


Ludwig

On 08/17/2017 02:55 AM, William Brown wrote:

On Tue, 2017-08-15 at 22:03 +0300, Ilias Stamatis wrote:

2017-08-15 9:56 GMT+03:00 William Brown:


On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:

Hi everybody,

Following Ludwig's and Mark's suggestions on how to perform a database

dump

in LDIF format from dbscan, I have come up with a strategy. I'm talking
about ticket #47567: https://pagure.io/389-ds-base/issue/47567

I'd like to discuss this strategy and get some feedback.

The general idea is:

- We are cursoring through id2entry.db printing entries in order.
- Parents should be printed before children.
- Hence if for some entry parentid > entryid, we have to print its parent
first (out of order) and track that we did so.
- In order to do the above, we don't need to move the db cursor. We can
just go fetch something from a random place in the db and the cursor will
remain in its place, so we can continue from where we left off after

we're

done with printing a parent.
- We also need to construct and print the DN of each entry using its RDN
and the DN of its father.
- Let's use something like a hash map to pair entryids with dns (only for
entries that have children), e.g. map[1] = "dc=example,dc=com", map[4] =
"ou=People,dc=example,dc=com", etc.

I'll present the algorithm that I came up with in python-like

pseudo-code.

First, the following function constructs the entry's DN and updates the
hash map if needed. We can know whether an entry is a parent or not, by

the

presence of the numSubordinates attribute.

# assumes that parentid already exists in the map
function display_entry(e, map):
 if not e.parentid:
 e.dn = e.rdn
 else:
 e.dn = e.rdn + map[e.parentid]
 if isparent(e):
 map[e.entryid] = e.dn
 print_to_ldif_format(e)

Then, the main loop:

map = new(hashmap)
printed_in_advance = []

for e in entries:
 if e.entryid in printed_in_advance:
 continue # skip this, we have already displayed it

 if e.parentid < e.entryid:
 display_entry(e, map)

 else:
 # we need to display parents before children

 list_of_parents = []

 p = e
 while p.parentid > p.entryid:
 p = get_from_db(key = p.parentid)
 list_of_parents.append(p) # see note below (*)

 for p in reversed(list_of_parents):
 display_entry(p, map)
 printed_in_advance.append(p.entryid)


* here we store the whole entry in the list (aka its data) and not just
its id, in order to avoid fetching it from the database again

As a map, we can use a B+ tree implementation from libsds.

I would like to know how the above idea sounds to you before going ahead
and implementing it.


Looks like a pretty good plan to me.

What happens if you have this situation?

rdn: a
entryid: 1
parentid: 2

rdn: b
entryid: 2
parentid: 3

rdn: c
entryid: 3
parentid: 4

etc.

Imagine we have to continue to work up ids to get to the parent. Can
your algo handle this case?


Unless I'm mistaken, yes. Provided of course that there is a root entry
which has no parentid.

This is the part that handles this:

  p = e
  while p.parentid > p.entryid:
  p = get_from_db(key = p.parentid)
  list_of_parents.append(p)

We may jump to a few random points in the db for this and we will have to
keep just a few entries in memory. But I think this number of entries is
always going to be very small, e.g. 1-2 or maybe even 7-8 in more "extreme"
cases, but never more than let's say 15. I may well be completely
wrong about this, so please correct me if that's the case.



It seems like the way you could approach this is to sort the id order
you need to display them in *before* you start parsing entries. We can
fit allids in memory anyway, because it's just a set of 32-bit ints, so
even at 50,000,000 entries, this only takes 190MB of RAM to fit allids.
So I would approach this as an exercise for a comparison function to the
set.

universal_entry_set = [] # entryrdn / id2entry
# The list of entries to print
print_order = []

for e in universal_entry_set:
 printer_order_insert_sorted(e)

So we can imagine that this would invoke a sorting function. Something
that would work is:


[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

2017-08-16 Thread William Brown
On Tue, 2017-08-15 at 22:03 +0300, Ilias Stamatis wrote:
> 2017-08-15 9:56 GMT+03:00 William Brown :
> 
> > On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:
> > > Hi everybody,
> > >
> > > Following Ludwig's and Mark's suggestions on how to perform a database
> > dump
> > > in LDIF format from dbscan, I have come up with a strategy. I'm talking
> > > about ticket #47567: https://pagure.io/389-ds-base/issue/47567
> > >
> > > I'd like to discuss this strategy and get some feedback.
> > >
> > > The general idea is:
> > >
> > > - We are cursoring through id2entry.db printing entries in order.
> > > - Parents should be printed before children.
> > > - Hence if for some entry parentid > entryid, we have to print its parent
> > > first (out of order) and track that we did so.
> > > - In order to do the above, we don't need to move the db cursor. We can
> > > just go fetch something from a random place in the db and the cursor will
> > > remain in its place, so we can continue from where we left off after
> > we're
> > > done with printing a parent.
> > > - We also need to construct and print the DN of each entry using its RDN
> > > and the DN of its father.
> > > - Let's use something like a hash map to pair entryids with dns (only for
> > > entries that have children), e.g. map[1] = "dc=example,dc=com", map[4] =
> > > "ou=People,dc=example,dc=com", etc.
> > >
> > > I'll present the algorithm that I came up with in python-like
> > pseudo-code.
> > >
> > > First, the following function constructs the entry's DN and updates the
> > > hash map if needed. We can know whether an entry is a parent or not, by
> > the
> > > presence of the numSubordinates attribute.
> > >
> > > # assumes that parentid already exists in the map
> > > function display_entry(e, map):
> > > if not e.parentid:
> > > e.dn = e.rdn
> > > else:
> > > e.dn = e.rdn + map[e.parentid]
> > > if isparent(e):
> > > map[e.entryid] = e.dn
> > > print_to_ldif_format(e)
> > >
> > > Then, the main loop:
> > >
> > > map = new(hashmap)
> > > printed_in_advance = []
> > >
> > > for e in entries:
> > > if e.entryid in printed_in_advance:
> > > continue # skip this, we have already displayed it
> > >
> > > if e.parentid < e.entryid:
> > > display_entry(e, map)
> > >
> > > else:
> > > # we need to display parents before children
> > >
> > > list_of_parents = []
> > >
> > > p = e
> > > while p.parentid > p.entryid:
> > > p = get_from_db(key = p.parentid)
> > > list_of_parents.append(p) # see note below (*)
> > >
> > > for p in reversed(list_of_parents):
> > > display_entry(p, map)
> > > printed_in_advance.append(p.entryid)
> > >
> > >
> > > * here we store the whole entry in the list (aka its data) and not just
> > > its id, in order to avoid fetching it from the database again
> > >
> > > As a map, we can use a B+ tree implementation from libsds.
> > >
> > > I would like to know how the above idea sounds to you before going ahead
> > > and implementing it.
> > >
> >
> > Looks like a pretty good plan to me.
> >
> > What happens if you have this situation?
> >
> > rdn: a
> > entryid: 1
> > parentid: 2
> >
> > rdn: b
> > entryid: 2
> > parentid: 3
> >
> > rdn: c
> > entryid: 3
> > parentid: 4
> >
> > etc.
> >
> > Imagine we have to continue to work up ids to get to the parent. Can
> > your algo handle this case?
> >
> 
> Unless I'm mistaken, yes. Provided of course that there is a root entry
> which has no parentid.
> 
> This is the part that handles this:
> 
>  p = e
>  while p.parentid > p.entryid:
>  p = get_from_db(key = p.parentid)
>  list_of_parents.append(p)
> 
> We may jump to a few random points in the db for this and we will have to
> keep just a few entries in memory. But I think this number of entries is
> always going to be very small, e.g. 1-2 or maybe even 7-8 in more "extreme"
> cases, but never more than let's say 15. I may well be completely
> wrong about this, so please correct me if that's the case.
> 
> 
> > It seems like the way you could approach this is to sort the id order
> > you need to display them in *before* you start parsing entries. We can
> > fit allids in memory anyway, because it's just a set of 32-bit ints, so
> > even at 50,000,000 entries, this only takes 190MB of RAM to fit allids.
> > So I would approach this as an exercise for a comparison function to the
> > set.
> >
> > universal_entry_set = [] # entryrdn / id2entry
> > # The list of entries to print
> > print_order = []
> >
> > for e in universal_entry_set:
> > printer_order_insert_sorted(e)
> >
> > So we can imagine that this would invoke a sorting function. Something
> > that would work is:
> >
> > compare(e1, e2):
> > # if e1 parent is 0, return -1
> > # if e2 parent is 0, return 1
> > # If e1 is parent to e2, return -1
> > # If e2 is parent to e1, return 1

[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

2017-08-15 Thread Ilias Stamatis
2017-08-15 9:56 GMT+03:00 William Brown:

> On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:
> > Hi everybody,
> >
> > Following Ludwig's and Mark's suggestions on how to perform a database
> dump
> > in LDIF format from dbscan, I have come up with a strategy. I'm talking
> > about ticket #47567: https://pagure.io/389-ds-base/issue/47567
> >
> > I'd like to discuss this strategy and get some feedback.
> >
> > The general idea is:
> >
> > - We are cursoring through id2entry.db printing entries in order.
> > - Parents should be printed before children.
> > - Hence if for some entry parentid > entryid, we have to print its parent
> > first (out of order) and track that we did so.
> > - In order to do the above, we don't need to move the db cursor. We can
> > just go fetch something from a random place in the db and the cursor will
> > remain in its place, so we can continue from where we left off after
> we're
> > done with printing a parent.
> > - We also need to construct and print the DN of each entry using its RDN
> > and the DN of its father.
> > - Let's use something like a hash map to pair entryids with dns (only for
> > entries that have children), e.g. map[1] = "dc=example,dc=com", map[4] =
> > "ou=People,dc=example,dc=com", etc.
> >
> > I'll present the algorithm that I came up with in python-like
> pseudo-code.
> >
> > First, the following function constructs the entry's DN and updates the
> > hash map if needed. We can know whether an entry is a parent or not, by
> the
> > presence of the numSubordinates attribute.
> >
> > # assumes that parentid already exists in the map
> > function display_entry(e, map):
> > if not e.parentid:
> > e.dn = e.rdn
> > else:
> > e.dn = e.rdn + map[e.parentid]
> > if isparent(e):
> > map[e.entryid] = e.dn
> > print_to_ldif_format(e)
> >
> > Then, the main loop:
> >
> > map = new(hashmap)
> > printed_in_advance = []
> >
> > for e in entries:
> > if e.entryid in printed_in_advance:
> > continue # skip this, we have already displayed it
> >
> > if e.parentid < e.entryid:
> > display_entry(e, map)
> >
> > else:
> > # we need to display parents before children
> >
> > list_of_parents = []
> >
> > p = e
> > while p.parentid > p.entryid:
> > p = get_from_db(key = p.parentid)
> > list_of_parents.append(p) # see note below (*)
> >
> > for p in reversed(list_of_parents):
> > display_entry(p, map)
> > printed_in_advance.append(p.entryid)
> >
> >
> > * here we store the whole entry in the list (aka its data) and not just
> > its id, in order to avoid fetching it from the database again
> >
> > As a map, we can use a B+ tree implementation from libsds.
> >
> > I would like to know how the above idea sounds to you before going ahead
> > and implementing it.
> >
>
> Looks like a pretty good plan to me.
>
> What happens if you have this situation?
>
> rdn: a
> entryid: 1
> parentid: 2
>
> rdn: b
> entryid: 2
> parentid: 3
>
> rdn: c
> entryid: 3
> parentid: 4
>
> etc.
>
> Imagine we have to continue to work up ids to get to the parent. Can
> your algo handle this case?
>

Unless I'm mistaken, yes. Provided of course that there is a root entry
which has no parentid.

This is the part that handles this:

 p = e
 while p.parentid > p.entryid:
 p = get_from_db(key = p.parentid)
 list_of_parents.append(p)

We may jump to a few random points in the db for this and we will have to
keep just a few entries in memory. But I think this number of entries is
always going to be very small, e.g. 1-2 or maybe even 7-8 in more "extreme"
cases, but never more than let's say 15. I may well be completely
wrong about this, so please correct me if that's the case.
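
For what it's worth, here is a minimal runnable sketch of the pseudo-code
above, with hypothetical stand-ins: get_from_db() is a random-access id2entry
fetch and out is the LDIF output. Two details it makes explicit: keeping
printed_in_advance as a set rather than a list (so the lookup is O(1)), and
displaying the entry e itself after its stacked parents, which the pseudo-code
leaves implicit. Like the original, it assumes the last ancestor on a stacked
chain has an already-printed parent (Ludwig's note about multiple out-of-order
parents).

def display_entry(e, dn_map, out):
    # construct the dn from the rdn plus the parent's dn (root: parentid 0)
    if not e["parentid"]:
        e["dn"] = e["rdn"]
    else:
        e["dn"] = e["rdn"] + "," + dn_map[e["parentid"]]
    if e.get("numSubordinates"):        # only parents go into the map
        dn_map[e["entryid"]] = e["dn"]
    out.write("dn: %s\n\n" % e["dn"])   # stand-in for full LDIF printing

def dump(entries, get_from_db, out):
    # entries: cursor over id2entry in entryid order
    dn_map = {}
    printed_in_advance = set()
    for e in entries:
        if e["entryid"] in printed_in_advance:
            continue
        if e["parentid"] < e["entryid"]:
            display_entry(e, dn_map, out)
        else:
            # stack the out-of-order ancestors, then print them top-down
            parents = []
            p = e
            while p["parentid"] > p["entryid"]:
                p = get_from_db(p["parentid"])
                parents.append(p)
            for p in reversed(parents):
                display_entry(p, dn_map, out)
                printed_in_advance.add(p["entryid"])
            display_entry(e, dn_map, out)   # e itself still has to be printed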


> It seems like the way you could approach this is to sort the id order
> you need to display them in *before* you start parsing entries. We can
> fit allids in memory anyway, because it's just a set of 32-bit ints, so
> even at 50,000,000 entries, this only takes 190MB of RAM to fit allids.
> So I would approach this as an exercise for a comparison function to the
> set.
>
> universal_entry_set = [] # entryrdn / id2entry
> # The list of entries to print
> print_order = []
>
> for e in universal_entry_set:
> printer_order_insert_sorted(e)
>
> So we can imagine that this would invoke a sorting function. Something
> that would work is:
>
> compare(e1, e2):
> # if e1 parent is 0, return -1
> # if e2 parent is 0, return 1
> # If e1 is parent to e2, return -1
> # If e2 is parent to e1, return 1
> # return compare of eid.
>
> Then this would create a list in order of parent relationships, and you
> can just do:
>
> for e in print_order:
>
> Which despite jumping about in the cursor, will print in order.
>
> So you'll need to look at entryrdn once, and then id2entry once for
> this.
>
> If you want to do this without entryrdn, you'll need to pass id2entry twice.

[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

2017-08-15 Thread William Brown
On Tue, 2017-08-15 at 16:56 +1000, William Brown wrote:
> On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:
> > Hi everybody,
> > 
> > Following Ludwig's and Mark's suggestions on how to perform a database dump
> > in LDIF format from dbscan, I have come up with a strategy. I'm talking
> > about ticket #47567: https://pagure.io/389-ds-base/issue/47567
> > 
> > I'd like to discuss this strategy and get some feedback.
> > 
> > The general idea is:
> > 
> > - We are cursoring through id2entry.db printing entries in order.
> > - Parents should be printed before children.
> > - Hence if for some entry parentid > entryid, we have to print its parent
> > first (out of order) and track that we did so.
> > - In order to do the above, we don't need to move the db cursor. We can
> > just go fetch something from a random place in the db and the cursor will
> > remain in its place, so we can continue from where we left off after we're
> > done with printing a parent.
> > - We also need to construct and print the DN of each entry using its RDN
> > and the DN of its father.
> > - Let's use something like a hash map to pair entryids with dns (only for
> > entries that have children), e.g. map[1] = "dc=example,dc=com", map[4] =
> > "ou=People,dc=example,dc=com", etc.
> > 
> > I'll present the algorithm that I came up with in python-like pseudo-code.
> > 
> > First, the following function constructs the entry's DN and updates the
> > hash map if needed. We can know whether an entry is a parent or not, by the
> > presence of the numSubordinates attribute.
> > 
> > # assumes that parentid already exists in the map
> > function display_entry(e, map):
> > if not e.parentid:
> > e.dn = e.rdn
> > else:
> > e.dn = e.rdn + map[e.parentid]
> > if isparent(e):
> > map[e.entryid] = e.dn
> > print_to_ldif_format(e)
> > 
> > Then, the main loop:
> > 
> > map = new(hashmap)
> > printed_in_advance = []
> > 
> > for e in entries:
> > if e.entryid in printed_in_advance:
> > continue # skip this, we have already displayed it
> > 
> > if e.parentid < e.entryid:
> > display_entry(e, map)
> > 
> > else:
> > # we need to display parents before children
> > 
> > list_of_parents = []
> > 
> > p = e
> > while p.parentid > p.entryid:
> > p = get_from_db(key = p.parentid)
> > list_of_parents.append(p) # see note below (*)
> > 
> > for p in reversed(list_of_parents):
> > display_entry(p, map)
> > printed_in_advance.append(p.entryid)
> > 
> > 
> > * here we store the whole entry in the list (aka its data) and not just
> > its id, in order to avoid fetching it from the database again
> > 
> > As a map, we can use a B+ tree implementation from libsds.
> > 
> > I would like to know how the above idea sounds to you before going ahead
> > and implementing it.
> > 
> 
> Looks like a pretty good plan to me.
> 
> What happens if you have this situation?
> 
> rdn: a
> entryid: 1
> parentid: 2
> 
> rdn: b
> entryid: 2
> parentid: 3
> 
> rdn: c
> entryid: 3
> parentid: 4
> 
> etc.
> 
> Imagine we have to continue to work up ids to get to the parent. Can
> your algo handle this case? 
> 
> It seems like the way you could approach this is to sort the id order
> you need to display them in *before* you start parsing entries. We can
> fit allids in memory anyway, because it's just a set of 32-bit ints, so
> even at 50,000,000 entries, this only takes 190MB of RAM to fit allids.
> So I would approach this as an exercise for a comparison function to the
> set.
> 
> universal_entry_set = [] # entryrdn / id2entry
> # The list of entries to print
> print_order = []
> 
> for e in universal_entry_set:
> printer_order_insert_sorted(e)
> 
> So we can imagine that this would invoke a sorting function. Something
> that would work is:
> 
> compare(e1, e2):
> # if e1 parent is 0, return -1
> # if e2 parent is 0, return 1
> # If e1 is parent to e2, return -1
> # If e2 is parent to e1, return 1
> # return compare of eid.
> 
> Then this would create a list in order of parent relationships, and you
> can just do:
> 
> for e in print_order:
> 
> Which despite jumping about in the cursor, will print in order.
> 
> So you'll need to look at entryrdn once, and then id2entry once for
> this. 
> 
> If you want to do this without entryrdn, you'll need to pass id2entry
> twice. But you'll only need 1 entry in memory at a time to achieve it I
> think. It also doesn't matter about order of ids at all here.
> 
> You could easily demo this algo in py with entry objects and implement
> __cmp__, then do "sorted(list)" on it.
> 
> For the C variant, you would give the cmp function to the btree map, and
> then as it inserts, it orders them, then you can just do
> "sds_bptree_map(print_fn)" and it will apply print to each entry in the
> order.
> 
> Hope that helps,
> 
> 
> 

[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

2017-08-14 Thread William Brown
On Fri, 2017-08-11 at 17:49 +0300, Ilias Stamatis wrote:
> Hi everybody,
> 
> Following Ludwig's and Mark's suggestions on how to perform a database dump
> in LDIF format from dbscan, I have come up with a strategy. I'm talking
> about ticket #47567: https://pagure.io/389-ds-base/issue/47567
> 
> I'd like to discuss this strategy and get some feedback.
> 
> The general idea is:
> 
> - We are cursoring through id2entry.db printing entries in order.
> - Parents should be printed before children.
> - Hence if for some entry parentid > entryid, we have to print its parent
> first (out of order) and track that we did so.
> - In order to do the above, we don't need to move the db cursor. We can
> just go fetch something from a random place in the db and the cursor will
> remain in its place, so we can continue from where we left off after we're
> done with printing a parent.
> - We also need to construct and print the DN of each entry using its RDN
> and the DN of its father.
> - Let's use something like a hash map to pair entryids with dns (only for
> entries that have children), e.g. map[1] = "dc=example,dc=com", map[4] =
> "ou=People,dc=example,dc=com", etc.
> 
> I'll present the algorithm that I came up with in python-like pseudo-code.
> 
> First, the following function constructs the entry's DN and updates the
> hash map if needed. We can know whether an entry is a parent or not, by the
> presence of the numSubordinates attribute.
> 
> # assumes that parentid already exists in the map
> function display_entry(e, map):
> if not e.parentid:
> e.dn = e.rdn
> else:
> e.dn = e.rdn + map[e.parentid]
> if isparent(e):
> map[e.entryid] = e.dn
> print_to_ldif_format(e)
> 
> Then, the main loop:
> 
> map = new(hashmap)
> printed_in_advance = []
> 
> for e in entries:
> if e.entryid in printed_in_advance:
> continue # skip this, we have already displayed it
> 
> if e.parentid < e.entryid:
> display_entry(e, map)
> 
> else:
> # we need to display parents before children
> 
> list_of_parents = []
> 
> p = e
> while p.parentid > p.entryid:
> p = get_from_db(key = p.parentid)
> list_of_parents.append(p) # see note below (*)
> 
> for p in reversed(list_of_parents):
> display_entry(p, map)
> printed_in_advance.append(p.entryid)
> 
> 
> * here we store the whole entry in the list (aka its data) and not just
> its id, in order to avoid fetching it from the database again
> 
> As a map, we can use a B+ tree implementation from libsds.
> 
> I would like to know how the above idea sounds to you before going ahead
> and implementing it.
> 

Looks like a pretty good plan to me.

What happens if you have this situation?

rdn: a
entryid: 1
parentid: 2

rdn: b
entryid: 2
parentid: 3

rdn: c
entryid: 3
parentid: 4

etc.

Imagine we have to continue to work up ids to get to the parent. Can
your algo handle this case? 

It seems like the way you could approach this is to sort the id order
you need to display them in *before* you start parsing entries. We can
fit allids in memory anyway, because it's just a set of 32-bit ints, so
even at 50,000,000 entries, this only takes 190MB of RAM to fit allids.
So I would approach this as an exercise for a comparison function to the
set.

universal_entry_set = [] # entryrdn / id2entry
# The list of entries to print
print_order = []

for e in universal_entry_set:
printer_order_insert_sorted(e)

So we can imagine that this would invoke a sorting function. Something
that would work is:

compare(e1, e2):
# if e1 parent is 0, return -1
# if e2 parent is 0, return 1
# If e1 is parent to e2, return -1
# If e2 is parent to e1, return 1
# return compare of eid.

Then this would create a list in order of parent relationships, and you
can just do:

for e in print_order:

Which despite jumping about in the cursor, will print in order.

So you'll need to look at entryrdn once, and then id2entry once for
this. 

If you want to do this without entryrdn, you'll need to pass id2entry
twice. But you'll only need 1 entry in memory at a time to achieve it I
think. It also doesn't matter about order of ids at all here.

You could easily demo this algo in py with entry objects and implement
__cmp__, then do "sorted(list)" on it.
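
A quick sketch of that py demo, using functools.cmp_to_key in place of
__cmp__, with hypothetical Entry objects and the reversed-chain example from
earlier in the thread (adapted so that eid 3 is the root, parentid 0). One
caveat: comparing only direct parent links as sketched above is not a
transitive ordering, so this version keys on tree depth instead, which still
guarantees that every parent sorts before its children and is safe to hand
to sorted().

from functools import cmp_to_key

class Entry:
    def __init__(self, eid, parentid, rdn):
        self.eid, self.parentid, self.rdn = eid, parentid, rdn

entries = [Entry(1, 2, "a"), Entry(2, 3, "b"), Entry(3, 0, "c")]
by_id = {e.eid: e for e in entries}

depth = {}   # entryid -> distance from the root
def depth_of(e):
    if e.eid not in depth:
        depth[e.eid] = 0 if e.parentid == 0 else 1 + depth_of(by_id[e.parentid])
    return depth[e.eid]

def compare(e1, e2):
    # parents sort before children; equal depth falls back to eid order
    return (depth_of(e1) - depth_of(e2)) or (e1.eid - e2.eid)

print_order = sorted(entries, key=cmp_to_key(compare))
assert [e.rdn for e in print_order] == ["c", "b", "a"]

With depths precomputed like this, key=lambda e: (depth_of(e), e.eid) would
do the same job without cmp_to_key.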

For the C variant, you would give the cmp function to the btree map, and
then as it inserts, it orders them, then you can just do
"sds_bptree_map(print_fn)" and it will apply print to each entry in the
order.

Hope that helps,


-- 

Sincerely,

William Brown
Software Engineer
Red Hat, Australia/Brisbane





[389-devel] Re: Strategy proposal for making DB dump in LDIF format from dbscan

2017-08-11 Thread Ilias Stamatis
I'm sorry, I'm not sure whether the code is properly formatted so I put it
here as well: https://paste.fedoraproject.org/paste/vwjZFK4icUZF4YFg44EoEw

Thanks,