Re: Update about subtree problems

2010-07-20 Thread Howard Chu

Emmanuel Lecharny wrote:

   On 7/20/10 7:46 PM, Howard Chu wrote:

Emmanuel Lecharny wrote:

Thinking about my previous mail, I may have an idea :

what if we don't store the subentry's DN into the selected entries
operational attributes, but the subentry's entryUUID ? This value
*never* changes, we have an index on it, we can also easily build a
cache for it.

It will solve the problems I mentionned with the move and rename
operations, I think.

thoughts ?


Sounds good. Pretty sure that AD uses UUID references everywhere too,
for similar reasons.

I've suggested a few times in the past that LDAP needs a
UUIDAndProvisionalName syntax. The UUID would always be present; the
DN may be present but may or may not be correct/up to date. (Of
course, it may seem senseless to carry a DN around if it isn't always
going to be correct, but the point is to reserve a space to put it for
when you do know the correct name, that's all.)

Funny enough, I was also thinking about a similat solution, but I would
have named it something like UUIDOrProvisionalName :) So that you can
store either an UUID or a DN.

The problem now is that the existing AT subschemaSubentry and
collectiveAttributeSubentries have a DN syntax... So I had to create two
new AT to store an UUID.

May be the RFC 3671/2 need an update ?


I think the general idea, at least for AD, is that it's only an internal 
design feature. From the user perspective, all you see is the DN. Internally, 
you carry the entry UUID and you only look it up in your index when you need 
to return the DN to a client.


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


Re: Update about subtree problems

2010-07-20 Thread Emmanuel Lecharny

 On 7/20/10 7:46 PM, Howard Chu wrote:

Emmanuel Lecharny wrote:

   Thinking about my previous mail, I may have an idea :

what if we don't store the subentry's DN into the selected entries
operational attributes, but the subentry's entryUUID ? This value
*never* changes, we have an index on it, we can also easily build a
cache for it.

It will solve the problems I mentionned with the move and rename
operations, I think.

thoughts ?

Sounds good. Pretty sure that AD uses UUID references everywhere too, 
for similar reasons.


I've suggested a few times in the past that LDAP needs a 
UUIDAndProvisionalName syntax. The UUID would always be present; the 
DN may be present but may or may not be correct/up to date. (Of 
course, it may seem senseless to carry a DN around if it isn't always 
going to be correct, but the point is to reserve a space to put it for 
when you do know the correct name, that's all.)
Funny enough, I was also thinking about a similat solution, but I would 
have named it something like UUIDOrProvisionalName :) So that you can 
store either an UUID or a DN.


The problem now is that the existing AT subschemaSubentry and 
collectiveAttributeSubentries have a DN syntax... So I had to create two 
new AT to store an UUID.


May be the RFC 3671/2 need an update ?


--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com



Re: Update about subtree problems

2010-07-20 Thread Howard Chu

Emmanuel Lecharny wrote:

   Thinking about my previous mail, I may have an idea :

what if we don't store the subentry's DN into the selected entries
operational attributes, but the subentry's entryUUID ? This value
*never* changes, we have an index on it, we can also easily build a
cache for it.

It will solve the problems I mentionned with the move and rename
operations, I think.

thoughts ?

Sounds good. Pretty sure that AD uses UUID references everywhere too, for 
similar reasons.


I've suggested a few times in the past that LDAP needs a 
UUIDAndProvisionalName syntax. The UUID would always be present; the DN may be 
present but may or may not be correct/up to date. (Of course, it may seem 
senseless to carry a DN around if it isn't always going to be correct, but the 
point is to reserve a space to put it for when you do know the correct name, 
that's all.)


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


Re: Update about subtree problems

2010-07-20 Thread Emmanuel Lecharny
 Ok, let's think more in details about the last idea (using entryUUID 
instead of DN to refer to subentries). We may also need an index to map 
 ( so that we can quickly find the 
impacted entries if we modify a subentry. We may need two caches :

- one for subentries : 
- one for APs : 

The first cache is used when we need to quickly get a subentry
The second cache is used when we move an entry, to know if it move from 
one AP to another one.


Add operation :
1) we are adding a normal entry
 if it's under some APs, we evaluate it against all the associated 
subentries, and when it evaluates to true, we store the subentries 
entryUUID in the entry. We also update the  index.

2) we are adding a subentry
 for each entries under the associated AP, see if they are part of the 
SS, and if so, inject the subentryUUID into the entry. Update the E> index.


Delete operation :
1) we delete a normal entry
 nothing special to do
2) we delete a subentry
 we have to modify all the referencing entries and remove the subentry 
UUID from them. We use the  index to get this list of entries. 
Then we can delete the subentry and update the  index. One slight 
problem though : we should not be able to process any other modification 
on a subentry while it's being deleted.


Rename operation :
1) We rename a normal entry
 this entry may have children, and some of those children may be AP or 
subentries. However, as we are using entryUUID to reference a subentry, 
this operation has no impact on the entries content.

2) We rename a subentry
 assuming that the subentry will remain in the same AP, we will just 
have to update the caches. Nothing else is needed.


Move operation
1) We move a normal entry
 this entry might have APs in some of its children. They won't be 
impacted. However, as we may move the subtree  from  one AP area to 
another one, we will have to check two things :
  - first we have to list all the subentries the moved entries are 
referring to (we use the AP and subentry cache for that). this can be 
done once for the moved entry, as all the children will share the same 
subentries.
  - second, we have to find the new subentries the children will depend 
upon once moved.
 Now, we can remove the subentry references from the first set not 
present in the second in each selected entry (using the  index), 
add the new subentry references into the selected entries (and we have 
to evaluate all the entries one by one), updating the  index at 
the same time, and last, delete all the references from the old 
subentries, updating the  index.

2) We move a subentry
  Moving the subentry is a simpler operation : we update all the 
entries containg a reference to the moved subentry (using the  
index to find them), and add a new reference to the moved subentry into 
all the selected entries. It's a delete and add operation, in fact, 
applied to the subentry.

3) We move an AP
  Moving an AP is like moving a normal entry : if we have APs higher in 
the tree (ie closer to the rootDSE than the moved AP), and if we don't 
inherit from the same AP, we then have to udpate the children entries by 
removing the reference to the old subentries and add references to the 
new subentries.


Last operation, not the simplest
Modify operation
1) We modify a normal entry
 No impact if we don't try to modify the references to the subentries 
(something only an admin can do...)

2) We modify a normal entry to make it a subentry
 It's like a subentry creation.
3) We modify a subentry
 Depending on what we modify in the subentry, the consequences will be 
different :

  - if we modify the cn, it has no impact
  - if we modify the OC, for instance by removing the "subentry' 
element, then it's a deletion operation. Anything else has no impact
  - if we modify the subtreeSpecifiction, we can consider that it's 
like a deletion followed by an addition, as we have to reevaluate and 
update all the selected entries

4) We modify a subentry to make it a normal entry
 It's a deletion operation (see case (3), second modification
5) We modify an AP to make it a normal entry
 It should be forbidden if the AP has some attached subentries, 
otherwise it has no impact.



One more thing : how do we deal with aliases ?

--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com



Re: Update about subtree problems

2010-07-20 Thread Emmanuel Lecharny

 Thinking about my previous mail, I may have an idea :

what if we don't store the subentry's DN into the selected entries 
operational attributes, but the subentry's entryUUID ? This value 
*never* changes, we have an index on it, we can also easily build a 
cache for it.


It will solve the problems I mentionned with the move and rename 
operations, I think.


thoughts ?

--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com



Re: Update about subtree problems

2010-07-20 Thread Emmanuel Lecharny

 Some more thoughts...


Let's assume we keep the current implementation. What does it implies 
for a move or a rename operation ?


If we are trying to move or rename a simple entry having children, we 
may have a hug problem. Let's draft a sample, where uper case letters 
are AP and lowercase are normal entries. I will describe the hierarchy 
using a file system notation


rootDSE/
  A/
a1
a2
  B/
b1
b2/
  C/
c1
c2

Now, let's try to rename b2... It will impact the subentries under C, 
which will be renamed. So it will also impact the selected entries under 
this renamed subentry. In which order should be proceed with the 
modifications ? Currently, the move is done by simply changing one 
element in the RDN index. But renaming a subentry implies that we change 
all the associated references. We know have to check in all the b2 
children if there is an AP and some subentries, and for each of them, 
modify the operational attributes. Until this is done, we have pending 
references pointing to nowhere, because the pointed subentries have been 
renamed Problem !


It's exactly the same for a move operation...

Not exactly a nice situation :/


--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com



Re: Update about subtree problems

2010-07-20 Thread Emmanuel Lécharny

 On 7/20/10 10:53 AM, Alex Karasulu wrote:

On Tue, Jul 20, 2010 at 11:15 AM, Emmanuel Lecharnywrote:


If you have to move data in a Ldap base, User, then you have to pay the
price !



Well yes but even renames cost the same as moves if the DN is in the entry.
Someone changing an ou=People to ou=Users containing 100 Million entries
should not expect to wait hours before it completes. Plus the atomicity
issue is seriously nasty. The DN embedded into the Entry was definitely not
the way to go. In fact Kiran and Seelmann's new RDN index to replace the DN
index saved us big time making these operations atomic, faster and safer.
Faster, I don't know, but IMO, it's irrelevant. Safer, that's absolutely 
certain. But then you have the exact same problem with the subtree 
implemented approach : no atomicity, slow rename/move operations, even 
for normal entries, and no atomicity...


This is what bugs me : why should we apply a strategy for DN and the 
opposite one for subentries...



--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com



Re: Update about subtree problems

2010-07-20 Thread Emmanuel Lécharny

 On 7/20/10 10:47 AM, Alex Karasulu wrote:

Hi Emmanuel,

On Tue, Jul 20, 2010 at 2:34 AM, Emmanuel Lecharnywrote:


  Hi guys,

just an update about the current situation.

We discussed on this ML about options regarding how to handle subtrees in
the most efficient way in the server. We currently have 3 possibilities :

- the first one, which is partially implemented in trunk, consist on static
references to subentries in all the entries selected by the
subtreeSpecifications defined in each subentry.



I'm supposing you're referring to static data references in partition
Entries which reference the subentries affecting the Entry using the proper
administrative operational attributes, rather than using static Java
references in the Entry class correct?


I'm reffering to the subentry DN which are stored in the entries, as 
operational attributes, right.



- the second one suppose that we compute when needed if an entry is a
selected entry, each time we access this entry

- the third solution, which has not been discussed already, would be to
have an index which stores the list of all the selected entries associated
with each subentry.

Let's first think about the third solution (this is the one implemented by
OpenDS, btw). It's a costly one when you create a new subentry, or when you
move one, as you have to update the index with all the impacted entries.
Plus it also has a cost as you have to hit the index to get the list of
subentries an entry is part of. Of course, we can use a cache, but I'm not
sure that it will save a lot of time. In fact, AFAICT, you pay the price
twice : when creating the subentry, and each time you access any entry.

Ok, so option 3 does not seems to be reasonable. To be frank, we discussed
about t with Pierre-Arnaud but not really considered this as a decent
solution. We may be wrong though.



There might be one other option. How about adding the references to the
affecting subentries to the RDN index which replaced our old DN index?
Essentially we add an additional field to store these attributes in the RDN
value which we already have turned into a structure to hold the parent ID of
entries.
uh oh, sounds smart :) But sadly, the inference is the same : you will 
have to inject the references in *all* the RDN of *ll* the selected 
entries. It does not change the overall cost, which is still O(n).

I think this is more valuable than a separate index because first we don't
need fully bidirectional lookups. We will not be looking up entries by the
subentry DN to get a list. We will be checking to see if each entry before
returning it or modifying it is impacted by subentries. Second this RDN
index will most likely be in memory. It's so central we should have it all
cached anyway. Sort of like what AD does with it's global catalog.
Yes, we need a cache for the RDN. This will be one of the major 
improvement we can make to the server once we get 2.0 out.

So carrying this additional information in the RDN index hence the DN, will
allow us to rapidly determine which subentries impact an Entry.
see above : it will take longer to determinate if an entry is impacted, 
as you will have to check the RDN, when you already have the information 
in the op attr. Ok, only slightly longer.

Now in terms of handling administrative changes we still have the same
issue: a potentially massive numbers of updates will be unavoidable on the
RDN index. However maybe the heavy caching here might help speed things up.
No, because you still have to flush the data on disk. The problem is the 
same : O(n) operation, whatever we do.

Don't if this is the case but Emmanuel your tests on JDBM performance with
writes and caches etc might help here.

Maybe this is a horrible solution but I don't know just a brain dump.


I don't know how to transform a O(n) problem on O(1) either ;)



Regarding option 1 and 2, none is perfect.

Let's see the pros and cons for each of them

Solution 1 (update when manipulating the subentry)
--
pros :
- *very* efficient operations on normal entries, as they already have a
reference to all the subentries that select them.
- no extra cost when doing a rename, a modify, an add or a delete of a
normal entry.
- all the subentries are stored in a cache loaded at startup, no need to
fetch an index to get a subentry
- evaluating if an entry is selected is done when the subentry is added or
moved
- move operations applied on normal entries are rare
- any operation applied on a subentry are considered as admin operation,
thus unfrequent and scheduled. Downtime is under control.

cons :
- *very* costly add/delete/move/rename operations applied on a subentry, as
every selected entries will have to be modified
- *very* costly move operation of a normal entry if it moves from one AP to
another one, and when it has many children, as we have to update all the
children twice : first to reove the old subentries' reference, the second
one to add the

Re: Update about subtree problems

2010-07-20 Thread Alex Karasulu
On Tue, Jul 20, 2010 at 11:15 AM, Emmanuel Lecharny wrote:

>  Hi Howard,
>
> On 7/20/10 9:29 AM, Howard Chu wrote:Some side note :
>
>  after having done some perf tests on the evaluator, and applied some
>>> improvement, I can tell that depending on the number of subentries an
>>> entry is depending on, the cost of this evaluation can goes up to 50% of
>>> the search itself cost - not counting the network layer -. For instance,
>>> evaluating a subtreeSpecification with a min and a max, no chop, will be
>>> done up to 1 000 000 times per second on a 3 level DN (this is all
>>> dependent on the DN size)
>>>
>>
>> IMO, the considerations here are the same as for the O(1) rename. I.e.,
>> when you remove the entryDN from the entry in the DB, you have to calculate
>> the DN on the fly, and it certainly is a frequently referenced datum. You
>> make this cheap by caching the entryDN in memory, and it's very clear when a
>> cached DN must be invalidated - most of the time the cached value will not
>> change.
>>
> The DN cache is most certainly needed for faster operations. Building a DN
> on the fly for every entry is one of the most costly operation, so if we can
> speed it up with a cache, it's a net gain. Having the DN in the entry OTOH
> is not necessary a big gain : you still have to deserialize it if it's not
> in cache, and this is also costly.
>
> Obviously, all those considerations fell in a big dark hole if you have a
> decent entry cache, as the entries in memory already store the full DN...
> Any modification like a rename or a move will of course invalidate the
> entries in this cache.
>
> All in all, most of the case, you don't have to do all those
> computations...
>
> Regarding the subtree handling, it's different, as you can't spare the
> entry evaluation if the entries don't contain the reference to the subentry
> they depend upon. This evaluation can be costly, up to a point it's more
> expensive than fetching the entry itself.
>
> The rational being the choice I made 3 years ago (and which was reverted)
> to put the DN into the entry was just to speed up any search by avoiding
> costly computation at a price of costly unfrequent operations like Move or
> Rename (MODDN).
>
> If you have to move data in a Ldap base, User, then you have to pay the
> price !
>
>
Well yes but even renames cost the same as moves if the DN is in the entry.
Someone changing an ou=People to ou=Users containing 100 Million entries
should not expect to wait hours before it completes. Plus the atomicity
issue is seriously nasty. The DN embedded into the Entry was definitely not
the way to go. In fact Kiran and Seelmann's new RDN index to replace the DN
index saved us big time making these operations atomic, faster and safer.

-- 
Alex Karasulu
My Blog :: http://www.jroller.com/akarasulu/
Apache Directory Server :: http://directory.apache.org
Apache MINA :: http://mina.apache.org
To set up a meeting with me: http://tungle.me/AlexKarasulu


Re: Update about subtree problems

2010-07-20 Thread Alex Karasulu
Hi Emmanuel,

On Tue, Jul 20, 2010 at 2:34 AM, Emmanuel Lecharny wrote:

>  Hi guys,
>
> just an update about the current situation.
>
> We discussed on this ML about options regarding how to handle subtrees in
> the most efficient way in the server. We currently have 3 possibilities :
>
> - the first one, which is partially implemented in trunk, consist on static
> references to subentries in all the entries selected by the
> subtreeSpecifications defined in each subentry.
>
>
I'm supposing you're referring to static data references in partition
Entries which reference the subentries affecting the Entry using the proper
administrative operational attributes, rather than using static Java
references in the Entry class correct?


> - the second one suppose that we compute when needed if an entry is a
> selected entry, each time we access this entry
>
> - the third solution, which has not been discussed already, would be to
> have an index which stores the list of all the selected entries associated
> with each subentry.
>
> Let's first think about the third solution (this is the one implemented by
> OpenDS, btw). It's a costly one when you create a new subentry, or when you
> move one, as you have to update the index with all the impacted entries.
> Plus it also has a cost as you have to hit the index to get the list of
> subentries an entry is part of. Of course, we can use a cache, but I'm not
> sure that it will save a lot of time. In fact, AFAICT, you pay the price
> twice : when creating the subentry, and each time you access any entry.
>
> Ok, so option 3 does not seems to be reasonable. To be frank, we discussed
> about t with Pierre-Arnaud but not really considered this as a decent
> solution. We may be wrong though.
>
>
There might be one other option. How about adding the references to the
affecting subentries to the RDN index which replaced our old DN index?
Essentially we add an additional field to store these attributes in the RDN
value which we already have turned into a structure to hold the parent ID of
entries.

I think this is more valuable than a separate index because first we don't
need fully bidirectional lookups. We will not be looking up entries by the
subentry DN to get a list. We will be checking to see if each entry before
returning it or modifying it is impacted by subentries. Second this RDN
index will most likely be in memory. It's so central we should have it all
cached anyway. Sort of like what AD does with it's global catalog.

So carrying this additional information in the RDN index hence the DN, will
allow us to rapidly determine which subentries impact an Entry.

Now in terms of handling administrative changes we still have the same
issue: a potentially massive numbers of updates will be unavoidable on the
RDN index. However maybe the heavy caching here might help speed things up.
Don't if this is the case but Emmanuel your tests on JDBM performance with
writes and caches etc might help here.

Maybe this is a horrible solution but I don't know just a brain dump.


> Regarding option 1 and 2, none is perfect.
>
> Let's see the pros and cons for each of them
>
> Solution 1 (update when manipulating the subentry)
> --
> pros :
> - *very* efficient operations on normal entries, as they already have a
> reference to all the subentries that select them.
> - no extra cost when doing a rename, a modify, an add or a delete of a
> normal entry.
> - all the subentries are stored in a cache loaded at startup, no need to
> fetch an index to get a subentry
> - evaluating if an entry is selected is done when the subentry is added or
> moved
> - move operations applied on normal entries are rare
> - any operation applied on a subentry are considered as admin operation,
> thus unfrequent and scheduled. Downtime is under control.
>
> cons :
> - *very* costly add/delete/move/rename operations applied on a subentry, as
> every selected entries will have to be modified
> - *very* costly move operation of a normal entry if it moves from one AP to
> another one, and when it has many children, as we have to update all the
> children twice : first to reove the old subentries' reference, the second
> one to add the new references
> - Adding or removing a subentry is a non atomic operation, as we have first
> to update the selected entries one by one, and then to add/delete the
> subentry
>
>
Right we have to have some kind of atomicity control here. I wish we had
MVCC to assist us here.


> Solution 2 (evaluating when needed )
> 
> pros :
> - adding/removing/moving/renaming a subentry is a O(1) operation
> - moving normal entries selected by some subentries does not cost more than
> a standard move
> - easy to implement
>
> cons :
> - each time we access an entry (search, compare), we will have to evaluate
> the subtreeSpecifciation for all the subentries this entry is dependent on
> - we will need another cache to

Re: Update about subtree problems

2010-07-20 Thread Emmanuel Lecharny

 Hi Howard,

On 7/20/10 9:29 AM, Howard Chu wrote:Some side note :

after having done some perf tests on the evaluator, and applied some
improvement, I can tell that depending on the number of subentries an
entry is depending on, the cost of this evaluation can goes up to 50% of
the search itself cost - not counting the network layer -. For instance,
evaluating a subtreeSpecification with a min and a max, no chop, will be
done up to 1 000 000 times per second on a 3 level DN (this is all
dependent on the DN size)


IMO, the considerations here are the same as for the O(1) rename. 
I.e., when you remove the entryDN from the entry in the DB, you have 
to calculate the DN on the fly, and it certainly is a frequently 
referenced datum. You make this cheap by caching the entryDN in 
memory, and it's very clear when a cached DN must be invalidated - 
most of the time the cached value will not change. 
The DN cache is most certainly needed for faster operations. Building a 
DN on the fly for every entry is one of the most costly operation, so if 
we can speed it up with a cache, it's a net gain. Having the DN in the 
entry OTOH is not necessary a big gain : you still have to deserialize 
it if it's not in cache, and this is also costly.


Obviously, all those considerations fell in a big dark hole if you have 
a decent entry cache, as the entries in memory already store the full 
DN... Any modification like a rename or a move will of course invalidate 
the entries in this cache.


All in all, most of the case, you don't have to do all those 
computations...


Regarding the subtree handling, it's different, as you can't spare the 
entry evaluation if the entries don't contain the reference to the 
subentry they depend upon. This evaluation can be costly, up to a point 
it's more expensive than fetching the entry itself.


The rational being the choice I made 3 years ago (and which was 
reverted) to put the DN into the entry was just to speed up any search 
by avoiding costly computation at a price of costly unfrequent 
operations like Move or Rename (MODDN).


If you have to move data in a Ldap base, User, then you have to pay the 
price !


--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com



Re: Update about subtree problems

2010-07-20 Thread Howard Chu

Emmanuel Lecharny wrote:

Basically, sol.1 and sol.2 are both painful, but there is no perfect
solution anyway.

I thought about what is the best implementation all day long, and if
yesterday I was convinced that the solution 2 was better. But the
evaluation cost done on each entry every time a client does a search is
quite high (let's say up to 20% of the search cost, for simple SS). That
will not slow down the server that much, as most of the time is
consummed in the network layer right now, but this won't last.

Now I think that the current solution is may be a not so bad compromise,
assuming we don't do a lot of move operation cross APs on normal entries.

What bugs me here is that last year, for the opposite reason, the DN was
removed from the entries stored in the master table, just t be able to
do O(1) move operations. This is quite paradoxal ! I still think that
move operations are very rare, and that storing the DN into the entries
is a net gain for most of the operations, except for a move...


From another perspective, at least in the OpenLDAP case, O(1) rename 
operations were only one of the benefits of the latter approach. The other was 
the huge improvement in scalability of the DN index, which was quite bloated 
otherwise. When you see the opportunity to get more than one benefit from a 
particular approach, then it becomes a more compelling choice...



Some side note :
after having done some perf tests on the evaluator, and applied some
improvement, I can tell that depending on the number of subentries an
entry is depending on, the cost of this evaluation can goes up to 50% of
the search itself cost - not counting the network layer -. For instance,
evaluating a subtreeSpecification with a min and a max, no chop, will be
done up to 1 000 000 times per second on a 3 level DN (this is all
dependent on the DN size)


IMO, the considerations here are the same as for the O(1) rename. I.e., when 
you remove the entryDN from the entry in the DB, you have to calculate the DN 
on the fly, and it certainly is a frequently referenced datum. You make this 
cheap by caching the entryDN in memory, and it's very clear when a cached DN 
must be invalidated - most of the time the cached value will not change. You 
have potentially increased the cost of a read operation, but in practice the  
cost is zero, while the savings on the write side is significant.



Last, not least : the current implementation is really incomplete. The
Move, Rename and MoveAndRename operations are not correctly handled,
with many entries not being updated. I'm going to fix them. I have also
created a branch to play with subtree without breaking the trunk. I'm
not sure I will continue to work on this branch if a decision is  made
to keep the first solution.


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


Re: Update about subtree problems

2010-07-20 Thread Pierre-Arnaud Marcelot
Hi Emmanuel,

As we've seen together, that's not an easy subject and there's no real 
solution, just relative trade-offs.

I really believe that a good discussion on the ML is necessary in order to be 
sure about what type of operation is to be preferred in those kind of 
situations.

A community-wide consensus is really important and will help a lot simplifying 
the decisions in the future.

Regards,
Pierre-Arnaud


On 20 juil. 2010, at 01:34, Emmanuel Lecharny wrote:

> Hi guys,
> 
> just an update about the current situation.
> 
> We discussed on this ML about options regarding how to handle subtrees in the 
> most efficient way in the server. We currently have 3 possibilities :
> 
> - the first one, which is partially implemented in trunk, consist on static 
> references to subentries in all the entries selected by the 
> subtreeSpecifications defined in each subentry.
> 
> - the second one suppose that we compute when needed if an entry is a 
> selected entry, each time we access this entry
> 
> - the third solution, which has not been discussed already, would be to have 
> an index which stores the list of all the selected entries associated with 
> each subentry.
> 
> Let's first think about the third solution (this is the one implemented by 
> OpenDS, btw). It's a costly one when you create a new subentry, or when you 
> move one, as you have to update the index with all the impacted entries. Plus 
> it also has a cost as you have to hit the index to get the list of subentries 
> an entry is part of. Of course, we can use a cache, but I'm not sure that it 
> will save a lot of time. In fact, AFAICT, you pay the price twice : when 
> creating the subentry, and each time you access any entry.
> 
> Ok, so option 3 does not seems to be reasonable. To be frank, we discussed 
> about t with Pierre-Arnaud but not really considered this as a decent 
> solution. We may be wrong though.
> 
> Regarding option 1 and 2, none is perfect.
> 
> Let's see the pros and cons for each of them
> 
> Solution 1 (update when manipulating the subentry)
> --
> pros :
> - *very* efficient operations on normal entries, as they already have a 
> reference to all the subentries that select them.
> - no extra cost when doing a rename, a modify, an add or a delete of a normal 
> entry.
> - all the subentries are stored in a cache loaded at startup, no need to 
> fetch an index to get a subentry
> - evaluating if an entry is selected is done when the subentry is added or 
> moved
> - move operations applied on normal entries are rare
> - any operation applied on a subentry are considered as admin operation, thus 
> unfrequent and scheduled. Downtime is under control.
> 
> cons :
> - *very* costly add/delete/move/rename operations applied on a subentry, as 
> every selected entries will have to be modified
> - *very* costly move operation of a normal entry if it moves from one AP to 
> another one, and when it has many children, as we have to update all the 
> children twice : first to reove the old subentries' reference, the second one 
> to add the new references
> - Adding or removing a subentry is a non atomic operation, as we have first 
> to update the selected entries one by one, and then to add/delete the subentry
> 
> Solution 2 (evaluating when needed )
> 
> pros :
> - adding/removing/moving/renaming a subentry is a O(1) operation
> - moving normal entries selected by some subentries does not cost more than a 
> standard move
> - easy to implement
> 
> cons :
> - each time we access an entry (search, compare), we will have to evaluate 
> the subtreeSpecifciation for all the subentries this entry is dependent on
> - we will need another cache to lookup for the entry's related subentries 
> (this cache will work pretty much like the one we use for the NamingContexts)
> 
> 
> Basically, sol.1 and sol.2 are both painful, but there is no perfect solution 
> anyway.
> 
> I thought about what is the best implementation all day long, and if 
> yesterday I was convinced that the solution 2 was better. But the evaluation 
> cost done on each entry every time a client does a search is quite high 
> (let's say up to 20% of the search cost, for simple SS). That will not slow 
> down the server that much, as most of the time is consummed in the network 
> layer right now, but this won't last.
> 
> Now I think that the current solution is may be a not so bad compromise, 
> assuming we don't do a lot of move operation cross APs on normal entries.
> 
> What bugs me here is that last year, for the opposite reason, the DN was 
> removed from the entries stored in the master table, just t be able to do 
> O(1) move operations. This is quite paradoxal ! I still think that move 
> operations are very rare, and that storing the DN into the entries is a net 
> gain for most of the operations, except for a move...
> 
> 
> Some side note :
> after having done some