Re: [jr3] Flat hierarchy

2010-02-22 Thread Ian Boston

On 18 Feb 2010, at 13:04, Stefan Guggisberg wrote:

 On Wed, Feb 17, 2010 at 5:17 PM, Jukka Zitting jukka.zitt...@gmail.com 
 wrote:
 Hi,
 
 This one's a quite frequently asked feature. Currently we store the
 full list of child nodes in the same bundle or node state record with
 the parent node. This makes it expensive to support nodes with large
 amounts (1k) of child nodes.
 
 in my experience notable write-performance impacts start with 10-20k
 child nodes ;)
 10-20k is quite a lot IMO, most applications probably won't ever hit
 this 'limit'.

I am not arguing for flat hierarchies but I spend a lot of time persuading UX 
designers and management that 100k or more child nodes in not acceptable in a 
hierarchical content system.

With an instance supporting 4M identified users with a unique user name, it 
hard to argue that /home/uid is not acceptable when there are so many other 
systems that support this. (we have modified the UserManager to scale)

 
 i agree that flat hierarchies is an important feature, however we
 shouldn't compromise
 the performance for the 'standard' case with less than 1k child nodes
 just for the sake
 of supporting flat hierarchies. after all, the JCR api is inherently
 hierarchic.
 
 maybe we should provide separate node types or node type attributes to
 indicate that
 instances of  that type may have a large number of child nodes. nodes
 of that node
 type would be represented/handled differently in the persistence layer.

This is the approach that we have taken, however its not without problems when 
trying to discover whats in /home

Ian

 
 cheers
 stefan
 
 
 The obvious solution to this problem is to use a data structure like a
 B-tree to keep track of the child node entries. Doing this efficiently
 while still supporting both orderable nodes and same-name-siblings
 (i.e. nt:unstructured) probably requires some deeper thought. An added
 benefit would be the ability to do this on top of an append-only
 storage model.
 
 Any other ideas on how we could do this?
 
 BR,
 
 Jukka Zitting
 



Re: [jr3] Flat hierarchy

2010-02-19 Thread Thomas Müller
About micro-kernel: I think the micro-kernel shouldn't have to support
splitting large child lists into smaller lists. That could be done on
a higher level. What it should support is all the features required by
this mechanism:

- Support 'hidden' nodes (those are marked using a hidden property).
That means the path doesn't always map directly to the stored
structure. Therefore the micro-kernel should not be directly
responsible to build or interpret JCR paths (micro-kernel path are
similar but they don't always match the JCR path).

- The entry in the child node list may contain multiple properties (in
most cases the name, and the child node id; but sometimes also the
reference of the next child node). The number of properties for each
entry is the same however. For sorting, only the first element is
relevant.

- The child node list can always be stored in sorted order. But this
sorting doesn't always map to the JCR child node list.

Regards,
Thomas


Re: [jr3] Flat hierarchy

2010-02-18 Thread Thomas Müller
Hi,

I would also use a b-tree structure. If a node has too many child
nodes, two new invisible internal nodes are created, and the list of
child nodes is split up. Those internal nodes wouldn't have any
properties.

For efficient path lookup, the child node list should be sorted by
name. This is a bit tricky.

Currently, when adding a node, it is added as the last child. I
suggest to change that behavior, and add it at the right place by
default (so that the sort order is preserved). Like this, a path
lookup is very fast even if there are many child nodes (binary search
/ b-tree). Is that an acceptable change (usability and spec conform)?

If the user changes the child node order (manually re-ordered the
nodes), then the sort order is broken. Then the path lookup has to
scan through all nodes. While that's much slower, I think it's
acceptable. One alternative is to use a linked list (each child node
points to the next child node), which is very problematic for sharable
nodes.

So there would be a hidden flag 'child nodes are sorted by name'.

Regards,
Thomas


Re: [jr3] Flat hierarchy

2010-02-18 Thread Alexander Klimetschek
On Thu, Feb 18, 2010 at 12:34, Thomas Müller thomas.muel...@day.com wrote:
 I would also use a b-tree structure. If a node has too many child
 nodes, two new invisible internal nodes are created, and the list of
 child nodes is split up. Those internal nodes wouldn't have any
 properties.

You mean a b-tree for each node? I think this could be a separate
index, but one for the whole tree.

 If the user changes the child node order (manually re-ordered the
 nodes), then the sort order is broken. Then the path lookup has to
 scan through all nodes. While that's much slower, I think it's
 acceptable.

I think supporting fast path lookups for orderable child nodes is a
bit more important than flat hierarchies, at least for CMS
applications. So we should make it fast in that case as well.

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetsc...@day.com


Re: [jr3] Flat hierarchy

2010-02-18 Thread Thomas Müller
Hi,

 I would also use a b-tree structure. If a node has too many child
 nodes, two new invisible internal nodes are created, and the list of
 child nodes is split up. Those internal nodes wouldn't have any
 properties.

 You mean a b-tree for each node? I think this could be a separate
 index, but one for the whole tree.

The repository is one large b-tree, and each JCR node is a b-tree node
(except for the JCR nodes that don't have any child nodes: those are
b-tree leaves). If a JCR node has many child nodes, then there is at
least one more level of b-tree nodes between the node and the child
nodes.

 I think supporting fast path lookups for orderable child nodes is a
 bit more important than flat hierarchies

Path lookups would still be fast (the same speed as now), except for
large child node lists that were re-ordered. The difference is only
for large child node lists. There is a difference between 'orderable'
nodes (have the ability to reorder the child node list) and actually
're-ordered' child node lists. Is it acceptable if new nodes appear in
lexicographic order in the child node list?

Regards,
Thomas


Re: [jr3] Flat hierarchy

2010-02-18 Thread Stefan Guggisberg
On Wed, Feb 17, 2010 at 5:17 PM, Jukka Zitting jukka.zitt...@gmail.com wrote:
 Hi,

 This one's a quite frequently asked feature. Currently we store the
 full list of child nodes in the same bundle or node state record with
 the parent node. This makes it expensive to support nodes with large
 amounts (1k) of child nodes.

in my experience notable write-performance impacts start with 10-20k
child nodes ;)
10-20k is quite a lot IMO, most applications probably won't ever hit
this 'limit'.

i agree that flat hierarchies is an important feature, however we
shouldn't compromise
the performance for the 'standard' case with less than 1k child nodes
just for the sake
of supporting flat hierarchies. after all, the JCR api is inherently
hierarchic.

maybe we should provide separate node types or node type attributes to
indicate that
instances of  that type may have a large number of child nodes. nodes
of that node
type would be represented/handled differently in the persistence layer.

cheers
stefan


 The obvious solution to this problem is to use a data structure like a
 B-tree to keep track of the child node entries. Doing this efficiently
 while still supporting both orderable nodes and same-name-siblings
 (i.e. nt:unstructured) probably requires some deeper thought. An added
 benefit would be the ability to do this on top of an append-only
 storage model.

 Any other ideas on how we could do this?

 BR,

 Jukka Zitting



Re: [jr3] Flat hierarchy

2010-02-18 Thread Jukka Zitting
Hi,

On Thu, Feb 18, 2010 at 2:01 PM, Thomas Müller thomas.muel...@day.com wrote:
 The repository is one large b-tree, and each JCR node is a b-tree node
 (except for the JCR nodes that don't have any child nodes: those are
 b-tree leaves).

I don't think we can model the entire repository as a B-tree (it's
certainly a tree, but the B-tree restructuring and balancing features
don't apply). Instead I'd store the child node lists of each node as
separate B-trees.

 Path lookups would still be fast (the same speed as now), except for
 large child node lists that were re-ordered. The difference is only
 for large child node lists. There is a difference between 'orderable'
 nodes (have the ability to reorder the child node list) and actually
 're-ordered' child node lists. Is it acceptable if new nodes appear in
 lexicographic order in the child node list?

Instead of modifying the standard insertion order semantics for
orderable nodes, we could add a next pointer to the B-tree entries
to overlay a linked list on top of the B-tree structure. This makes
the data structure more complex but allows us to maintain support for
orderable nodes.

BR,

Jukka Zitting


Re: [jr3] Flat hierarchy

2010-02-18 Thread Thomas Müller
Hi,

A Jackrabbit repository is some kind of b-tree - just that the pages
never split and balanced automatically. Maybe using b-tree is
confusing? Let's call it manual b-tree then.

 i agree that flat hierarchies is an important feature, however we
 shouldn't compromise
 the performance for the 'standard' case with less than 1k child nodes

I agree. Using the b-tree style wouldn't slow down the standard case.
In the standard case things would stay exactly like they are now.

 add a next pointer
 This makes the data structure more complex but allows us to maintain support 
 for orderable nodes.

That's definitely an option. I just wonder if it's really required. I
guess we will find out.

Regards,
Thomas


Re: [jr3] Flat hierarchy

2010-02-18 Thread Jeff Yemin

I think Jukka is correct that the correct use of B-trees is to use one for
each list of child nodes, not as a way to model the entire hierarchy.  I've
seen this implementation approach used for similar use cases, so I know that
it can work.  And there are modifications that can easily be applied to
B-trees that deal with arbitrary (not based on a key) ordering of the nodes,
even allowing you to skip to the n-th child efficiently (without having to
traverse the entire tree).

The part that's not clear to me is how this can be efficiently combined with
an append-only storage format that's being discussed on the [jr3] Unified
persistence thread.  It wouldn't be good if every time a list of children
is modified the persistence layer has to make a complete copy of the
modified B-tree, as that would defeat the purpose of use a b-tree in the
first place, wouldn't it?

Jeff
-- 
View this message in context: 
http://n4.nabble.com/jr3-Flat-hierarchy-tp1558925p1560288.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.


Re: [jr3] Flat hierarchy

2010-02-18 Thread Thomas Müller
Hi,

 I think Jukka is correct that the correct use of B-trees is to use one for
 each list of child nodes, not as a way to model the entire hierarchy.

If you are more comfortable with this view, that's OK. I probably
should have said: the whole repository is a tree data structure.

 And there are modifications that can easily be applied to
 B-trees that deal with arbitrary (not based on a key) ordering of the nodes

Sure. Jackrabbit needs a way to quickly navigate to a node by path.
For that, you have to traverse the nodes and for each node you have to
find the correct child. To do that, it's better if the child node list
is ordered by name. Otherwise you have to iterate over all child nodes
until you find the right one. Or you need a secondary index (Lucene?).
And that's no matter if it's using a b-tree internally or not.

 The part that's not clear to me is how this can be efficiently combined with
 an append-only storage format that's being discussed on the [jr3] Unified
 persistence thread.  It wouldn't be good if every time a list of children
 is modified the persistence layer has to make a complete copy of the
 modified B-tree

You only have to update the b-tree node that is modified. That may be
a hidden node (internal, hidden b-tree node) or a real node.

Regards,
Thomas


Re: [jr3] Flat hierarchy

2010-02-18 Thread Jeff Yemin

JCR requires lookup of children by name and/or position (for orderable
children), so the implementation needs to support all these cases
efficiently.  The trickiest one to handle is probably Node.getNodes(String
namePattern) because it requires using both name and position together.

Jeff

-- 
View this message in context: 
http://n4.nabble.com/jr3-Flat-hierarchy-tp1558925p1560524.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.


Re: [jr3] Flat hierarchy

2010-02-18 Thread Thomas Müller
Hi,

 JCR requires lookup of children by name and/or position (for orderable
 children), so the implementation needs to support all these cases
 efficiently.  The trickiest one to handle is probably Node.getNodes(String
 namePattern) because it requires using both name and position together.

While it's true that all that needs to be supported, I doubt that we
should try to optimize for all cases. Otherwise the normal case will
be slower.

Usually, there are not that many child nodes. In that case lookup is
not a problem: both a array and a hash map can be used (in memory).

If there are many child nodes, then we should try to optimize for the
most important case. I think it doesn't make sense to optimize for the
case that the long list (many thousand) children are manually
re-ordered (using orderBefore).

Regards,
Thomas


Re: [jr3] Flat hierarchy

2010-02-18 Thread Jeff Yemin


Thomas Müller-2 wrote:
 
 If there are many child nodes, then we should try to optimize for the
 most important case. I think it doesn't make sense to optimize for the
 case that the long list (many thousand) children are manually
 re-ordered (using orderBefore).
 

Even without using orderBefore, the specification still requires a stable
ordering of children for nodes that support orderable child nodes (see 23.1
of JCR 2.0 spec).  One possibility is to limit the use of the B-tree to
nodes that do not support orderable children, and fall back to the simple
array for orderable children.  Of the use cases I'm aware of for flat
hierarchies, none require an ordering of the children anyway, so this might
be an acceptable limitation in the implementation.

Jeff

-- 
View this message in context: 
http://n4.nabble.com/jr3-Flat-hierarchy-tp1558925p1560650.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.


Re: [jr3] Flat hierarchy

2010-02-18 Thread Thomas Müller
Hi,

 Even without using orderBefore, the specification still requires a stable
 ordering of children for nodes that support orderable child nodes (see 23.1
 of JCR 2.0 spec).

Thanks for the link! I see now my idea violates the specification
which says (23.3) When a child node is added to a node that has
orderable child nodes it is added to the end of the list. My idea was
to add the child node according to its name (until the order is
changed using orderBefore).

 One possibility is to limit the use of the B-tree to nodes that do not 
 support orderable children

You are right. Unfortunately orderable child nodes is the default.

Another solution is to keep a linked list from child node entry to
child node entry (only in this case). Let's see how complicated that
is.

By the way same name siblings would work fine (no linked list required).

Regards,
Thomas


Re: [jr3] Flat hierarchy

2010-02-18 Thread Jeff Yemin


Thomas Müller-2 wrote:
 
 You are right. Unfortunately orderable child nodes is the default.
 

Where in the spec does it say what the default is for orderable child nodes? 
The only place I see it is in the CND for nt:unstructured.  Or is this a
Jackrabbit implementation default that users now rely on?

Jeff

-- 
View this message in context: 
http://n4.nabble.com/jr3-Flat-hierarchy-tp1558925p1560831.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.


Re: [jr3] Flat hierarchy

2010-02-18 Thread Alexander Klimetschek
On Thu, Feb 18, 2010 at 22:04, Jeff Yemin jeff.ye...@gmail.com wrote:

 Thomas Müller-2 wrote:
 You are right. Unfortunately orderable child nodes is the default.

 Where in the spec does it say what the default is for orderable child nodes?
 The only place I see it is in the CND for nt:unstructured.  Or is this a
 Jackrabbit implementation default that users now rely on?

I think Thomas refers to the commonly used nt:unstructured that has
orderable child nodes. Even if you inherit from it, you can't change
that behavior.


 Of the use cases I'm aware of for flat
 hierarchies, none require an ordering of the children anyway, so this might
 be an acceptable limitation in the implementation.

Right, there are either a low number of nodes that might be orderable
or a large number of nodes, that are unordered. At least these are the
cases that jr3 should be optimized for.

So for the unordered case, one should use a node type that does not
have orderable child nodes. If someone choses to add a lot of nodes to
an orderable child nodes list, it is ok to be slow.

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetsc...@day.com


Re: [jr3] Flat hierarchy

2010-02-18 Thread Justin Edelson
On 2/18/10 4:27 PM, Alexander Klimetschek wrote:
 On Thu, Feb 18, 2010 at 22:04, Jeff Yemin jeff.ye...@gmail.com wrote:

 Thomas Müller-2 wrote:
 You are right. Unfortunately orderable child nodes is the default.

 Where in the spec does it say what the default is for orderable child nodes?
 The only place I see it is in the CND for nt:unstructured.  Or is this a
 Jackrabbit implementation default that users now rely on?
 
 I think Thomas refers to the commonly used nt:unstructured that has
 orderable child nodes. Even if you inherit from it, you can't change
 that behavior.

Perhaps JR3 should define a node type called unstructured-unordered :)
Would be non-standard, of course.

Justin

 
 
 Of the use cases I'm aware of for flat
 hierarchies, none require an ordering of the children anyway, so this might
 be an acceptable limitation in the implementation.
 
 Right, there are either a low number of nodes that might be orderable
 or a large number of nodes, that are unordered. At least these are the
 cases that jr3 should be optimized for.
 
 So for the unordered case, one should use a node type that does not
 have orderable child nodes. If someone choses to add a lot of nodes to
 an orderable child nodes list, it is ok to be slow.
 
 Regards,
 Alex
 



Re: [jr3] Flat hierarchy

2010-02-18 Thread Alexander Klimetschek
On Thu, Feb 18, 2010 at 22:36, Justin Edelson justinedel...@gmail.com wrote:
 Perhaps JR3 should define a node type called unstructured-unordered :)
 Would be non-standard, of course.

He he ;-) But this could simply be documented clearly, eg. in the FAQ,
and provide a sample node type my:bigpile that has residual child
node definitions, but no orderable flag.

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetsc...@day.com


Re: [jr3] Flat hierarchy

2010-02-18 Thread Jeff Yemin


Thomas Müller-2 wrote:
 
 The part that's not clear to me is how this can be efficiently combined
 with
 an append-only storage format that's being discussed on the [jr3]
 Unified
 persistence thread.  It wouldn't be good if every time a list of
 children
 is modified the persistence layer has to make a complete copy of the
 modified B-tree
 
 You only have to update the b-tree node that is modified. That may be
 a hidden node (internal, hidden b-tree node) or a real node.
 

In that case re-balancing of the tree will have to be taken into account,
where multiple internal nodes of the tree can be modified as a result of an
insertion or deletion.

Also, it seems to me that technically we're probably talking about a B+
tree, as it makes sense that the real child nodes would all be pointed to
from the leaf nodes.

Jeff

-- 
View this message in context: 
http://n4.nabble.com/jr3-Flat-hierarchy-tp1558925p1560979.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.


[jr3] Flat hierarchy

2010-02-17 Thread Jukka Zitting
Hi,

This one's a quite frequently asked feature. Currently we store the
full list of child nodes in the same bundle or node state record with
the parent node. This makes it expensive to support nodes with large
amounts (1k) of child nodes.

The obvious solution to this problem is to use a data structure like a
B-tree to keep track of the child node entries. Doing this efficiently
while still supporting both orderable nodes and same-name-siblings
(i.e. nt:unstructured) probably requires some deeper thought. An added
benefit would be the ability to do this on top of an append-only
storage model.

Any other ideas on how we could do this?

BR,

Jukka Zitting