Re: [jr3] Flat hierarchy
On 18 Feb 2010, at 13:04, Stefan Guggisberg wrote: On Wed, Feb 17, 2010 at 5:17 PM, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi, This one's a quite frequently asked feature. Currently we store the full list of child nodes in the same bundle or node state record with the parent node. This makes it expensive to support nodes with large amounts (1k) of child nodes. in my experience notable write-performance impacts start with 10-20k child nodes ;) 10-20k is quite a lot IMO, most applications probably won't ever hit this 'limit'. I am not arguing for flat hierarchies but I spend a lot of time persuading UX designers and management that 100k or more child nodes in not acceptable in a hierarchical content system. With an instance supporting 4M identified users with a unique user name, it hard to argue that /home/uid is not acceptable when there are so many other systems that support this. (we have modified the UserManager to scale) i agree that flat hierarchies is an important feature, however we shouldn't compromise the performance for the 'standard' case with less than 1k child nodes just for the sake of supporting flat hierarchies. after all, the JCR api is inherently hierarchic. maybe we should provide separate node types or node type attributes to indicate that instances of that type may have a large number of child nodes. nodes of that node type would be represented/handled differently in the persistence layer. This is the approach that we have taken, however its not without problems when trying to discover whats in /home Ian cheers stefan The obvious solution to this problem is to use a data structure like a B-tree to keep track of the child node entries. Doing this efficiently while still supporting both orderable nodes and same-name-siblings (i.e. nt:unstructured) probably requires some deeper thought. An added benefit would be the ability to do this on top of an append-only storage model. Any other ideas on how we could do this? BR, Jukka Zitting
Re: [jr3] Flat hierarchy
About micro-kernel: I think the micro-kernel shouldn't have to support splitting large child lists into smaller lists. That could be done on a higher level. What it should support is all the features required by this mechanism: - Support 'hidden' nodes (those are marked using a hidden property). That means the path doesn't always map directly to the stored structure. Therefore the micro-kernel should not be directly responsible to build or interpret JCR paths (micro-kernel path are similar but they don't always match the JCR path). - The entry in the child node list may contain multiple properties (in most cases the name, and the child node id; but sometimes also the reference of the next child node). The number of properties for each entry is the same however. For sorting, only the first element is relevant. - The child node list can always be stored in sorted order. But this sorting doesn't always map to the JCR child node list. Regards, Thomas
Re: [jr3] Flat hierarchy
Hi, I would also use a b-tree structure. If a node has too many child nodes, two new invisible internal nodes are created, and the list of child nodes is split up. Those internal nodes wouldn't have any properties. For efficient path lookup, the child node list should be sorted by name. This is a bit tricky. Currently, when adding a node, it is added as the last child. I suggest to change that behavior, and add it at the right place by default (so that the sort order is preserved). Like this, a path lookup is very fast even if there are many child nodes (binary search / b-tree). Is that an acceptable change (usability and spec conform)? If the user changes the child node order (manually re-ordered the nodes), then the sort order is broken. Then the path lookup has to scan through all nodes. While that's much slower, I think it's acceptable. One alternative is to use a linked list (each child node points to the next child node), which is very problematic for sharable nodes. So there would be a hidden flag 'child nodes are sorted by name'. Regards, Thomas
Re: [jr3] Flat hierarchy
On Thu, Feb 18, 2010 at 12:34, Thomas Müller thomas.muel...@day.com wrote: I would also use a b-tree structure. If a node has too many child nodes, two new invisible internal nodes are created, and the list of child nodes is split up. Those internal nodes wouldn't have any properties. You mean a b-tree for each node? I think this could be a separate index, but one for the whole tree. If the user changes the child node order (manually re-ordered the nodes), then the sort order is broken. Then the path lookup has to scan through all nodes. While that's much slower, I think it's acceptable. I think supporting fast path lookups for orderable child nodes is a bit more important than flat hierarchies, at least for CMS applications. So we should make it fast in that case as well. Regards, Alex -- Alexander Klimetschek alexander.klimetsc...@day.com
Re: [jr3] Flat hierarchy
Hi, I would also use a b-tree structure. If a node has too many child nodes, two new invisible internal nodes are created, and the list of child nodes is split up. Those internal nodes wouldn't have any properties. You mean a b-tree for each node? I think this could be a separate index, but one for the whole tree. The repository is one large b-tree, and each JCR node is a b-tree node (except for the JCR nodes that don't have any child nodes: those are b-tree leaves). If a JCR node has many child nodes, then there is at least one more level of b-tree nodes between the node and the child nodes. I think supporting fast path lookups for orderable child nodes is a bit more important than flat hierarchies Path lookups would still be fast (the same speed as now), except for large child node lists that were re-ordered. The difference is only for large child node lists. There is a difference between 'orderable' nodes (have the ability to reorder the child node list) and actually 're-ordered' child node lists. Is it acceptable if new nodes appear in lexicographic order in the child node list? Regards, Thomas
Re: [jr3] Flat hierarchy
On Wed, Feb 17, 2010 at 5:17 PM, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi, This one's a quite frequently asked feature. Currently we store the full list of child nodes in the same bundle or node state record with the parent node. This makes it expensive to support nodes with large amounts (1k) of child nodes. in my experience notable write-performance impacts start with 10-20k child nodes ;) 10-20k is quite a lot IMO, most applications probably won't ever hit this 'limit'. i agree that flat hierarchies is an important feature, however we shouldn't compromise the performance for the 'standard' case with less than 1k child nodes just for the sake of supporting flat hierarchies. after all, the JCR api is inherently hierarchic. maybe we should provide separate node types or node type attributes to indicate that instances of that type may have a large number of child nodes. nodes of that node type would be represented/handled differently in the persistence layer. cheers stefan The obvious solution to this problem is to use a data structure like a B-tree to keep track of the child node entries. Doing this efficiently while still supporting both orderable nodes and same-name-siblings (i.e. nt:unstructured) probably requires some deeper thought. An added benefit would be the ability to do this on top of an append-only storage model. Any other ideas on how we could do this? BR, Jukka Zitting
Re: [jr3] Flat hierarchy
Hi, On Thu, Feb 18, 2010 at 2:01 PM, Thomas Müller thomas.muel...@day.com wrote: The repository is one large b-tree, and each JCR node is a b-tree node (except for the JCR nodes that don't have any child nodes: those are b-tree leaves). I don't think we can model the entire repository as a B-tree (it's certainly a tree, but the B-tree restructuring and balancing features don't apply). Instead I'd store the child node lists of each node as separate B-trees. Path lookups would still be fast (the same speed as now), except for large child node lists that were re-ordered. The difference is only for large child node lists. There is a difference between 'orderable' nodes (have the ability to reorder the child node list) and actually 're-ordered' child node lists. Is it acceptable if new nodes appear in lexicographic order in the child node list? Instead of modifying the standard insertion order semantics for orderable nodes, we could add a next pointer to the B-tree entries to overlay a linked list on top of the B-tree structure. This makes the data structure more complex but allows us to maintain support for orderable nodes. BR, Jukka Zitting
Re: [jr3] Flat hierarchy
Hi, A Jackrabbit repository is some kind of b-tree - just that the pages never split and balanced automatically. Maybe using b-tree is confusing? Let's call it manual b-tree then. i agree that flat hierarchies is an important feature, however we shouldn't compromise the performance for the 'standard' case with less than 1k child nodes I agree. Using the b-tree style wouldn't slow down the standard case. In the standard case things would stay exactly like they are now. add a next pointer This makes the data structure more complex but allows us to maintain support for orderable nodes. That's definitely an option. I just wonder if it's really required. I guess we will find out. Regards, Thomas
Re: [jr3] Flat hierarchy
I think Jukka is correct that the correct use of B-trees is to use one for each list of child nodes, not as a way to model the entire hierarchy. I've seen this implementation approach used for similar use cases, so I know that it can work. And there are modifications that can easily be applied to B-trees that deal with arbitrary (not based on a key) ordering of the nodes, even allowing you to skip to the n-th child efficiently (without having to traverse the entire tree). The part that's not clear to me is how this can be efficiently combined with an append-only storage format that's being discussed on the [jr3] Unified persistence thread. It wouldn't be good if every time a list of children is modified the persistence layer has to make a complete copy of the modified B-tree, as that would defeat the purpose of use a b-tree in the first place, wouldn't it? Jeff -- View this message in context: http://n4.nabble.com/jr3-Flat-hierarchy-tp1558925p1560288.html Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.
Re: [jr3] Flat hierarchy
Hi, I think Jukka is correct that the correct use of B-trees is to use one for each list of child nodes, not as a way to model the entire hierarchy. If you are more comfortable with this view, that's OK. I probably should have said: the whole repository is a tree data structure. And there are modifications that can easily be applied to B-trees that deal with arbitrary (not based on a key) ordering of the nodes Sure. Jackrabbit needs a way to quickly navigate to a node by path. For that, you have to traverse the nodes and for each node you have to find the correct child. To do that, it's better if the child node list is ordered by name. Otherwise you have to iterate over all child nodes until you find the right one. Or you need a secondary index (Lucene?). And that's no matter if it's using a b-tree internally or not. The part that's not clear to me is how this can be efficiently combined with an append-only storage format that's being discussed on the [jr3] Unified persistence thread. It wouldn't be good if every time a list of children is modified the persistence layer has to make a complete copy of the modified B-tree You only have to update the b-tree node that is modified. That may be a hidden node (internal, hidden b-tree node) or a real node. Regards, Thomas
Re: [jr3] Flat hierarchy
JCR requires lookup of children by name and/or position (for orderable children), so the implementation needs to support all these cases efficiently. The trickiest one to handle is probably Node.getNodes(String namePattern) because it requires using both name and position together. Jeff -- View this message in context: http://n4.nabble.com/jr3-Flat-hierarchy-tp1558925p1560524.html Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.
Re: [jr3] Flat hierarchy
Hi, JCR requires lookup of children by name and/or position (for orderable children), so the implementation needs to support all these cases efficiently. The trickiest one to handle is probably Node.getNodes(String namePattern) because it requires using both name and position together. While it's true that all that needs to be supported, I doubt that we should try to optimize for all cases. Otherwise the normal case will be slower. Usually, there are not that many child nodes. In that case lookup is not a problem: both a array and a hash map can be used (in memory). If there are many child nodes, then we should try to optimize for the most important case. I think it doesn't make sense to optimize for the case that the long list (many thousand) children are manually re-ordered (using orderBefore). Regards, Thomas
Re: [jr3] Flat hierarchy
Thomas Müller-2 wrote: If there are many child nodes, then we should try to optimize for the most important case. I think it doesn't make sense to optimize for the case that the long list (many thousand) children are manually re-ordered (using orderBefore). Even without using orderBefore, the specification still requires a stable ordering of children for nodes that support orderable child nodes (see 23.1 of JCR 2.0 spec). One possibility is to limit the use of the B-tree to nodes that do not support orderable children, and fall back to the simple array for orderable children. Of the use cases I'm aware of for flat hierarchies, none require an ordering of the children anyway, so this might be an acceptable limitation in the implementation. Jeff -- View this message in context: http://n4.nabble.com/jr3-Flat-hierarchy-tp1558925p1560650.html Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.
Re: [jr3] Flat hierarchy
Hi, Even without using orderBefore, the specification still requires a stable ordering of children for nodes that support orderable child nodes (see 23.1 of JCR 2.0 spec). Thanks for the link! I see now my idea violates the specification which says (23.3) When a child node is added to a node that has orderable child nodes it is added to the end of the list. My idea was to add the child node according to its name (until the order is changed using orderBefore). One possibility is to limit the use of the B-tree to nodes that do not support orderable children You are right. Unfortunately orderable child nodes is the default. Another solution is to keep a linked list from child node entry to child node entry (only in this case). Let's see how complicated that is. By the way same name siblings would work fine (no linked list required). Regards, Thomas
Re: [jr3] Flat hierarchy
Thomas Müller-2 wrote: You are right. Unfortunately orderable child nodes is the default. Where in the spec does it say what the default is for orderable child nodes? The only place I see it is in the CND for nt:unstructured. Or is this a Jackrabbit implementation default that users now rely on? Jeff -- View this message in context: http://n4.nabble.com/jr3-Flat-hierarchy-tp1558925p1560831.html Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.
Re: [jr3] Flat hierarchy
On Thu, Feb 18, 2010 at 22:04, Jeff Yemin jeff.ye...@gmail.com wrote: Thomas Müller-2 wrote: You are right. Unfortunately orderable child nodes is the default. Where in the spec does it say what the default is for orderable child nodes? The only place I see it is in the CND for nt:unstructured. Or is this a Jackrabbit implementation default that users now rely on? I think Thomas refers to the commonly used nt:unstructured that has orderable child nodes. Even if you inherit from it, you can't change that behavior. Of the use cases I'm aware of for flat hierarchies, none require an ordering of the children anyway, so this might be an acceptable limitation in the implementation. Right, there are either a low number of nodes that might be orderable or a large number of nodes, that are unordered. At least these are the cases that jr3 should be optimized for. So for the unordered case, one should use a node type that does not have orderable child nodes. If someone choses to add a lot of nodes to an orderable child nodes list, it is ok to be slow. Regards, Alex -- Alexander Klimetschek alexander.klimetsc...@day.com
Re: [jr3] Flat hierarchy
On 2/18/10 4:27 PM, Alexander Klimetschek wrote: On Thu, Feb 18, 2010 at 22:04, Jeff Yemin jeff.ye...@gmail.com wrote: Thomas Müller-2 wrote: You are right. Unfortunately orderable child nodes is the default. Where in the spec does it say what the default is for orderable child nodes? The only place I see it is in the CND for nt:unstructured. Or is this a Jackrabbit implementation default that users now rely on? I think Thomas refers to the commonly used nt:unstructured that has orderable child nodes. Even if you inherit from it, you can't change that behavior. Perhaps JR3 should define a node type called unstructured-unordered :) Would be non-standard, of course. Justin Of the use cases I'm aware of for flat hierarchies, none require an ordering of the children anyway, so this might be an acceptable limitation in the implementation. Right, there are either a low number of nodes that might be orderable or a large number of nodes, that are unordered. At least these are the cases that jr3 should be optimized for. So for the unordered case, one should use a node type that does not have orderable child nodes. If someone choses to add a lot of nodes to an orderable child nodes list, it is ok to be slow. Regards, Alex
Re: [jr3] Flat hierarchy
On Thu, Feb 18, 2010 at 22:36, Justin Edelson justinedel...@gmail.com wrote: Perhaps JR3 should define a node type called unstructured-unordered :) Would be non-standard, of course. He he ;-) But this could simply be documented clearly, eg. in the FAQ, and provide a sample node type my:bigpile that has residual child node definitions, but no orderable flag. Regards, Alex -- Alexander Klimetschek alexander.klimetsc...@day.com
Re: [jr3] Flat hierarchy
Thomas Müller-2 wrote: The part that's not clear to me is how this can be efficiently combined with an append-only storage format that's being discussed on the [jr3] Unified persistence thread. It wouldn't be good if every time a list of children is modified the persistence layer has to make a complete copy of the modified B-tree You only have to update the b-tree node that is modified. That may be a hidden node (internal, hidden b-tree node) or a real node. In that case re-balancing of the tree will have to be taken into account, where multiple internal nodes of the tree can be modified as a result of an insertion or deletion. Also, it seems to me that technically we're probably talking about a B+ tree, as it makes sense that the real child nodes would all be pointed to from the leaf nodes. Jeff -- View this message in context: http://n4.nabble.com/jr3-Flat-hierarchy-tp1558925p1560979.html Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.
[jr3] Flat hierarchy
Hi, This one's a quite frequently asked feature. Currently we store the full list of child nodes in the same bundle or node state record with the parent node. This makes it expensive to support nodes with large amounts (1k) of child nodes. The obvious solution to this problem is to use a data structure like a B-tree to keep track of the child node entries. Doing this efficiently while still supporting both orderable nodes and same-name-siblings (i.e. nt:unstructured) probably requires some deeper thought. An added benefit would be the ability to do this on top of an append-only storage model. Any other ideas on how we could do this? BR, Jukka Zitting