[jira] Commented: (JCR-2857) Support sequential (non-random) node ids
[ https://issues.apache.org/jira/browse/JCR-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977334#action_12977334 ]

Michael Dürig commented on JCR-2857:

I'm a bit skeptical whether generating sequential node ids is the right approach in general. It relies on assumptions about the underlying persistent store and exposes them throughout the implementation and partially up to the JCR API. I'd rather separate the concepts 'identifier' and 'locality'. That is, I wouldn't encode locality hints into the identifiers directly (like you do) but pass some locality hint (i.e. which nodes are likely to be accessed together) to the persistent store. The latter can use these hints and its knowledge about the characteristics of the storage mechanism to optimize access.

For the original discussion see http://markmail.org/thread/3jzqjy6cavxcrpbq

> Support sequential (non-random) node ids
>
> Key: JCR-2857
> URL: https://issues.apache.org/jira/browse/JCR-2857
> Project: Jackrabbit Content Repository
> Issue Type: Improvement
> Components: jackrabbit-core
> Reporter: Thomas Mueller
> Assignee: Thomas Mueller
> Attachments: jcr-2857.patch
>
> Currently, node ids are generated using a (cryptographically secure pseudo-)
> random number generator. This has many advantages (easy to implement, easy
> to merge nodes from multiple repositories or cluster nodes), but is a
> performance bottleneck for large repositories.
> In addition to generating random node ids, Jackrabbit should support
> generating sequential node ids.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (JCR-2857) Support sequential (non-random) node ids
[ https://issues.apache.org/jira/browse/JCR-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977678#action_12977678 ]

Thomas Mueller commented on JCR-2857:

Sequential node ids are much faster than random node ids. I can't think of *any* case where random ids are faster. For the 'append only' use case, I believe sequential node ids are the fastest possible solution.

In many (most?) cases multiple nodes are created at a time (example: nt:file / nt:resource). Those node groups are then often accessed at the same time. Even if only two nodes are generated at a time on average, the node id index is twice as efficient when using sequential node ids.

Sequential node ids don't 'expose' anything. They only improve the performance characteristics. How node ids are generated doesn't affect the API in any way.

The only problem with sequential node ids (that I know of) arises when using a cluster, and when importing nodes from other repositories while trying to preserve the node ids. For such cases, random node ids are easier to work with.

How much sequential node ids will affect real use cases is not clear. This needs to be tested. Testing it is much simpler if Jackrabbit supports sequential node ids in the default build (but of course disabled by default). Therefore, unless there is strong opposition, I will apply my patch in the next few days.

Possible enhancements: Support configuring the most significant bits in the repository.xml, or take the most significant bits from the unique repository id / cluster node id. Implement a 'cluster aware' node id generator that doesn't need configuration.
[jira] Commented: (JCR-2857) Support sequential (non-random) node ids
[ https://issues.apache.org/jira/browse/JCR-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977893#action_12977893 ]

Jukka Zitting commented on JCR-2857:

The UUIDs of referenceable nodes must be globally unique to avoid problems when moving content between repositories. It must be possible to export a referenceable node from repository A and import it to repository B without worrying about UUID conflicts. We should be careful not to break this constraint, not even when a user makes a configuration mistake!

One possible way to do this might be to generate a random UUID during startup and use that as the basis of an incremental sequence of identifiers. At the next startup a new random base UUID would get generated. I'm not sure what this approach would do to UUID collision statistics.

On the code side I'd rather leave the UUID generation strategy up to the persistence manager implementation as it'll be best equipped to know what identifier distribution will work best with the underlying storage mechanism. Thus instead of a repository-wide NodeIdFactory, I'd add a createNodeId() factory method to the PersistenceManager interface and wire our code to use that method whenever a new identifier is needed.

In the long run I agree with Michael about the need to keep UUIDs and storage locations as separate concepts. For example, we could look at turning the NodeId class into an opaque interface with no required relationship with the JCR UUIDs visible through the jcr:uuid property. Each persistence manager could then choose to store whatever information it likes in the NodeId instances it creates, and we could use separate UUID instances (or simply identifier strings) to track node references and for things like Session.getNodeByIdentifier().
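
The factory-method idea in the comment above could look roughly like the following. This is only a minimal sketch under stated assumptions: IdSource, RandomIdSource, IncrementalIdSource and IdSourceDemo are hypothetical names, java.util.UUID stands in for Jackrabbit's NodeId class, and only the method name createNodeId() comes from the comment itself.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the persistence manager; only createNodeId() is named in the thread. */
interface IdSource {
    UUID createNodeId();
}

/** A store with no locality preference can keep handing out random (version 4) UUIDs. */
final class RandomIdSource implements IdSource {
    public UUID createNodeId() {
        return UUID.randomUUID();
    }
}

/** A store with an ordered id index: a base chosen once at startup, then a counter. */
final class IncrementalIdSource implements IdSource {
    private final long base;
    private final AtomicLong counter = new AtomicLong();

    IncrementalIdSource(long base) {
        this.base = base;
    }

    public UUID createNodeId() {
        // The base fills the most significant bits; the counter fills the rest.
        return new UUID(base, counter.getAndIncrement());
    }
}

final class IdSourceDemo {
    public static void main(String[] args) {
        // Assumption: the base is taken from a fresh random UUID at each startup.
        IdSource store = new IncrementalIdSource(UUID.randomUUID().getMostSignificantBits());
        System.out.println(store.createNodeId());
        System.out.println(store.createNodeId());
    }
}
```

Callers only see createNodeId(), so each store is free to pick the id distribution that suits its storage mechanism.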
[jira] Commented: (JCR-2857) Support sequential (non-random) node ids
[ https://issues.apache.org/jira/browse/JCR-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978215#action_12978215 ]

Thomas Mueller commented on JCR-2857:

One solution to ensure uniqueness is to use a unique repository-wide id as the base or most significant bits of the node id. I think it's better not to change this base id on each startup. With the patch, it's already possible to emulate this (set the system property jackrabbit.sequentialNodeId to an explicit value, for example "14f0acef/0").

The patch lets you 'test drive' sequential node ids, and includes the necessary refactoring of the node id generation (the NodeIdFactory mechanism), but the patch doesn't generate a random base id automatically yet. I will change that: when jackrabbit.sequentialNodeId is set to "true", use a random base id instead of 0/0.

I agree that in the long term it makes sense to let the persistence layer generate unique node ids. In my J3 prototype this is already implemented. For the current Jackrabbit code, it would mean a lot of changes because each component would need to have a reference to the persistence layer, or the persistence layer would have to generate node id factories. But I don't think Jackrabbit would be much faster if the persistence layer generated the node ids; it would just make sense on an architecture level in the long term. And if we want to replace the current Jackrabbit code with new code anyway, it doesn't make sense to change that now.
[jira] Commented: (JCR-2857) Support sequential (non-random) node ids
[ https://issues.apache.org/jira/browse/JCR-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979597#action_12979597 ]

Thomas Mueller commented on JCR-2857:

Revision 1057220 and revision 1057229.

The feature is disabled by default. If the system property "jackrabbit.sequentialNodeId" is set to "true", then the most significant bits are set to a cryptographically secure random number, except for the bits that normally contain the UUID version number, which are set to 0 (so the node id can't clash with a version 1-5 UUID).

That means the node id contains 56 bits of 'repository identifier'. This is good enough to run a few thousand cluster nodes; the probability of duplicate repository identifiers is about 0.0003 for 65536 repositories. But the feature doesn't provide the same guarantee for *globally* unique identifiers as normal UUIDs do. If such a guarantee is required, the most significant bits could be set to the MAC address (but in that case you could only use one repository per MAC address).
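
The scheme described in this comment can be sketched as follows. This is not the code from the committed revisions: SequentialIdSketch is a hypothetical class, java.util.UUID stands in for Jackrabbit's NodeId, and the sketch clears only the four version bits, so the exact bit layout (and the number of random bits left over) may differ from the patch.

```java
import java.security.SecureRandom;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch; not the class from the JCR-2857 patch. */
final class SequentialIdSketch {
    // In java.util.UUID, bits 12..15 of the most significant long hold the version number.
    private static final long VERSION_MASK = 0x000000000000F000L;

    private final long msb;
    private final AtomicLong next = new AtomicLong();

    SequentialIdSketch(long randomBits) {
        // Clear the version bits so the ids cannot clash with version 1-5 UUIDs.
        this.msb = randomBits & ~VERSION_MASK;
    }

    /** Picks the repository-wide base from a cryptographically secure generator. */
    static SequentialIdSketch withRandomBase() {
        return new SequentialIdSketch(new SecureRandom().nextLong());
    }

    /** Same most significant bits every time; the least significant bits count upwards. */
    UUID nextId() {
        return new UUID(msb, next.getAndIncrement());
    }

    public static void main(String[] args) {
        SequentialIdSketch ids = SequentialIdSketch.withRandomBase();
        UUID first = ids.nextId();
        UUID second = ids.nextId();
        System.out.println(first + " version=" + first.version());
        System.out.println(second + " version=" + second.version());
    }
}
```

Consecutive ids differ only in the low bits, which is what keeps nodes created together close to each other in an ordered id index.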