[jira] Commented: (JCR-2857) Support sequential (non-random) node ids

2011-01-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/JCR-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977334#action_12977334
 ] 

Michael Dürig commented on JCR-2857:


I'm a bit skeptical whether generating sequential node ids is the right 
approach in general. It relies on assumptions about the underlying persistent 
store and exposes these throughout the implementation and partially up to the 
JCR API. I'd rather separate the concepts 'identifier' and 'locality'. That is, 
I wouldn't encode locality hints into the identifiers directly (like you do) 
but pass some locality hint (i.e. which nodes are likely to be accessed 
together) to the persistent store. The latter can use these and its knowledge 
about the characteristics of the storage mechanism to optimize access. 
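
A minimal sketch of this separation, assuming a hypothetical
LocalityAwareStore interface (none of these names exist in Jackrabbit):
identifiers stay random, and the caller passes a separate hint naming the
group a node is likely to be read with. The store is free to use or ignore
the hint.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: identifiers carry no locality information; the
// hint is passed separately and interpreted only by the store.
class LocalityHintSketch {

    interface LocalityAwareStore {
        void store(UUID id, String data, String localityHint);
        List<UUID> idsStoredTogether(String localityHint);
    }

    // Toy store that keeps nodes sharing a hint in the same bucket.
    static class BucketStore implements LocalityAwareStore {
        private final Map<String, List<UUID>> buckets = new HashMap<>();
        private final Map<UUID, String> data = new HashMap<>();

        public void store(UUID id, String value, String localityHint) {
            data.put(id, value);
            buckets.computeIfAbsent(localityHint, k -> new ArrayList<>()).add(id);
        }

        public List<UUID> idsStoredTogether(String localityHint) {
            return buckets.getOrDefault(localityHint, new ArrayList<>());
        }
    }

    public static void main(String[] args) {
        BucketStore store = new BucketStore();
        // Both ids are random; only the shared hint groups them.
        store.store(UUID.randomUUID(), "parent", "group-1");
        store.store(UUID.randomUUID(), "child", "group-1");
        System.out.println(store.idsStoredTogether("group-1").size());
    }
}
```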

For the original discussion see http://markmail.org/thread/3jzqjy6cavxcrpbq

> Support sequential (non-random) node ids
> 
>
> Key: JCR-2857
> URL: https://issues.apache.org/jira/browse/JCR-2857
> Project: Jackrabbit Content Repository
>  Issue Type: Improvement
>  Components: jackrabbit-core
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
> Attachments: jcr-2857.patch
>
>
> Currently, node ids are generated using a (cryptographically secure pseudo-) 
> random number generator. This has many advantages (easy to implement, easy 
> to merge nodes from multiple repositories or cluster nodes), but is a 
> performance bottleneck for large repositories.
> In addition to generating random node ids, Jackrabbit should support 
> generating sequential node ids.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (JCR-2857) Support sequential (non-random) node ids

2011-01-05 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/JCR-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977678#action_12977678
 ] 

Thomas Mueller commented on JCR-2857:

Sequential node ids are much faster than random node ids. I can't think of 
*any* case where random ids are faster. For the 'append only' use case, I 
believe sequential node ids are the fastest possible solution. 

In many (most?) cases multiple nodes are created at a time (example: nt:file / 
nt:resource). Those node groups are then often accessed at the same time. Even 
if only two nodes are generated at a time on average, the node id index is 
twice as efficient when using sequential node ids.
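
The locality argument can be illustrated with a toy model. This is a sketch
under stated assumptions, not a measurement of Jackrabbit: the id index is
modelled as a sorted array split into fixed-size pages, nodes are created in
pairs, and each pair is later read together. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Toy model: the id index is a sorted array split into fixed-size pages.
// Nodes are created in pairs, and each pair is later read together.
class IndexLocalitySketch {

    // Counts index pages touched when every pair is read on its own.
    static int pagesTouched(long[] idsInCreationOrder, int entriesPerPage) {
        long[] sorted = idsInCreationOrder.clone();
        Arrays.sort(sorted);
        int touched = 0;
        for (int i = 0; i + 1 < idsInCreationOrder.length; i += 2) {
            Set<Integer> pages = new HashSet<>();
            pages.add(Arrays.binarySearch(sorted, idsInCreationOrder[i]) / entriesPerPage);
            pages.add(Arrays.binarySearch(sorted, idsInCreationOrder[i + 1]) / entriesPerPage);
            touched += pages.size();
        }
        return touched;
    }

    static long[] sequentialIds(int count) {
        long[] ids = new long[count];
        for (int i = 0; i < count; i++) {
            ids[i] = i;
        }
        return ids;
    }

    static long[] randomIds(int count, long seed) {
        Random random = new Random(seed);
        long[] ids = new long[count];
        for (int i = 0; i < count; i++) {
            ids[i] = random.nextLong();
        }
        return ids;
    }

    public static void main(String[] args) {
        int count = 10_000;
        int entriesPerPage = 100;
        System.out.println("sequential: " + pagesTouched(sequentialIds(count), entriesPerPage));
        System.out.println("random:     " + pagesTouched(randomIds(count, 42L), entriesPerPage));
    }
}
```

In this model a sequential pair always lands on one page, while a random
pair almost always lands on two, which is where the factor of two comes
from.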

Sequential node ids don't 'expose' anything. They only improve the performance 
characteristic. How node ids are generated doesn't affect the API in any way. 
The only problem with sequential node ids (that I know of) arises when using a 
cluster, and when importing nodes from other repositories while trying to 
preserve the node ids. For such cases, random node ids are easier to work with.

How much sequential node ids will affect real use cases is not clear. This 
needs to be tested. Testing it is much simpler if Jackrabbit supports 
sequential node ids in the default build (but of course disabled by default). 
Therefore, unless there is strong opposition, I will apply my patch in the 
next few days.

Possible enhancements: Support configuring the most significant bits in the 
repository.xml, or take the most significant bits from the unique repository id 
/ cluster node id. Implement a 'cluster aware' node id generator that doesn't 
need configuration.





[jira] Commented: (JCR-2857) Support sequential (non-random) node ids

2011-01-05 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/JCR-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977893#action_12977893
 ] 

Jukka Zitting commented on JCR-2857:


The UUIDs of referenceable nodes must be globally unique to avoid problems when 
moving content between repositories. It must be possible to export a 
referenceable node from repository A and import it to repository B without 
worrying about UUID conflicts. We should be careful not to break this 
constraint, not even when a user makes a configuration mistake!

One possible way to do this might be to generate a random UUID during startup 
and use that as the basis of an incremental sequence of identifiers. At the 
next startup a new random base UUID would get generated. I'm not sure what this 
approach would do to UUID collision statistics.
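
A minimal sketch of this idea, assuming a hypothetical generator class that
is created once per startup. It is not taken from the patch: the base is a
fresh random UUID, and each new identifier increments the low 64 bits.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: one random base UUID per startup, then an
// incremental sequence of identifiers derived from it.
class StartupBaseIdSketch {

    private final long msb;
    private final AtomicLong nextLsb;

    StartupBaseIdSketch() {
        // A new random base is chosen every time the generator is created.
        UUID base = UUID.randomUUID();
        this.msb = base.getMostSignificantBits();
        this.nextLsb = new AtomicLong(base.getLeastSignificantBits());
    }

    UUID next() {
        // Same high bits as the base, low bits count upwards.
        return new UUID(msb, nextLsb.getAndIncrement());
    }

    public static void main(String[] args) {
        StartupBaseIdSketch generator = new StartupBaseIdSketch();
        System.out.println(generator.next());
        System.out.println(generator.next());
    }
}
```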

On the code side I'd rather leave the UUID generation strategy up to the 
persistence manager implementation as it'll be best equipped to know what 
identifier distribution will work best with the underlying storage mechanism. 
Thus instead of a repository-wide NodeIdFactory, I'd add a createNodeId() 
factory method to the PersistenceManager interface and wire our code to use 
that method whenever a new identifier is needed.
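
As a rough sketch of that wiring: the interface below is a hypothetical,
reduced stand-in for the real PersistenceManager interface, and it returns
java.util.UUID instead of NodeId to stay self-contained. Each implementation
picks the id distribution that suits its storage.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: id generation is delegated to the persistence
// manager instead of a repository-wide factory.
class PersistenceManagerIdSketch {

    // Reduced stand-in; the real interface has many more methods.
    interface IdGeneratingPersistenceManager {
        UUID createNodeId();
    }

    // A store with no ordering preference keeps using random ids.
    static class RandomIdManager implements IdGeneratingPersistenceManager {
        public UUID createNodeId() {
            return UUID.randomUUID();
        }
    }

    // A store that benefits from ordered keys hands out sequential ids.
    static class SequentialIdManager implements IdGeneratingPersistenceManager {
        private final long base;
        private final AtomicLong counter = new AtomicLong();

        SequentialIdManager(long base) {
            this.base = base;
        }

        public UUID createNodeId() {
            return new UUID(base, counter.getAndIncrement());
        }
    }

    // Calling code asks the configured manager and never picks a strategy.
    static UUID newNode(IdGeneratingPersistenceManager pm) {
        return pm.createNodeId();
    }

    public static void main(String[] args) {
        System.out.println(newNode(new RandomIdManager()));
        System.out.println(newNode(new SequentialIdManager(7L)));
    }
}
```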

In the long run I agree with Michael about the need to keep UUIDs and storage 
locations as separate concepts. For example, we could look at turning the 
NodeId class into an opaque interface with no required relationship with the 
JCR UUIDs visible through the jcr:uuid property. Each persistence manager could 
then choose to store whatever information it likes in the NodeId instances it 
creates, and we could use separate UUID instances (or simply identifier 
strings) to track node references and for things like 
Session.getNodeByIdentifier().
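
One way to picture this, as a sketch only: NodeId becomes a marker interface,
a persistence manager supplies its own implementation, and a separate table
maps identifier strings to the opaque ids. OffsetNodeId and IdentifierIndex
are hypothetical names, not Jackrabbit classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: storage ids and JCR identifiers are kept apart.
class OpaqueNodeIdSketch {

    // Opaque handle; only the persistence manager that created it
    // knows what is inside.
    interface NodeId {
    }

    // Example internal id: a position in an append-only file.
    static final class OffsetNodeId implements NodeId {
        final long offset;

        OffsetNodeId(long offset) {
            this.offset = offset;
        }
    }

    // Separate mapping from identifier strings to internal ids, as would
    // be needed for lookups by identifier.
    static class IdentifierIndex {
        private final Map<String, NodeId> byIdentifier = new HashMap<>();

        void register(String identifier, NodeId id) {
            byIdentifier.put(identifier, id);
        }

        // Returns null when the identifier is unknown.
        NodeId lookup(String identifier) {
            return byIdentifier.get(identifier);
        }
    }

    public static void main(String[] args) {
        IdentifierIndex index = new IdentifierIndex();
        index.register("example-identifier", new OffsetNodeId(4096L));
        OffsetNodeId found = (OffsetNodeId) index.lookup("example-identifier");
        System.out.println(found.offset);
    }
}
```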




[jira] Commented: (JCR-2857) Support sequential (non-random) node ids

2011-01-06 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/JCR-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978215#action_12978215
 ] 

Thomas Mueller commented on JCR-2857:

One solution to ensure uniqueness is to use a unique repository-wide id as the 
base or most significant bits of the node id. I think it's better not to change 
this base id on each startup. With the patch, it's already possible to emulate 
this (set the system property jackrabbit.sequentialNodeId to a slash-separated 
pair of values, for example "14f0acef/0"). The patch lets you 'test drive' 
sequential node ids, and includes the necessary refactoring of the node id 
generation (the NodeIdFactory mechanism), but it doesn't generate a random base 
id automatically yet. I will change that: when jackrabbit.sequentialNodeId is 
set to "true", use a random base id instead of 0/0.
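
A sketch of how such a property value could be interpreted. This is not the
patch code, and it rests on two assumptions: the part before the slash is the
base (most significant bits) and the part after it is the first sequence
value, and both are hexadecimal. The class and method names are hypothetical.

```java
import java.security.SecureRandom;

// Hypothetical sketch of reading jackrabbit.sequentialNodeId.
class SequentialNodeIdConfigSketch {

    // Returns {base, firstSequenceValue}, or null when the feature is off.
    static long[] parse(String value) {
        if (value == null) {
            return null; // property not set: keep random node ids
        }
        if (value.equals("true")) {
            // Random base instead of 0/0.
            return new long[] { new SecureRandom().nextLong(), 0L };
        }
        int slash = value.indexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("expected two values separated by '/': " + value);
        }
        // Assumption: both parts are hexadecimal.
        long base = Long.parseUnsignedLong(value.substring(0, slash), 16);
        long first = Long.parseUnsignedLong(value.substring(slash + 1), 16);
        return new long[] { base, first };
    }

    public static void main(String[] args) {
        long[] parsed = parse(System.getProperty("jackrabbit.sequentialNodeId"));
        System.out.println(parsed == null ? "disabled" : Long.toHexString(parsed[0]) + "/" + parsed[1]);
    }
}
```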

I agree that in the long term it makes sense to let the persistence layer 
generate unique node ids. In my J3 prototype this is already implemented. For 
the current Jackrabbit code, it would mean a lot of changes because each 
component would need to have a reference to the persistence layer, or the 
persistence layer would have to generate node id factories. I don't think 
Jackrabbit would be much faster if the persistence layer generated the node 
ids; it would just make sense on an architecture level in the long term. But if 
we want to replace the current Jackrabbit code with new code anyway, it doesn't 
make sense to change that now.





[jira] Commented: (JCR-2857) Support sequential (non-random) node ids

2011-01-10 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/JCR-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979597#action_12979597
 ] 

Thomas Mueller commented on JCR-2857:

Revision 1057220 and revision 1057229.

The feature is disabled by default. If the system property 
"jackrabbit.sequentialNodeId" is set to "true", then the most significant bits 
are set to a cryptographically secure random number, except for the bits that 
normally contain the UUID version number, which are set to 0 (so the node id 
can't clash with a version 1-5 UUID). That means the node id contains 56 bits 
of 'repository identifier'. This is good enough to run a few thousand cluster 
nodes; the probability of duplicate repository identifiers is about 0.0003 
for 65536 repositories. But the feature doesn't provide the same guarantee for 
*globally* unique identifiers as normal UUIDs do. If such a guarantee is 
required, the most significant bits could be set to the MAC address (but in 
that case you could only use one repository per MAC address).
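
The version-bit masking can be sketched with java.util.UUID, where the
version is stored in bits 12 to 15 of the most significant long. The sketch
clears only those four bits; it does not try to reproduce the committed code
or its 56-bit figure, and the class and method names are hypothetical.

```java
import java.security.SecureRandom;
import java.util.UUID;

// Hypothetical sketch: random base with the UUID version field set to 0.
class VersionZeroBaseSketch {

    // Bits 12-15 of the most significant long hold the UUID version.
    static final long VERSION_MASK = 0xF000L;

    static long randomBase(SecureRandom random) {
        // Clear the version bits so the id reports version 0.
        return random.nextLong() & ~VERSION_MASK;
    }

    public static void main(String[] args) {
        long base = randomBase(new SecureRandom());
        UUID first = new UUID(base, 0L);
        UUID second = new UUID(base, 1L);
        System.out.println(first + " version " + first.version());
        System.out.println(second + " version " + second.version());
    }
}
```

Because UUID.randomUUID() always reports version 4, an id built this way
cannot be equal to a standard random UUID.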
