[jira] Commented: (JCR-2760) Use hash codes instead of sequence numbers for string indexes

Thomas Mueller (JIRA) Tue, 28 Sep 2010 23:43:04 -0700

    [ 
https://issues.apache.org/jira/browse/JCR-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916051#action_12916051
 ]


Thomas Mueller commented on JCR-2760:
-------------------------------------

I'm not sure if this is really a big problem, so this is just FYI: With 24 bit, 
the probability of collision is higher. With 1024 entries, the probability is 
about 3%. The 1% Jukka mentioned is correct for 32 bit. The same formula as for 
UUIDs applies, see also 
http://en.wikipedia.org/wiki/Universally_unique_identifier and 
http://www.h2database.com/html/advanced.html#uuid

2^5=32 probability: 0.003051711246848665%
2^6=64 probability: 0.012206286222260498%
2^7=128 probability: 0.04881620601105974%
2^8=256 probability: 0.19512188925244756%
2^9=512 probability: 0.7782061739756485%
2^10=1024 probability: 3.076676552365587%
2^11=2048 probability: 11.750309741540455%
2^12=4096 probability: 39.346934028736655%
2^13=8192 probability: 86.46647167633873%
2^14=16384 probability: 99.96645373720975%

double x = Math.pow(2, 24);
for (int i = 5; i < 15; i++) {
    double n = Math.pow(2, i);
    double p = 1 - Math.exp(-(n * n) / 2 / x);
    System.out.println("2^" + i + "=" + (1L << i) +
            " probability: " + (p * 100) + "%");
}



> Use hash codes instead of sequence numbers for string indexes
> -------------------------------------------------------------
>
>                 Key: JCR-2760
>                 URL: https://issues.apache.org/jira/browse/JCR-2760
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>            Reporter: Jukka Zitting
>            Priority: Minor
>         Attachments: 
> 0001-JCR-2760-Use-hash-codes-instead-of-sequence-numbers.patch
>
>
> We use index numbers instead of namespace URIs or other strings in many 
> places. The two-way mapping between namespace URIs and index numbers is by 
> default stored in the repository-global ns_idx.properties file, and the index 
> numbers are allocated using a linear sequence. The problem with this approach 
> is that two repositories will easily end up with different string index 
> mappings, which makes it practically impossible to make low-level copies of 
> workspace content across repositories.
> The ultimate solution for this problem would be to store the namespace URIs 
> closer to the stored content, ideally as an implementation detail of a 
> persistence manager.
> An easier short-term solution would be to decrease the chances of two 
> repositories having different string index mappings. A simple (and 
> backwards-compatible) way to do this is to use the hash code of a namespace 
> URI as the basis of allocating a new index number. Hash collisions are fairly 
> unlikely, and can be handled by incrementing the intial hash code until the 
> collision is avoided. In the common case of no collisions (with a uniform 
> hash function the chance of a collision is less than 1% even with tousands of 
> registered namespaces) this solution allows workspaces to be copied between 
> repositories without worrying about the namespace index mappings.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (JCR-2760) Use hash codes instead of sequence numbers for string indexes

Reply via email to