[ https://issues.apache.org/jira/browse/JCR-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916051#action_12916051 ]
Thomas Mueller commented on JCR-2760: ------------------------------------- I'm not sure if this is really a big problem, so this is just FYI: With 24 bit, the probability of collision is higher. With 1024 entries, the probability is about 3%. The 1% Jukka mentioned is correct for 32 bit. The same formula as for UUIDs applies, see also http://en.wikipedia.org/wiki/Universally_unique_identifier and http://www.h2database.com/html/advanced.html#uuid 2^5=32 probability: 0.003051711246848665% 2^6=64 probability: 0.012206286222260498% 2^7=128 probability: 0.04881620601105974% 2^8=256 probability: 0.19512188925244756% 2^9=512 probability: 0.7782061739756485% 2^10=1024 probability: 3.076676552365587% 2^11=2048 probability: 11.750309741540455% 2^12=4096 probability: 39.346934028736655% 2^13=8192 probability: 86.46647167633873% 2^14=16384 probability: 99.96645373720975% double x = Math.pow(2, 24); for (int i = 5; i < 15; i++) { double n = Math.pow(2, i); double p = 1 - Math.exp(-(n * n) / 2 / x); System.out.println("2^" + i + "=" + (1L << i) + " probability: " + (p * 100) + "%"); } > Use hash codes instead of sequence numbers for string indexes > ------------------------------------------------------------- > > Key: JCR-2760 > URL: https://issues.apache.org/jira/browse/JCR-2760 > Project: Jackrabbit Content Repository > Issue Type: Improvement > Components: jackrabbit-core > Reporter: Jukka Zitting > Priority: Minor > Attachments: > 0001-JCR-2760-Use-hash-codes-instead-of-sequence-numbers.patch > > > We use index numbers instead of namespace URIs or other strings in many > places. The two-way mapping between namespace URIs and index numbers is by > default stored in the repository-global ns_idx.properties file, and the index > numbers are allocated using a linear sequence. The problem with this approach > is that two repositories will easily end up with different string index > mappings, which makes it practically impossible to make low-level copies of > workspace content across repositories. > The ultimate solution for this problem would be to store the namespace URIs > closer to the stored content, ideally as an implementation detail of a > persistence manager. > An easier short-term solution would be to decrease the chances of two > repositories having different string index mappings. A simple (and > backwards-compatible) way to do this is to use the hash code of a namespace > URI as the basis of allocating a new index number. Hash collisions are fairly > unlikely, and can be handled by incrementing the intial hash code until the > collision is avoided. In the common case of no collisions (with a uniform > hash function the chance of a collision is less than 1% even with tousands of > registered namespaces) this solution allows workspaces to be copied between > repositories without worrying about the namespace index mappings. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.