Hi,

On 4/24/07, David Nuescheler <[EMAIL PROTECTED]> wrote:
> A value record would essentially be an array of bytes as defined in
> Value.getStream(). In other words the integer value 123 and the string
> value "123" would both be stored in the same value record. More
> specific typing information would be indicated in the property record
> that refers to that value. For example an integer property and a
> string property could both point to the same value record, but have
> different property types that indicate the default interpretation of
> the value.

I think that with small values we have to keep in mind that the "key"
(value identifier) may be bigger than the actual value, and of course
the additional indirection also has a performance impact. Do you think
that we should consider a minimum size for values to be stored in this
manner? Personally, I think that this might make sense.
For consistency I would use such value records for all values, regardless of the value size. I'd like to keep the value identifiers as short as possible, optimally just 64 bits, to avoid too much storage and bandwidth overhead. The indirection costs could probably best be avoided by storing copies of short value contents along with the value identifiers where the values are referenced.
Anyway, what key did you have in mind? I would assume some sort of a
hash (MD5) could be great, or is this still more abstract?
I was thinking about something more concrete, like a direct disk
offset. The value identifier could for example be a 64-bit integer,
with the first 32 bits identifying the revision that contains the
value and the last 32 bits being the offset of the value record within
a "value file". I haven't yet calculated whether such a scheme gives
us a large enough identifier space.
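A minimal sketch of that packing, assuming the 32/32 split described
above (the class and method names are illustrative, not actual
Jackrabbit API):

```java
// Hypothetical layout of the 64-bit value identifier: the high 32 bits
// hold the revision number, the low 32 bits hold the offset of the
// value record within that revision's "value file".
public class ValueId {

    // Combine a revision number and a file offset into one 64-bit id.
    static long encode(int revision, int offset) {
        return ((long) revision << 32) | (offset & 0xFFFFFFFFL);
    }

    // Extract the revision number (high 32 bits).
    static int revisionOf(long id) {
        return (int) (id >>> 32);
    }

    // Extract the offset within the value file (low 32 bits).
    static int offsetOf(long id) {
        return (int) id;
    }
}
```

With 32 bits per half this allows about 4 billion revisions and a
4 GB value file per revision, which is the identifier-space question
raised above.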
> Name and path values are stored as strings using namespace prefixes
> from an internal namespace registry. Stability of such values is
> enforced by restricting this internal namespace registry to never
> remove or modify existing prefix mappings, only new namespace mappings
> can be added.

Sounds good. I assume that the "internal" namespace registry gets its
initial prefix mappings from the "public" namespace registry? I think
having the same prefixes could be beneficial, since remappings and
removals are very rare even in the public registry, and this would
allow us to optimize the more typical case even better.
Exactly. In most cases, like when using the standard JCR prefix mappings, the stored name and path values can be passed as-is through the JCR API.
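A toy sketch of such an append-only internal registry, seeded with the
standard JCR prefixes (the class and its methods are hypothetical
illustrations, not actual Jackrabbit code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an append-only internal namespace registry: existing
// prefix mappings can never be changed or removed, only new mappings
// can be added, so stored name/path strings stay stable forever.
public class InternalNamespaceRegistry {

    private final Map<String, String> prefixToUri = new HashMap<String, String>();

    public InternalNamespaceRegistry() {
        // Seed with the standard JCR prefix mappings so stored name
        // and path values can usually pass through the JCR API as-is.
        prefixToUri.put("jcr", "http://www.jcp.org/jcr/1.0");
        prefixToUri.put("nt", "http://www.jcp.org/jcr/nt/1.0");
        prefixToUri.put("mix", "http://www.jcp.org/jcr/mix/1.0");
    }

    // Add a new mapping; remapping an existing prefix is rejected
    // because stored values may already depend on it.
    public synchronized void register(String prefix, String uri) {
        String existing = prefixToUri.get(prefix);
        if (existing != null && !existing.equals(uri)) {
            throw new IllegalStateException(
                    "Prefix " + prefix + " is already mapped and immutable");
        }
        prefixToUri.put(prefix, uri);
    }

    public synchronized String getUri(String prefix) {
        return prefixToUri.get(prefix);
    }
}
```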
> Achieving uniqueness of the value records requires a way to determine
> whether an instance of a given value already exists. Some indexing is
> needed to avoid having to traverse the entire set of existing value
> records for each new value being created.

I agree, and I think we have to make sure that the overhead of
calculating the key (value identifier) is reasonable, so "insert
performance" doesn't suffer too much.
Note that the "value key" can well be different from the value identifier. I was thinking of using something like a standard (and fast) CRC code as the hash key for looking up potential matches. For large binaries we could also calculate a SHA checksum to avoid having to read through the entire byte stream when checking for equality. For short values the CRC coupled with an exact byte comparison should be good enough.
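A rough sketch of that two-level matching in Java, using
java.util.zip.CRC32 for the fast lookup key and a SHA-1 digest for
large binaries (the class and method names are illustrative
assumptions, not a proposed API):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.CRC32;

// Sketch of two-level value matching: a fast, non-cryptographic CRC32
// key narrows the set of candidate value records, then either an exact
// byte comparison (short values) or a stored SHA-1 digest (large
// binaries) confirms equality without re-reading the whole stream.
public class ValueMatcher {

    // Fast hash key used to look up candidate value records.
    static long crcKey(byte[] value) {
        CRC32 crc = new CRC32();
        crc.update(value);
        return crc.getValue();
    }

    // Stronger digest stored for large binaries so equality checks
    // need not read through the entire byte stream again.
    static byte[] shaDigest(byte[] value) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(value);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    // For short values: CRC match followed by an exact byte comparison.
    static boolean shortValuesEqual(byte[] a, byte[] b) {
        return crcKey(a) == crcKey(b) && Arrays.equals(a, b);
    }
}
```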
I could even see an asynchronous model that "inlines" values of all
sizes initially and then leaves it up to some sort of garbage
collection job to "extract" the large values and store them as
immutable value records... This could preserve "insert performance"
while still benefiting from efficient operations for things like copy
and clone, and of course from the space savings.
I would be ready to trade some insert performance for more
consistency, but let's see how much the cost would be in practice.

BR,

Jukka Zitting