Hi,

On 6/13/07, Thomas Mueller <[EMAIL PROTECTED]> wrote:
> I just read about the Global Data Store proposal
> (https://issues.apache.org/jira/browse/JCR-926) and I think it's a great idea:
Thanks. :-)
> However I am not sure if mark-and-sweep garbage collection (GC) is the
> best solution:
> - Very slow for large repositories
> - Need to stop everything to collect
> - Frees up space very late
I agree that it's slow and late, but I don't think either is a big problem. The garbage collection process can be run in the background (it doesn't block normal access) so performance isn't essential, and given the amount of space that the approach saves in typical setups I'm not too worried about reclaiming unused space later than necessary. The only problem would be when a user removes a huge file or many files at once in an attempt to release disk space, or if a DoS attack tries to fill the disk with dummy data. The former case is probably best handled by explicitly starting the garbage collector and the latter problem is IMHO best handled on the application level.
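To illustrate the background process described above, here is a minimal sketch of the mark-and-sweep idea over a file-based data store. The class name and methods are illustrative assumptions, not Jackrabbit's actual API:

```java
import java.io.File;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: mark-and-sweep over a directory that holds
// one file per data record.
public class DataStoreGarbageCollector {
    private final File directory;            // one file per data record
    private final Set<String> marked = new HashSet<String>();

    public DataStoreGarbageCollector(File directory) {
        this.directory = directory;
    }

    // Mark phase: the repository is scanned in the background and every
    // identifier that is still referenced is reported here. Normal
    // read/write access continues while the scan runs.
    public void mark(String identifier) {
        marked.add(identifier);
    }

    // Sweep phase: delete the records the scan never reported, i.e.
    // values no node references any more. A real implementation would
    // also need to keep records added after the scan started.
    public int sweep() {
        int deleted = 0;
        for (File record : directory.listFiles()) {
            if (!marked.contains(record.getName()) && record.delete()) {
                deleted++;
            }
        }
        return deleted;
    }
}
```

The sweep can run arbitrarily late without blocking writers, which is why the slowness is tolerable; an explicit trigger for the "user just deleted a huge file" case would simply run mark and sweep on demand.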
> To avoid problems with mark-and-sweep GC, I would use reference
> counting and file renaming.
The main problem I have with reference counting in this case is that it would bind the data store into transaction handling and all related issues. It would also introduce locking inside the data store to avoid problems with concurrent reference changes.
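To make the concern concrete, here is a hypothetical reference-counting sketch (not proposed code): every reference change must be synchronized, and each change must happen atomically with the repository update that caused it, which is exactly the coupling to transaction handling described above:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch showing why reference counting pulls the data
// store into transaction handling and locking.
public class RefCountingStore {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    // Must run atomically with the repository change that adds the
    // reference; a rolled-back transaction would have to undo it.
    public synchronized void addReference(String identifier) {
        counts.merge(identifier, 1, Integer::sum);
    }

    // Deleting the record as soon as the count hits zero is only safe
    // if no concurrent transaction is adding a reference at that moment,
    // hence the locking.
    public synchronized void removeReference(String identifier) {
        int c = counts.merge(identifier, -1, Integer::sum);
        if (c <= 0) {
            counts.remove(identifier);
            // the underlying record would be deleted here
        }
    }

    public synchronized int getCount(String identifier) {
        return counts.getOrDefault(identifier, 0);
    }
}
```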
> I would store small items (up to 1 KB or so) like regular values, to
> avoid lots of tiny files.
That should be fairly easy to achieve with the current DataIdentifier interface:

    public class InlineDataIdentifier implements DataIdentifier {
        private final byte[] data;
        ...
    }
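A standalone sketch of how such an inline identifier could work (hypothetical; it omits the actual DataIdentifier interface and picks Base64 as an arbitrary encoding):

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical sketch: an identifier that carries small binary values
// inline, so no file is ever created or looked up for them.
public class InlineDataIdentifier {
    private final byte[] data;

    public InlineDataIdentifier(byte[] data) {
        if (data.length > 1024) {
            throw new IllegalArgumentException("inline values limited to 1 KB");
        }
        this.data = data.clone();
    }

    public byte[] getData() {
        return data.clone();
    }

    // Encode the value itself as the identifier string.
    @Override
    public String toString() {
        return "inline:" + Base64.getEncoder().encodeToString(data);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof InlineDataIdentifier
                && Arrays.equals(data, ((InlineDataIdentifier) o).data);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(data);
    }
}
```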
> In the future, why not store large Strings in the global data store as
> well? Some applications store large XML documents as Strings. However,
> this is not urgent: large Strings can be stored as binary values.
I was thinking of perhaps adding isString() and getString() methods to DataRecord for checking whether a given binary stream is valid UTF-8 and for retrieving the encoded string value in case it is. It should be easy to do the UTF-8 validation with almost zero overhead while the binary stream is being imported. Together with the above inline mechanism we should in fact be able to make no distinction between binary and string values in the persistence layer.

BR,

Jukka Zitting
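P.S. The streaming validation could be sketched as a small byte-level state machine fed while the stream is copied. This is a hypothetical sketch, not Jackrabbit code, and it is deliberately loose: it accepts some overlong sequences and surrogate code points that a strict UTF-8 validator would reject:

```java
// Hypothetical sketch: incremental UTF-8 validation, one byte at a
// time, so a binary stream can be checked while being imported.
public class Utf8Validator {
    private int remaining = 0;   // continuation bytes still expected
    private boolean valid = true;

    public void update(byte b) {
        int u = b & 0xff;
        if (remaining > 0) {
            if ((u & 0xc0) == 0x80) {
                remaining--;             // valid continuation byte
            } else {
                valid = false;
                remaining = 0;
            }
        } else if (u < 0x80) {
            // ASCII byte, nothing to do
        } else if ((u & 0xe0) == 0xc0) {
            remaining = 1;               // start of a 2-byte sequence
        } else if ((u & 0xf0) == 0xe0) {
            remaining = 2;               // start of a 3-byte sequence
        } else if ((u & 0xf8) == 0xf0) {
            remaining = 3;               // start of a 4-byte sequence
        } else {
            valid = false;               // byte can never appear in UTF-8
        }
    }

    // Valid only if no sequence was malformed and none is left open.
    public boolean isValid() {
        return valid && remaining == 0;
    }
}
```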