Hi,
> > mark-and-sweep garbage collection
> I agree that it's slow and late, but I don't think either is a big problem.
I think it is. But it is not very important to decide which garbage collection algorithm to use at this stage; it is still possible to switch the algorithm later on. OK, it is a bit more work.
> The garbage collection process can be run in the background (it doesn't block normal access), so performance isn't essential.
It can't run while others are writing to the repository, and I think that's a problem. Example: let's say the garbage collection algorithm scans the repository from left to right, 'S' marks the current scan position, and 'A' is a link to 'File A'. At the beginning of the scan, the repository looks like this:

S----------------------------------------------------------A---

After some time:

----------S------------------------------------------------A---

Now somebody moves the node that contains 'A' to a place the scan has already passed:

---A------------------S-----------------------------------------

The scan finishes without having found a reference to 'A', so 'File A' would be deleted even though it is still referenced:

---A-----------------------------------------------------------S
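To make the interleaving concrete, here is a toy Java sketch of that race. The array-of-slots model and all names are made up, and the "concurrent" writer is inlined into the scan loop to keep it deterministic:

    public class ScanRace {
        public static void main(String[] args) {
            String[] slots = new String[8];   // a tiny repository, scanned left to right
            slots[6] = "A";                   // the only reference to File A
            java.util.Set<String> marked = new java.util.HashSet<String>();
            for (int scan = 0; scan < slots.length; scan++) {
                if (scan == 3) {
                    // the writer moves the node containing A to a slot
                    // the scanner has already passed
                    slots[1] = slots[6];
                    slots[6] = null;
                }
                if (slots[scan] != null) {
                    marked.add(slots[scan]);
                }
            }
            // prints "A marked: false" - File A would be swept even
            // though it is still referenced from slot 1
            System.out.println("A marked: " + marked.contains("A"));
        }
    }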
> Given the amount of space that the approach saves in typical setups, I'm not too worried about reclaiming unused space later than necessary.
That depends on the setup. If you use the repository to manage movies (no versioning), then I would be worried.
> The main problem I have with reference counting in this case is that it would bind the data store into transaction handling
Yes, a little bit.
> and all related issues.
Could you clarify? I would increment the counts early (before committing) and decrement the counts late (after the commit); then the worst case, after a crash, is a counter that is too high (which should be seldom). Actually, what about back references: each large object knows who (it thinks) is pointing to it. Mark-and-sweep would then be trivial. The additional space used would be minimal (compared to a large object).
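Roughly like this (just a sketch of the ordering; replaceReference, incrementUsage and so on are made-up names, not existing Jackrabbit API):

    public class EarlyLateExample {
        // increment early, decrement late: a crash between the steps can
        // only leave a counter that is too high (the object is kept too
        // long), never too low (the object is never deleted too early)
        public void replaceReference(String oldId, String newId) {
            incrementUsage(newId);   // 1. before the commit
            commit(oldId, newId);    // 2. the actual change
            decrementUsage(oldId);   // 3. after the commit
        }

        private void incrementUsage(String id) { /* persist counter + 1 */ }
        private void decrementUsage(String id) { /* persist counter - 1 */ }
        private void commit(String oldId, String newId) { /* the real transaction */ }
    }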
> It would also introduce locking inside the data store to avoid problems with concurrent reference changes.
Manipulating references to large objects is not that common, I think: moving nodes (maybe) and versioning. I would use simple 'synchronized' blocks, something like the sketch below.
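For example (a made-up class, not existing code; just a synchronized map of usage counters):

    public class UsageCounter {
        private final java.util.Map<String, Integer> counts =
                new java.util.HashMap<String, Integer>();

        // plain 'synchronized' is enough here because reference changes
        // to large objects are rare
        public synchronized void increment(String id) {
            Integer c = counts.get(id);
            counts.put(id, c == null ? 1 : c + 1);
        }

        public synchronized void decrement(String id) {
            Integer c = counts.get(id);
            counts.put(id, c == null ? 0 : c - 1);
        }

        public synchronized boolean isUnused(String id) {
            Integer c = counts.get(id);
            return c == null || c.intValue() <= 0;
        }
    }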
> > why not store large Strings in the global data store
> I was thinking of perhaps adding isString() and getString() methods to DataRecord for checking whether a given binary stream is valid UTF-8 and for retrieving the encoded string value in case it is.
I probably lost you here. The application decides whether it wants to use PropertyType.STRING or PropertyType.BINARY, so there is no need to guess the type from the byte array. I was thinking about storing large instances of PropertyType.STRING (java.lang.String) as a file.
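Something like this (only a sketch; the 4 KB threshold and all method names are made up, and the real code would of course go through the data store API):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    public class LargeStringStore {
        // hypothetical threshold: strings above this size go to a file
        private static final int MAX_INLINE_BYTES = 4096;

        public void storeString(String value, File directory) throws IOException {
            byte[] utf8 = value.getBytes("UTF-8");
            if (utf8.length <= MAX_INLINE_BYTES) {
                storeInline(utf8);   // small values stay inline as today
            } else {
                // a large string is written out just like a large binary
                File file = File.createTempFile("string", ".dat", directory);
                OutputStream out = new FileOutputStream(file);
                try {
                    out.write(utf8);
                } finally {
                    out.close();
                }
            }
        }

        private void storeInline(byte[] utf8) { /* normal inline storage */ }
    }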
> Together with the above inline mechanism we should in fact be able to make no distinction between binary and string values in the persistence layer.
Yes. You could add a property 'isLarge' to InternalValue, or you could extend InternalValue. Actually, I think InternalValue is quite memory intensive: it uses two objects for each INTEGER. I suggest using an interface with one implementation per type: InternalValueInt, InternalValueString, InternalValueLong, and so on. And/or use a cache for the most commonly used objects (integers 0-1000, the empty String, boolean true/false); a rough sketch is below my signature. But that's another discussion. Sorry.

Thomas
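PS: here is the rough sketch of the interface plus cache idea; the names and the 0-1000 cache range are only suggestions, not existing code:

    // sketch: InternalValue turned into an interface
    interface InternalValue {
        int getType();  // a javax.jcr.PropertyType constant
    }

    // one small implementation per type
    final class InternalValueLong implements InternalValue {
        // shared instances for common values, like java.lang.Integer caching
        private static final InternalValueLong[] CACHE = new InternalValueLong[1001];
        static {
            for (int i = 0; i < CACHE.length; i++) {
                CACHE[i] = new InternalValueLong(i);
            }
        }

        private final long value;

        private InternalValueLong(long value) {
            this.value = value;
        }

        static InternalValueLong valueOf(long v) {
            // values 0-1000 reuse the cached instance
            return (v >= 0 && v < CACHE.length) ? CACHE[(int) v] : new InternalValueLong(v);
        }

        public int getType() {
            return 3;  // javax.jcr.PropertyType.LONG
        }

        public long getLong() {
            return value;
        }
    }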
