Hi,

> mark-and-sweep garbage collection
> I agree that it's slow and late, but I don't think either is a big
> problem.

I think it is. But it is not very important to decide on the garbage
collection algorithm at this stage; it is still possible to switch the
algorithm later on, although that is a bit more work.

> The garbage collection process can be run in the background
> (it doesn't block normal access) so performance isn't essential

It can't run while others are writing to the repository, and I think
that's a problem. Example: let's say the garbage collection algorithm
scans the repository from left to right, 'S' marks the current scan
position, and 'A' is a link to 'File A'. At the beginning of the scan,
the repository looks like this:
S----------------------------------------------------------A---
after some time:
----------S------------------------------------------------A---
now somebody moves the node that contains A:
---A------------------S-----------------------------------------
the scan finishes and didn't find a reference to A:
---A-----------------------------------------------------------S
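
To make the failure mode concrete, here is a minimal, timing-dependent
Java sketch (all names made up, nothing Jackrabbit-specific): a scan
thread walks 100 slots left to right while a second thread moves the
only reference to 'A' from a slot ahead of the scan to one behind it,
so the scan never marks 'A' and a sweep would delete live data.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicReferenceArray;

    public class SweepRace {
        // slot index -> referenced record id (null = empty slot)
        static final AtomicReferenceArray<String> slots =
                new AtomicReferenceArray<>(100);
        static final Set<String> marked = ConcurrentHashMap.newKeySet();

        public static void main(String[] args) throws InterruptedException {
            slots.set(60, "A"); // the only reference to record 'A'

            Thread scan = new Thread(() -> {
                for (int i = 0; i < slots.length(); i++) {
                    String id = slots.get(i);
                    if (id != null) marked.add(id);
                    sleep(1); // scanning takes time
                }
            });
            Thread move = new Thread(() -> {
                sleep(20);           // the scan has passed slot 3 by now
                slots.set(3, "A");   // copy behind the scan: never seen
                slots.set(60, null); // original gone before the scan
                                     // reaches it
            });
            scan.start(); move.start();
            scan.join(); move.join();

            // 'A' is still referenced (slot 3) but was never marked,
            // so this prints "A marked: false"
            System.out.println("A marked: " + marked.contains("A"));
        }

        static void sleep(long ms) {
            try { Thread.sleep(ms); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }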

> given the amount of space that the approach saves in typical setups
> I'm not too worried about reclaiming unused space later than
> necessary.

That depends on the setup. If you use the repository to manage movies
(large files, no versioning), the unreclaimed space adds up quickly,
and then I would be worried.

> The main problem I have with reference counting in this case is that
> it would bind the data store into transaction handling

Yes, a little bit.

> and all related issues.

Could you clarify? I would increment the counts early (before
committing) and decrement the counts late (after the commit); the
worst case after a crash is then a counter that is too high, and that
should be seldom.
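
Roughly like this (hypothetical DataStore / Change interfaces, not
the real Jackrabbit API):

    import java.util.List;

    interface DataStore {
        void incrementRefCount(String id);
        void decrementRefCount(String id);
    }

    interface Change {
        List<String> addedReferences();
        List<String> removedReferences();
        void commit() throws Exception;
    }

    class RefCountingCommit {
        // Increment early, decrement late: a crash can only leave
        // counters that are too high (garbage kept a bit longer),
        // never too low (live data deleted).
        static void saveChanges(DataStore store, Change change)
                throws Exception {
            for (String id : change.addedReferences()) {
                store.incrementRefCount(id);      // before the commit
            }
            try {
                change.commit();
            } catch (Exception e) {
                for (String id : change.addedReferences()) {
                    store.decrementRefCount(id);  // undo early increments
                }
                throw e;
            }
            for (String id : change.removedReferences()) {
                store.decrementRefCount(id);      // after the commit
            }
        }
    }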

Actually, what about back references: each large object knows which
nodes (it thinks) are pointing to it. Mark and sweep would then be
trivial, and the additional space used would be minimal compared to a
large object.
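
A rough sketch of the idea (names made up, not Jackrabbit API): each
large record keeps the ids of the nodes that reference it, and
collection only re-checks those back references instead of scanning
the whole repository.

    import java.util.HashSet;
    import java.util.Set;

    interface Repository {
        // true if the node still holds a reference to the record
        boolean nodeReferences(String nodeId, String recordId);
    }

    class LargeRecord {
        final String id;
        final Set<String> backRefs = new HashSet<>();

        LargeRecord(String id) { this.id = id; }

        boolean isGarbage(Repository repo) {
            // stale entries can survive a crash, so each back
            // reference is verified against the node before the
            // record is collected
            backRefs.removeIf(nodeId -> !repo.nodeReferences(nodeId, id));
            return backRefs.isEmpty();
        }
    }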

> It would also introduce locking inside the data store to avoid
> problems with concurrent reference changes.

Manipulating references to large objects is not that common, I think:
moving nodes (maybe) and versioning. I would use simple 'synchronized'
blocks.
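
Something like this (hypothetical names): because reference changes
are rare, a plain synchronized block per record should be enough; no
lock manager or transaction coupling is needed.

    import java.util.HashSet;
    import java.util.Set;

    class GuardedBackRefs {
        private final Set<String> refs = new HashSet<>();

        void add(String nodeId) {
            synchronized (refs) { refs.add(nodeId); }
        }

        void remove(String nodeId) {
            synchronized (refs) { refs.remove(nodeId); }
        }

        boolean isEmpty() {
            synchronized (refs) { return refs.isEmpty(); }
        }
    }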

> > why not store large Strings in the global data store
> I was thinking of perhaps adding isString() and getString() methods to
> DataRecord for checking whether a given binary stream is valid UTF-8
> and for retrieving the encoded string value in case it is.

I probably lost you here. The application decides if it wants to use
PropertyType.STRING or PropertyType.BINARY. No need to guess the type
from the byte array. I was thinking about storing large instances of
PropertyType.STRING (java.lang.String) as a file.
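
Since the type is known up front, storing a large string could be as
simple as this sketch (the names and the threshold are made up):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    class LargeStringStore {
        static final int LARGE_THRESHOLD = 64 * 1024; // assumed limit

        // rough size check on the character count
        static boolean isLarge(String value) {
            return value.length() > LARGE_THRESHOLD;
        }

        // write the UTF-8 bytes of a large PropertyType.STRING value
        static void write(Path file, String value) throws IOException {
            Files.write(file, value.getBytes(StandardCharsets.UTF_8));
        }

        // read it back; no type guessing from the byte stream needed
        static String read(Path file) throws IOException {
            return new String(Files.readAllBytes(file),
                    StandardCharsets.UTF_8);
        }
    }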

> Together with the above inline mechanism we should in fact be able to
> make no distinction between binary and string values in the
> persistence layer.

Yes. You could add a property 'isLarge' to InternalValue, or you could
extend InternalValue. Actually I think InternalValue is quite memory
intensive: it uses two objects for each INTEGER (the InternalValue
instance plus the wrapped number object). I suggest using an
interface, with InternalValueInt, InternalValueString,
InternalValueLong and so on. And/or use a cache for the most commonly
used objects (integers 0-1000, the empty String, boolean true/false).
But that's another discussion. Sorry.
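
As a sketch of that suggestion (hypothetical names; the real
InternalValue looks different): one interface, a specialized
implementation per type, and a value cache similar to
Integer.valueOf():

    interface Value {
        int getInt();
        String getString();
    }

    final class IntValue implements Value {
        // cache for the most commonly used values, 0-1000
        private static final IntValue[] CACHE = new IntValue[1001];
        static {
            for (int i = 0; i <= 1000; i++) CACHE[i] = new IntValue(i);
        }

        private final int value; // primitive: no boxed Integer object

        private IntValue(int value) { this.value = value; }

        static IntValue of(int i) {
            return (i >= 0 && i <= 1000) ? CACHE[i] : new IntValue(i);
        }

        public int getInt() { return value; }
        public String getString() { return Integer.toString(value); }
    }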

Thomas
