Hi,

On 6/13/07, Thomas Mueller <[EMAIL PROTECTED]> wrote:
> I just read about the Global Data Store proposal
> (https://issues.apache.org/jira/browse/JCR-926) and I think it's a
> great idea:

Thanks. :-)

> However I am not sure if mark-and-sweep garbage collection (GC) is the
> best solution:
>
> - Very slow for large repositories
> - Need to stop everything to run it safely
> - Frees up space very late

I agree that it's slow and late, but I don't think either is a big
problem. The garbage collection process can be run in the background
(it doesn't block normal access) so performance isn't essential, and
given the amount of space that the approach saves in typical setups
I'm not too worried about reclaiming unused space later than
necessary. The only problem would be when a user removes a huge file
or many files at once in an attempt to release disk space, or if a DoS
attack tries to fill the disk with dummy data. The former case is
probably best handled by explicitly starting the garbage collector and
the latter problem is IMHO best handled on the application level.
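
As an illustration, the background collection could look roughly like
the sketch below. Files are "marked" by bumping their modification
time, and anything the scan didn't touch gets swept. The
scanAllReferencedFiles() helper is a made-up placeholder for the
repository traversal, not part of the proposal.

    import java.io.File;

    public class DataStoreGarbageCollector {

        private final File storeDirectory;

        public DataStoreGarbageCollector(File storeDirectory) {
            this.storeDirectory = storeDirectory;
        }

        public void collect() {
            long scanStart = System.currentTimeMillis();

            // Mark: touch every file that is still referenced somewhere
            // in the repository so that its modification time is at
            // least scanStart. Normal access continues in parallel.
            for (File file : scanAllReferencedFiles()) {
                file.setLastModified(scanStart);
            }

            // Sweep: files not touched by the mark phase (and not
            // created after the scan started) are unreferenced garbage.
            File[] files = storeDirectory.listFiles();
            if (files != null) {
                for (File file : files) {
                    if (file.lastModified() < scanStart) {
                        file.delete();
                    }
                }
            }
        }

        private Iterable<File> scanAllReferencedFiles() {
            // Made-up placeholder: traverse all property values in the
            // repository and resolve each identifier to its file.
            throw new UnsupportedOperationException("sketch only");
        }
    }

Note that records added while the scan is running survive the sweep,
since their modification times are newer than scanStart.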

> To avoid problems with mark-and-sweep GC, I would use reference
> counting and file renaming.

The main problem I have with reference counting in this case is that
it would bind the data store into transaction handling and all related
issues. It would also introduce locking inside the data store to avoid
problems with concurrent reference changes.
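
To make that concrete, here is a minimal sketch of what reference
counting would pull into the data store; the method names are made up.
Every count change needs a lock, and the delete in removeReference()
is only safe if it is somehow tied to the fate of the enclosing
transaction:

    import java.util.HashMap;
    import java.util.Map;

    public class RefCountingDataStore {

        // Reference-count updates must be serialized, hence the
        // synchronization below; in a real repository they would also
        // have to participate in the surrounding transaction.
        private final Map<DataIdentifier, Integer> refCounts =
                new HashMap<DataIdentifier, Integer>();

        public synchronized void addReference(DataIdentifier id) {
            Integer count = refCounts.get(id);
            refCounts.put(id, count == null ? 1 : count + 1);
        }

        public synchronized void removeReference(DataIdentifier id) {
            Integer count = refCounts.get(id);
            if (count == null || count <= 1) {
                refCounts.remove(id);
                // Deleting here is only correct if the transaction
                // that dropped the last reference actually commits.
                deleteRecord(id);
            } else {
                refCounts.put(id, count - 1);
            }
        }

        private void deleteRecord(DataIdentifier id) {
            // Remove the backing file for this identifier (omitted).
        }
    }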

> I would store small items (up to 1 KB or so) like regular values,
> to avoid lots of tiny files.

That should be fairly easy to achieve with the current DataIdentifier interface:

    public class InlineDataIdentifier implements DataIdentifier {
        private final byte[] data; // the small value itself, stored inline
        public InlineDataIdentifier(byte[] data) {
            this.data = data;
        }
        ...
    }
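
The data store could then pick a representation based on size, along
the lines of the sketch below; the 1 KB threshold is the one from your
mail, and storeInFile() is a made-up placeholder:

    public class SizeThresholdDataStore {

        private static final int INLINE_THRESHOLD = 1024; // "1 KB or so"

        public DataIdentifier addRecord(byte[] value) {
            if (value.length <= INLINE_THRESHOLD) {
                // Small values live entirely inside the identifier, so
                // no file is created and none needs to be collected.
                return new InlineDataIdentifier(value);
            }
            return storeInFile(value);
        }

        private DataIdentifier storeInFile(byte[] value) {
            // Made-up placeholder: write the value to a
            // content-addressed file and return an identifier
            // holding its digest.
            throw new UnsupportedOperationException("sketch only");
        }
    }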

> In the future, why not store large Strings in the global data store as
> well? Some applications store large XML documents as Strings. However
> this is not urgent: large Strings can be stored as binary values.

I was thinking of perhaps adding isString() and getString() methods to
DataRecord for checking whether a given binary stream is valid UTF-8
and for retrieving the encoded string value in case it is. It should
be easy to do the UTF-8 validation with almost zero overhead while the
binary stream is being imported.
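
As a rough sketch of how that could look (not actual Jackrabbit code):
a filter stream can track the UTF-8 state machine byte by byte, so the
check costs little more than a branch per byte while the stream is
copied into the data store. A complete validator would additionally
reject overlong encodings and surrogates.

    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class Utf8CheckingInputStream extends FilterInputStream {

        private int pending = 0;      // continuation bytes still expected
        private boolean valid = true; // false once a malformed byte is seen

        public Utf8CheckingInputStream(InputStream in) {
            super(in);
        }

        public int read() throws IOException {
            int b = super.read();
            if (b != -1) {
                check(b);
            }
            return b;
        }

        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            for (int i = 0; i < n; i++) {
                check(buf[off + i] & 0xFF);
            }
            return n;
        }

        private void check(int b) {
            if (!valid) {
                return;
            }
            if (pending > 0) {
                if ((b & 0xC0) == 0x80) {
                    pending--;      // expected continuation byte
                } else {
                    valid = false;  // continuation byte missing
                }
            } else if ((b & 0x80) == 0) {
                // single-byte (ASCII) character, nothing to do
            } else if ((b & 0xE0) == 0xC0) {
                pending = 1;        // start of a two-byte sequence
            } else if ((b & 0xF0) == 0xE0) {
                pending = 2;        // start of a three-byte sequence
            } else if ((b & 0xF8) == 0xF0) {
                pending = 3;        // start of a four-byte sequence
            } else {
                valid = false;      // invalid leading byte
            }
        }

        public boolean isValidUtf8() {
            return valid && pending == 0;
        }
    }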

Together with the above inline mechanism we should in fact be able to
make no distinction between binary and string values in the
persistence layer.

BR,

Jukka Zitting
