Hi,

I just read about the Global Data Store proposal
(https://issues.apache.org/jira/browse/JCR-926) and I think it's a
great idea:

- Avoid blocking the engine when streaming large objects
- Avoid copying twice (first transient, then persistent store)
- Versioning: avoid multiple copies of the same object

However, I am not sure whether mark-and-sweep garbage collection (GC) is
the best solution:

- Very slow for large repositories
- Needs to stop everything while it runs
- Frees up space very late

To avoid problems with mark-and-sweep GC, I would use reference
counting and file renaming. See also
http://en.wikipedia.org/wiki/Reference_counting. Algorithm:

- While the value is still transient, the file name ends with '.0'
- When persisted, rename the file so that it ends with '.1'
- When adding a reference to an existing object (link), rename to
'.2', '.3', and so on
- When a reference is deleted (unlink), decrement the counter; delete
the file when the counter reaches '.0'
- At repository startup, delete '.0' files (transient objects after
the repository was killed)
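
The steps above could be sketched roughly as follows. This is only an
illustration in Python, not Jackrabbit code: the function names
(create_transient, link, unlink, cleanup_at_startup) are made up, it
assumes a local file system, and it ignores concurrent access and the
question of when exactly the rename should happen.

```python
import os


def split_count(path):
    """Return (base, count) for a file named '<base>.<count>'."""
    base, _, count = path.rpartition(".")
    return base, int(count)


def create_transient(directory, name, data):
    """Write a new object; the '.0' suffix marks it as transient."""
    path = os.path.join(directory, name + ".0")
    with open(path, "wb") as f:
        f.write(data)
    return path


def link(path):
    """Add a reference: '.0' -> '.1' (persist), '.1' -> '.2', and so on."""
    base, count = split_count(path)
    new_path = f"{base}.{count + 1}"
    os.rename(path, new_path)
    return new_path


def unlink(path):
    """Drop a reference; delete the file once no reference is left."""
    base, count = split_count(path)
    if count <= 1:
        os.remove(path)
        return None
    new_path = f"{base}.{count - 1}"
    os.rename(path, new_path)
    return new_path


def cleanup_at_startup(directory):
    """Delete leftover '.0' files (transient objects of a killed repository)."""
    for entry in os.listdir(directory):
        if entry.endswith(".0"):
            os.remove(os.path.join(directory, entry))
```

Each function returns the new file name, so the caller always knows the
current reference count without any extra bookkeeping.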

There are some issues to solve: should the file be renamed when the
value is added/deleted, or when the add/delete is committed?

Files should be read-only; in theory they could be changed while the
reference count is below 2.

I would store small items (up to 1 KB or so) like regular values,
to avoid lots of tiny files.
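
As a tiny sketch of that rule (Python, hypothetical names; the exact
limit of 1024 bytes is just my reading of "1 KB or so"):

```python
INLINE_LIMIT = 1024  # assumed threshold in bytes; not a fixed value


def choose_storage(data: bytes) -> str:
    """Small values stay inline; larger ones go to the global data store."""
    return "inline" if len(data) <= INLINE_LIMIT else "data-store"
```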

I wrote 'files' above; I know this could be something else (database,
Amazon S3). Reference counts could be kept somewhere else if you don't
like renaming files (I like it), or if renaming is not possible.
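
If the counts are kept somewhere else, it could look roughly like this.
Again only a sketch: it uses an in-memory SQLite table through Python's
sqlite3 module as a stand-in for whatever store is available, and the
table and function names (refs, add_reference, remove_reference) are
invented for the example.

```python
import sqlite3


def open_counts():
    """Create an in-memory table that maps object ids to reference counts."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE refs (id TEXT PRIMARY KEY, count INTEGER NOT NULL)"
    )
    return db


def add_reference(db, object_id):
    """Increment the count, creating the row on first use."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR IGNORE INTO refs (id, count) VALUES (?, 0)",
            (object_id,),
        )
        db.execute(
            "UPDATE refs SET count = count + 1 WHERE id = ?", (object_id,)
        )


def remove_reference(db, object_id):
    """Decrement the count; return True if the object can now be deleted."""
    with db:
        db.execute(
            "UPDATE refs SET count = count - 1 WHERE id = ?", (object_id,)
        )
        row = db.execute(
            "SELECT count FROM refs WHERE id = ?", (object_id,)
        ).fetchone()
        if row is not None and row[0] <= 0:
            db.execute("DELETE FROM refs WHERE id = ?", (object_id,))
            return True
    return False
```

The caller would delete the actual object (file, database row, S3
object) whenever remove_reference returns True.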

In the future, why not store large Strings in the global data store as
well? Some applications store large XML documents as Strings. However,
this is not urgent: large Strings can be stored as binary values.

Thomas
