Hi, I just read about the Global Data Store proposal (https://issues.apache.org/jira/browse/JCR-926) and I think it's a great idea:
- Avoid blocking the engine when streaming large objects
- Avoid copying twice (first transient, then persistent store)
- Versioning: avoid multiple copies of the same object

However, I am not sure if mark-and-sweep garbage collection (GC) is the best solution:

- Very slow for large repositories
- Need to stop everything
- Frees up space very late

To avoid problems with mark-and-sweep GC, I would use reference counting and file renaming. See also http://en.wikipedia.org/wiki/Reference_counting.

Algorithm:

- While the value is still transient, the file name ends with '.0'
- When persisted, rename the file ('.1')
- When adding a reference to an existing object (link), rename to '.2', '.3', and so on
- When a reference is deleted (unlink), decrement the counter; delete the file if '.0'
- At repository startup, delete '.0' files (transient objects left over after the repository was killed)

There are some issues to solve: should the file be renamed when the value is added/deleted, or when the add/delete is committed? Files should be read-only; in theory they could be changed while the reference count is below 2.

I would store small items (up to 1 KB or so) like regular values, to avoid lots of tiny files.

I wrote 'files' above; I know this could be something else (database, Amazon S3). Reference counts could be kept somewhere else if you don't like renaming files (I like it), or if renaming is not possible.

In the future, why not store large Strings in the global data store as well? Some applications store large XML documents as Strings. However, this is not urgent: large Strings can be stored as binary values.

Thomas
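
P.S. To make the renaming scheme concrete, here is a rough sketch in Java. It only illustrates the idea and makes several assumptions: the store is a local file system that supports atomic renames within a directory, only one thread touches a given file at a time, and the class and method names (RefCountedFiles, addReference, removeReference, deleteTransient) are made up for this example and are not part of Jackrabbit. It also leaves open whether the rename happens on add/delete or on commit.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: the reference count is the numeric suffix of the file name. */
public class RefCountedFiles {

    /** Reads the reference count from a name such as "abc.3". */
    static int count(Path file) {
        String name = file.getFileName().toString();
        return Integer.parseInt(name.substring(name.lastIndexOf('.') + 1));
    }

    /** Returns the sibling path that carries the given reference count. */
    static Path withCount(Path file, int count) {
        String name = file.getFileName().toString();
        String base = name.substring(0, name.lastIndexOf('.'));
        return file.resolveSibling(base + "." + count);
    }

    /** Persist ('.0' to '.1') or link ('.n' to '.n+1'): rename with the count incremented. */
    static Path addReference(Path file) throws IOException {
        return Files.move(file, withCount(file, count(file) + 1),
                StandardCopyOption.ATOMIC_MOVE);
    }

    /** Unlink: decrement the count; delete the file once no reference is left (returns null then). */
    static Path removeReference(Path file) throws IOException {
        int remaining = count(file) - 1;
        if (remaining <= 0) {
            Files.delete(file);
            return null;
        }
        return Files.move(file, withCount(file, remaining),
                StandardCopyOption.ATOMIC_MOVE);
    }

    /** Startup cleanup: delete leftover transient ('.0') files in the given directory. */
    static int deleteTransient(Path dir) throws IOException {
        int deleted = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.0")) {
            for (Path f : files) {
                Files.delete(f);
                deleted++;
            }
        }
        return deleted;
    }
}
```

Each operation returns the new path, so the caller has to keep track of the current name. That is one cost of storing the count in the file name rather than somewhere else.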