[ https://issues.apache.org/jira/browse/JCR-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Mueller updated JCR-926:
-------------------------------

    Attachment: dataStore4.zip

This patch contains a background garbage collection implementation. The algorithm is (sketches of the scan step and the listener follow at the end of this comment):

- Remember when the scan started.
- Start an EventListener.
- Tell the DataStore to update the last modification date of files that are read (usually only storing a file, or adding a link to an existing file, updates the modification date, but during garbage collection reading does too).
- Recursively iterate through all nodes.
- If a node contains binary properties, start reading them, but close the input stream immediately. This updates the modification date.
- If new nodes are added, the EventListener does the same (recurses through all added nodes). Actually it would only be required to scan the moved nodes, but I am not sure how to do that.
- The application needs to call 'scan' for each workspace (this is not done yet; I am not sure how to get the list of workspaces).
- When the scan is done, wait one second for the EventListener to catch up. How long do we have to wait for the observation listeners? Is there a way to 'force' Jackrabbit to call the observation listeners?
- Then delete all data records that were not modified since the GC scan started.

To test the garbage collection, there is also a simple application (BlobGCTest.java). It is not yet a unit test, but a standalone application. It creates a few nodes:

    /node1
    /node2
    /node2/nodeWithBlob
    /node2/nodeWithTemporaryBlob

Then it deletes nodeWithTemporaryBlob; the file is still in the data store afterwards. Then the garbage collection is started. While the scan is running, after node1 was scanned but before node2, /node2/nodeWithBlob is moved to /node1/nodeWithBlob. Normally the garbage collection would not notice this (as the scan was already past node1), but because of the EventListener it scans the moved node as well (usually at the very end). The output is:

    scanning...
    scanned: /node1
    moved /node2/nodeWithBlob to /node1
    scanned: /node2
    identifiers:
    17ec4a160f44f9467b4204aa20e5981d9508c4df
    74b5b1b26f806661292b9add2e78f671cf06f432
    stop scanning...
    scanned: /node1/nodeWithBlob
    deleting...
    identifiers:
    17ec4a160f44f9467b4204aa20e5981d9508c4df

This is a patch for revision 553213 (the revision number is in the patch as well).

To delete files early in the garbage collection scan, we could do this (a sketch follows at the end of this comment):

A) If garbage collection was run before, see if there is a file with the list of UUIDs ('uuids.txt').
B) If yes, and if the checksum is OK, read all those nodes first (if there are not too many). This updates the modified date of all old files that are still in use. Afterwards, delete all files with a modification date older than the last scan. Newer files, and files that have been read, have a newer modification date.
C) Delete the 'uuids.txt' file (in any case).
D) Iterate (recurse) through all nodes and properties as before. If a node has a binary property, store the UUID of the node in the file ('uuids.txt'). Also store the time when the scan started.
E) Checksum and close the file.
F) As before, delete files with a modification date older than this scan.

We can't use node paths for this; UUIDs are required because nodes could be moved around.
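For illustration, here is a minimal sketch of the scan step. Only the standard javax.jcr API is assumed; the method name 'scan' and the error handling are mine, not taken from the patch:

    import java.io.IOException;
    import java.io.InputStream;
    import javax.jcr.Node;
    import javax.jcr.NodeIterator;
    import javax.jcr.Property;
    import javax.jcr.PropertyIterator;
    import javax.jcr.PropertyType;
    import javax.jcr.RepositoryException;
    import javax.jcr.Value;

    // Recursively visit all nodes; for each binary property, open and
    // immediately close the stream, which makes the DataStore update the
    // last modification date of the underlying file.
    void scan(Node node) throws RepositoryException, IOException {
        for (PropertyIterator it = node.getProperties(); it.hasNext();) {
            Property p = it.nextProperty();
            if (p.getType() == PropertyType.BINARY) {
                if (p.getDefinition().isMultiple()) {
                    for (Value v : p.getValues()) {
                        InputStream in = v.getStream();
                        in.close(); // touch each value
                    }
                } else {
                    InputStream in = p.getStream();
                    in.close(); // touch the data record
                }
            }
        }
        for (NodeIterator it = node.getNodes(); it.hasNext();) {
            scan(it.nextNode());
        }
    }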
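The EventListener registration could look like this (again only the standard javax.jcr.observation API; how the patch actually wires it up may differ):

    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.observation.Event;
    import javax.jcr.observation.EventIterator;
    import javax.jcr.observation.EventListener;
    import javax.jcr.observation.ObservationManager;

    void listenForAddedNodes(final Session session) throws RepositoryException {
        ObservationManager om = session.getWorkspace().getObservationManager();
        om.addEventListener(new EventListener() {
            public void onEvent(EventIterator events) {
                while (events.hasNext()) {
                    try {
                        // A move is reported as NODE_ADDED at the new
                        // location, so rescanning added subtrees also
                        // covers moved nodes.
                        scan((Node) session.getItem(events.nextEvent().getPath()));
                    } catch (Exception e) {
                        // the node may already have been removed again; ignore
                    }
                }
            }
        }, Event.NODE_ADDED, "/", true, null, null, false);
    }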
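And a sketch of the proposed steps A), B), D) and E). The file format, the checksum handling, and the helper names are assumptions (this part is not implemented yet); scan() is reused from the first sketch:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.zip.CRC32;
    import javax.jcr.ItemNotFoundException;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    // D) and E): write one node UUID per line, followed by a CRC32
    // checksum line, so that a file truncated by a crash is detected.
    void writeUuidList(File file, String[] uuids) throws IOException {
        CRC32 crc = new CRC32();
        PrintWriter out = new PrintWriter(new FileWriter(file));
        try {
            for (String uuid : uuids) {
                out.println(uuid);
                crc.update(uuid.getBytes("UTF-8"));
            }
            out.println("checksum:" + Long.toHexString(crc.getValue()));
        } finally {
            out.close();
        }
    }

    // A) and B): on the next run, touch the previously recorded nodes
    // first, so data records still in use get a fresh modification date
    // before the early delete (checksum verification omitted for brevity).
    void touchOldNodes(Session session, File file)
            throws IOException, RepositoryException {
        BufferedReader in = new BufferedReader(new FileReader(file));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("checksum:")) {
                    break;
                }
                try {
                    scan(session.getNodeByUUID(line));
                } catch (ItemNotFoundException e) {
                    // the node was deleted since the last run; skip it
                }
            }
        } finally {
            in.close();
        }
    }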
> Global data store for binaries
> ------------------------------
>
>                 Key: JCR-926
>                 URL: https://issues.apache.org/jira/browse/JCR-926
>             Project: Jackrabbit
>          Issue Type: New Feature
>          Components: core
>            Reporter: Jukka Zitting
>         Attachments: dataStore.patch, DataStore.patch, DataStore2.patch, dataStore3.patch, dataStore4.zip, internalValue.patch, ReadWhileSaveTest.patch
>
>
> There are three main problems with the way Jackrabbit currently handles large binary values:
> 1) Persisting a large binary value blocks access to the persistence layer for extended amounts of time (see JCR-314)
> 2) At least two copies of binary streams are made when saving them through the JCR API: one in the transient space, and one when persisting the value
> 3) Versioning and copy operations on nodes or subtrees that contain large binary values can quickly end up consuming excessive amounts of storage space.
> To solve these issues (and to get other nice benefits), I propose that we implement a global "data store" concept in the repository. A data store is an append-only set of binary values that uses short identifiers to identify and access the stored binary values. The data store would trivially fit the requirements of transient space and transaction handling due to the append-only nature. An explicit mark-and-sweep garbage collection process could be added to avoid concerns about storing garbage values.
> See the recent NGP value record discussion, especially [1], for more background on this idea.
> [1] http://mail-archives.apache.org/mod_mbox/jackrabbit-dev/200705.mbox/[EMAIL PROTECTED]

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.