[ https://issues.apache.org/jira/browse/JCR-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Mueller updated JCR-926:
-------------------------------

    Attachment: dataStore.patch

Hi,

This is a refactoring patch for GlobalDataStore. The patch introduces DataStore (almost) wherever it is required, but the behavior is not yet changed (the data store is disabled).

This patch may break backwards compatibility.

NodeImpl.internalCopyPropertyFrom: Never used, removed.
ItemStateBinding.readState and writeState: Never used, removed.
Deprecated classes org.apache.jackrabbit.core.state.PMContext and org.apache.jackrabbit.core.state.util.Serializer: Removed. Adding a parameter would break backwards compatibility anyway.

The parameter 'DataStore store' was added to many constructors and methods. I don't like it. Would there be a better way to do it? Idea: create a new class 'RepositoryContext' with getNodeTypeRegistry(), maybe getNamespaceResolver(), getNamespaceRegistry(), and getDataStore(). Pass this object where appropriate.

Sometimes BLOBs are used only for a short time. I renamed the method create(InputStream in) to createTemporary.

BLOBFileValue is now an abstract class. The original implementation was renamed to 'BLOBFileValueOld'. This is only a temporary class (until DataStore is done). There is also BLOBFileValueMemory for very small binary properties (a few hundred bytes), but it is currently not used.

The DataStore parameter is still missing in InternalValue.valueOf (this method is never called for BINARY types); this will be changed.

InternalValue: BOOLEAN_TRUE and BOOLEAN_FALSE are fixed now.

A few notes about the FileDataStore implementation: I didn't change Jukka's implementation so far, but I have a few ideas:

Currently all files are stored in the same directory. However, this is a problem for Windows XP (and maybe other file systems). I would limit the number of files in the data store root directory to 1024. Afterwards, create subdirectories data1024-2047, data2048-3071,... with 1024 files each. When required, FileDataStore reads the directory list.
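
The directory layout proposed above could be sketched roughly as follows. This is only an illustration, not code from the patch: the class name DataDirLayout, the method directoryFor, and the assumption that every data file has a sequential number starting at 0 are all hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of dataStore.patch.
class DataDirLayout {

    // Number of files per directory, as suggested in the text.
    static final int FILES_PER_DIR = 1024;

    // Maps a sequential file number (an assumption of this sketch) to the
    // directory that would hold it: the root for 0..1023, then
    // data1024-2047, data2048-3071, and so on.
    static Path directoryFor(Path root, long fileNumber) {
        if (fileNumber < 0) {
            throw new IllegalArgumentException("fileNumber must be >= 0");
        }
        long bucket = fileNumber / FILES_PER_DIR;
        if (bucket == 0) {
            return root; // the first 1024 files stay in the root directory
        }
        long first = bucket * FILES_PER_DIR;
        long last = first + FILES_PER_DIR - 1;
        return root.resolve("data" + first + "-" + last);
    }
}
```

With this sketch, directoryFor(root, 2500) would resolve to the data2048-3071 subdirectory of root, so no directory scan is needed to locate a file once its number is known.
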
If faster, one index file per directory could be created.

The file name is currently the SHA-1 digest. I suggest using SHA-256 (unless it is a lot slower or not available on some systems). Yes, you can call me paranoid. SHA-1 could be broken in a few years.

As the file name, I would use: <id>-<digest>.data. As the DataIdentifier, use <id>-<digest>. This would speed up finding files when reading, as (id / 1024) is the directory (direct lookup). This would also allow bundling data files in tar files.

Tar file support would be priority 2. I would only bundle very small (< 4 KB) files in tar files anyway. Priority 3 would be compression (mainly for text data).

There is no garbage collection at this time. This still needs to be implemented.

Thomas

> Global data store for binaries
> ------------------------------
>
>                 Key: JCR-926
>                 URL: https://issues.apache.org/jira/browse/JCR-926
>             Project: Jackrabbit
>          Issue Type: New Feature
>          Components: core
>            Reporter: Jukka Zitting
>         Attachments: dataStore.patch, DataStore.patch, DataStore2.patch,
> internalValue.patch, ReadWhileSaveTest.patch
>
>
> There are three main problems with the way Jackrabbit currently handles large
> binary values:
> 1) Persisting a large binary value blocks access to the persistence layer for
> extended amounts of time (see JCR-314)
> 2) At least two copies of binary streams are made when saving them through
> the JCR API: one in the transient space, and one when persisting the value
> 3) Versioning and copy operations on nodes or subtrees that contain large
> binary values can quickly end up consuming excessive amounts of storage space.
> To solve these issues (and to get other nice benefits), I propose that we
> implement a global "data store" concept in the repository. A data store is an
> append-only set of binary values that uses short identifiers to identify and
> access the stored binary values.
> The data store would trivially fit the
> requirements of transient space and transaction handling due to the
> append-only nature. An explicit mark-and-sweep garbage collection process
> could be added to avoid concerns about storing garbage values.
> See the recent NGP value record discussion, especially [1], for more
> background on this idea.
> [1]
> http://mail-archives.apache.org/mod_mbox/jackrabbit-dev/200705.mbox/[EMAIL PROTECTED]

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.