[jira] Updated: (JCR-926) Global data store for binaries

Thomas Mueller (JIRA) Thu, 28 Jun 2007 03:45:47 -0700

     [ 
https://issues.apache.org/jira/browse/JCR-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Thomas Mueller updated JCR-926:
-------------------------------

    Attachment: dataStore.patch

Hi,

This is a refactoring patch for GlobalDataStore. The patch introduces DataStore 
(almost) wherever it is required, but the behavior is not yet changed (the data 
store is disabled). This patch may break backwards compatibility.

NodeImpl.internalCopyPropertyFrom: Never used, removed.

ItemStateBinding.readState and writeState: Never used, removed.

Deprecated classes org.apache.jackrabbit.core.state.PMContext and 
org.apache.jackrabbit.core.state.util.Serializer: Removed. Adding a parameter 
would break backwards compatibility anyway.

The parameter 'DataStore store' was added to many constructors and methods. I 
don't like it. Would there be a better way to do it? Idea: create a new class 
'RepositoryContext' with getNodeTypeRegistry(), maybe getNamespaceResolver(), 
getNamespaceRegistry(), and getDataStore(). Pass this object where appropriate.
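
A minimal sketch of what such a context object might look like. The four 
getter names come from the idea above; everything else (the constructor, the 
empty placeholder interfaces standing in for the real Jackrabbit types, and 
the ExampleConsumer class) is hypothetical and only there to keep the example 
self-contained.

```java
// Hypothetical placeholders standing in for the real Jackrabbit types.
interface NodeTypeRegistry {}
interface NamespaceResolver {}
interface NamespaceRegistry {}
interface DataStore {}

/** Sketch of the proposed context object: one immutable holder for shared services. */
final class RepositoryContext {
    private final NodeTypeRegistry nodeTypeRegistry;
    private final NamespaceResolver namespaceResolver;
    private final NamespaceRegistry namespaceRegistry;
    private final DataStore dataStore;

    RepositoryContext(NodeTypeRegistry nodeTypeRegistry,
                      NamespaceResolver namespaceResolver,
                      NamespaceRegistry namespaceRegistry,
                      DataStore dataStore) {
        this.nodeTypeRegistry = nodeTypeRegistry;
        this.namespaceResolver = namespaceResolver;
        this.namespaceRegistry = namespaceRegistry;
        this.dataStore = dataStore;
    }

    NodeTypeRegistry getNodeTypeRegistry() { return nodeTypeRegistry; }
    NamespaceResolver getNamespaceResolver() { return namespaceResolver; }
    NamespaceRegistry getNamespaceRegistry() { return namespaceRegistry; }
    DataStore getDataStore() { return dataStore; }
}

/** Hypothetical consumer: takes the context instead of a separate 'DataStore store' parameter. */
final class ExampleConsumer {
    private final RepositoryContext context;

    ExampleConsumer(RepositoryContext context) {
        this.context = context;
    }

    DataStore dataStore() {
        return context.getDataStore();
    }
}
```

With this shape, adding another shared service later would change only 
RepositoryContext, not every constructor that receives it.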

Sometimes BLOBs are used only for a short time. I renamed the method 
create(InputStream in) to createTemporary.

BLOBFileValue is now an abstract class. The original implementation was renamed 
to 'BLOBFileValueOld'. This is only a temporary class (until DataStore is 
done). There is also BLOBFileValueMemory for very small binary properties (a 
few hundred bytes), but it is currently not used.

The DataStore parameter is still missing in InternalValue.valueOf (this method 
is never called for BINARY types); this will be changed.

InternalValue: BOOLEAN_TRUE and BOOLEAN_FALSE are fixed now.



A few notes about the FileDataStore implementation:

I didn't change Jukka's implementation so far, but I have a few ideas:

Currently all files are stored in the same directory. However, this is a problem 
for Windows XP (and maybe other file systems). I would limit the number of 
files in the data store root directory to 1024. Afterwards, create 
subdirectories data1024-2047, data2048-3071,... with 1024 files each. When 
required, FileDataStore reads the directory list. If faster, one index file per 
directory could be created. 

The file name is currently the SHA-1 digest. I suggest using SHA-256 (unless 
it is a lot slower or not available on some systems). Yes, you can call me 
paranoid. SHA-1 could be broken in a few years.
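
To make the comparison concrete, here is a small sketch that computes a hex 
digest of a stream with the standard java.security.MessageDigest API. The 
class and method names (DigestSketch, hexDigest) are made up for this example 
and are not part of the patch; switching between "SHA-1" and "SHA-256" is 
only a change of the algorithm name, and the resulting file name grows from 
40 to 64 hex characters.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper, not part of the patch. */
final class DigestSketch {

    /** Reads the stream to the end and returns its digest as lowercase hex. */
    static String hexDigest(InputStream in, String algorithm) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // The "not available on some systems" case mentioned above.
            throw new IllegalStateException("Digest not available: " + algorithm, e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether SHA-256 is noticeably slower would have to be measured on the target 
systems; this sketch does not answer that.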

As the file name, I would use: <id>-<digest>.data. As the DataIdentifier, use 
<id>-<digest>. This would speed up finding files when reading, as (id / 1024) 
is the directory (direct lookup). Also, this would allow bundling data files in 
tar files. Tar file support would be priority 2. I would only bundle very small 
(< 4 KB) files in tar files anyway. Priority 3 would be compression (for text 
data mainly).
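
A rough sketch of the proposed naming and lookup scheme, under a few 
assumptions of mine: ids are non-negative and start at 0, the first 1024 
files live in the data store root (represented here by an empty directory 
name), and the id ends at the first '-' in the identifier. The class and 
method names are hypothetical.

```java
/** Hypothetical helper illustrating the <id>-<digest> layout; not part of the patch. */
final class DataFileLayout {
    static final int FILES_PER_DIRECTORY = 1024;

    /** Builds the DataIdentifier string "<id>-<digest>". */
    static String identifier(long id, String digest) {
        return id + "-" + digest;
    }

    /** Builds the file name "<id>-<digest>.data". */
    static String fileName(String identifier) {
        return identifier + ".data";
    }

    /** Extracts the numeric id in front of the first '-'. */
    static long idOf(String identifier) {
        int dash = identifier.indexOf('-');
        if (dash <= 0) {
            throw new IllegalArgumentException("Not of the form <id>-<digest>: " + identifier);
        }
        return Long.parseLong(identifier.substring(0, dash));
    }

    /**
     * Direct lookup of the directory via id / 1024.
     * Returns "" for the root directory (assumed to hold ids 0..1023),
     * otherwise a name such as "data1024-2047".
     */
    static String directoryOf(String identifier) {
        long index = idOf(identifier) / FILES_PER_DIRECTORY;
        if (index == 0) {
            return "";
        }
        long first = index * FILES_PER_DIRECTORY;
        return "data" + first + "-" + (first + FILES_PER_DIRECTORY - 1);
    }
}
```

Because the directory follows from the id alone, a reader would not need to 
list directories or consult an index file to locate a data file.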

There is no garbage collection at this time. This still needs to be implemented.

Thomas


> Global data store for binaries
> ------------------------------
>
>                 Key: JCR-926
>                 URL: https://issues.apache.org/jira/browse/JCR-926
>             Project: Jackrabbit
>          Issue Type: New Feature
>          Components: core
>            Reporter: Jukka Zitting
>         Attachments: dataStore.patch, DataStore.patch, DataStore2.patch, 
> internalValue.patch, ReadWhileSaveTest.patch
>
>
> There are three main problems with the way Jackrabbit currently handles large 
> binary values:
> 1) Persisting a large binary value blocks access to the persistence layer for 
> extended amounts of time (see JCR-314)
> 2) At least two copies of binary streams are made when saving them through 
> the JCR API: one in the transient space, and one when persisting the value
> 3) Versioning and copy operations on nodes or subtrees that contain large 
> binary values can quickly end up consuming excessive amounts of storage space.
> To solve these issues (and to get other nice benefits), I propose that we 
> implement a global "data store" concept in the repository. A data store is an 
> append-only set of binary values that uses short identifiers to identify and 
> access the stored binary values. The data store would trivially fit the 
> requirements of transient space and transaction handling due to the 
> append-only nature. An explicit mark-and-sweep garbage collection process 
> could be added to avoid concerns about storing garbage values.
> See the recent NGP value record discussion, especially [1], for more 
> background on this idea.
> [1] 
> http://mail-archives.apache.org/mod_mbox/jackrabbit-dev/200705.mbox/[EMAIL 
> PROTECTED]

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
