[ https://issues.apache.org/jira/browse/OAK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906356#comment-14906356 ]
Thomas Mueller commented on OAK-2808:
-------------------------------------

http://svn.apache.org/r1705054 (trunk), partial implementation. Remarks:

* Disabled by default (DEFAULT_ACTIVE_DELETE = -1).
* Deletion of entries is not yet implemented; this would probably have to be done in oak-core.
* To enable, set the property "activeDelete" on an index of type "lucene" to, for example, 3600 (1 hour).
* For manual testing, I removed the "async" = "async" flag from the index - this creates garbage quickly.
* A new child node ":trash" is created, with child nodes "run_1", "run_2", ... and a property "index" holding the next id.
* For each file, a new property "uniqueKey" (16 bytes) is created, and that key is appended to the file (it is ignored when reading). The block size per binary therefore increases by 16 bytes.

I wonder if, for MongoDB, it would be better to use a block size that is 1024 bytes smaller, as we do for the MongoBlobStore, because MongoDB rounds the space allocated for a record up to the next power of two (there is an overhead per record; let's assume it is at most 1 KB).

The missing "delete trash" feature would need to periodically (and asynchronously) read the first "run_.." entries and delete the binaries if needed. It would probably have to maintain a "deleteIndex" property, similar to the "index" property used to create new entries.

> Active deletion of 'deleted' Lucene index files from DataStore without
> relying on full scale Blob GC
> ----------------------------------------------------------------------
>
>                 Key: OAK-2808
>                 URL: https://issues.apache.org/jira/browse/OAK-2808
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Thomas Mueller
>              Labels: datastore, performance
>             Fix For: 1.3.7
>
>         Attachments: OAK-2808-1.patch, copyonread-stats.png
>
>
> With the storing of Lucene index files within the DataStore, our usage pattern of the DataStore has changed between JR2 and Oak.
> With JR2, the writes were mostly application driven, i.e. if the application stored a PDF or image file, that file would be stored in the DataStore; JR2 itself would by default not write anything to the DataStore. Further, in deployments where a large amount of binary content is present, systems tend to share the DataStore to avoid duplicating storage. In such cases running Blob GC is a non-trivial task, as it involves a manual step and coordination across multiple deployments. Due to this, systems tend to run GC less frequently.
> Now with Oak, apart from the application, the Oak system itself *actively* uses the DataStore to store the index files for Lucene, and there the churn can be much higher, i.e. the frequency of creation and deletion of index files is a lot higher. This accelerates the rate of garbage generation and thus puts a lot more pressure on the DataStore storage requirements.
> Discussion thread: http://markmail.org/thread/iybd3eq2bh372zrl

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
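The ":trash" bookkeeping described in the comment above can be sketched roughly as follows. This is a simplified, self-contained sketch, not Oak's actual implementation: the names TrashList, Run, addRun, and purgeOlderThan are hypothetical, and plain Java collections stand in for Oak's node store and blob store APIs. It only illustrates the mechanism: each batch of deleted index files is recorded as a "run_<index>" entry, a counter like the "index" property hands out run ids, and an asynchronous cleanup pass deletes the binaries of runs older than the "activeDelete" interval.

```java
import java.security.SecureRandom;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the ":trash" bookkeeping; real code would store
// runs as child nodes of ":trash" and delete blobs via the blob store.
class TrashList {

    /** One "run_<n>" entry: the blob ids of index files deleted in one run. */
    static final class Run {
        final long created = System.currentTimeMillis();
        final List<String> blobIds;
        Run(List<String> blobIds) { this.blobIds = blobIds; }
    }

    private final Deque<Run> runs = new ArrayDeque<>();
    private long index = 1;            // next run id, like the "index" property
    private final long activeDeleteMillis;

    TrashList(long activeDeleteSeconds) {
        this.activeDeleteMillis = activeDeleteSeconds * 1000;
    }

    /** Record one batch of deleted index files as a new "run_<index>" entry. */
    String addRun(List<String> blobIds) {
        String name = "run_" + index++;
        runs.addLast(new Run(blobIds));
        return name;
    }

    /**
     * Asynchronous cleanup: read the oldest runs first and delete their
     * binaries once they are older than activeDelete. Returns the blob ids
     * that would be removed from the DataStore.
     */
    List<String> purgeOlderThan(long now) {
        List<String> deleted = new ArrayList<>();
        while (!runs.isEmpty()
                && now - runs.peekFirst().created >= activeDeleteMillis) {
            deleted.addAll(runs.removeFirst().blobIds);
        }
        return deleted;
    }

    /** 16-byte random key appended to each index file, ignored on read. */
    static byte[] newUniqueKey() {
        byte[] key = new byte[16];
        new SecureRandom().nextBytes(key);
        return key;
    }
}
```

The oldest-first loop in purgeOlderThan mirrors the suggestion of maintaining a "deleteIndex" alongside "index": entries are always consumed from the front, so the cleanup task never has to scan all runs.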