[ https://issues.apache.org/jira/browse/OAK-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Francesco Mari updated OAK-3536:
--------------------------------
    Fix Version/s:     (was: 1.3.9)
                   1.4

> Indexing with Lucene and copy-on-read generates too much garbage in the BlobStore
> ----------------------------------------------------------------------------------
>
>                 Key: OAK-3536
>                 URL: https://issues.apache.org/jira/browse/OAK-3536
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: lucene
>    Affects Versions: 1.3.9
>            Reporter: Francesco Mari
>            Priority: Critical
>             Fix For: 1.4
>
>
> The copy-on-read strategy used with Lucene indexing performs too many copies of the index files from the filesystem to the repository. Every copy discards the previously stored binary, which sits there as garbage until binary garbage collection kicks in. When the load on the system is particularly intense, this behaviour makes the repository grow at an unreasonably high pace.
> I spotted this on a system where some content is generated every day at a specific time. The content generation process creates approx. 6 million new nodes, where each node contains 5 properties with small random string values. Nodes were saved in batches of 1000 nodes each. At the end of the content generation process, the nodes are deleted to deliberately generate garbage in the Segment Store. This is part of a testing effort to assess the efficiency of online compaction.
> I was never able to complete the tests because the system ran out of disk space due to a large number of unused binary values. When debugging the system, on a 400 GB (full) disk, the segments containing nodes and property values occupied approx. 3 GB. The rest of the space was occupied by binary values in the form of bulk segments.
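For context, copy-on-read is enabled by handing an IndexCopier to the Lucene index provider, which then serves queries from a local filesystem copy of the index instead of reading it from the repository. A minimal sketch, assuming the oak-lucene API of the 1.3.x line (the executor and the local index directory are placeholder choices):

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.concurrent.Executors;

import org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier;
import org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexProvider;

public class CopyOnReadSetup {

    public static LuceneIndexProvider newProvider() throws IOException {
        // The copier mirrors index files between the repository and a
        // local directory. Per the report above, every index update
        // stores fresh binaries in the repository, and the superseded
        // ones linger as garbage until binary GC runs.
        IndexCopier copier = new IndexCopier(
                Executors.newFixedThreadPool(2),
                new File("/path/to/local/index"));
        return new LuceneIndexProvider(copier);
    }
}
{code}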
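The load described in the report can be approximated with plain JCR calls. A hypothetical sketch of the content generation (node names, property names and the flat hierarchy are illustrative assumptions, not the reporter's actual layout):

{code:java}
import java.util.UUID;

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class LoadGenerator {

    private static final int NODE_COUNT = 6_000_000;
    private static final int PROPERTY_COUNT = 5;
    private static final int BATCH_SIZE = 1000;

    public static void generate(Session session) throws RepositoryException {
        Node root = session.getRootNode().addNode("load-test");
        for (int i = 0; i < NODE_COUNT; i++) {
            Node node = root.addNode("node-" + i);
            for (int p = 0; p < PROPERTY_COUNT; p++) {
                // Small random string values, as in the report.
                node.setProperty("property-" + p, UUID.randomUUID().toString());
            }
            if ((i + 1) % BATCH_SIZE == 0) {
                // Persist one batch of 1000 nodes.
                session.save();
            }
        }
        session.save();
    }
}
{code}

Deleting the nodes afterwards, as the reporter does, turns this content into Segment Store garbage; the index binaries written while indexing it are, per the report, what fills the remaining disk space.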