[ 
https://issues.apache.org/jira/browse/OAK-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francesco Mari updated OAK-3536:
--------------------------------
    Fix Version/s:     (was: 1.3.9)
                   1.4

> Indexing with Lucene and copy-on-read generate too much garbage in the 
> BlobStore
> --------------------------------------------------------------------------------
>
>                 Key: OAK-3536
>                 URL: https://issues.apache.org/jira/browse/OAK-3536
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: lucene
>    Affects Versions: 1.3.9
>            Reporter: Francesco Mari
>            Priority: Critical
>             Fix For: 1.4
>
>
> The copy-on-read strategy used with Lucene indexing performs too many copies 
> of the index files from the filesystem to the repository. Every copy discards 
> the previously stored binary, which then sits there as garbage until the 
> binary garbage collection kicks in. When the load on the system is 
> particularly intense, this behaviour makes the repository grow at an 
> unreasonably high pace. 
> I spotted this on a system where some content is generated every day at a 
> specific time. The content generation process creates approx. 6 million new 
> nodes, where each node contains 5 properties with small, random string 
> values. Nodes were saved in batches of 1000 nodes each (a minimal sketch of 
> this generation pattern follows below). At the end of the content generation 
> process, the nodes are deleted to deliberately generate garbage in the 
> Segment Store. This is part of a testing effort to assess the efficiency of 
> the online compaction.
> I was never able to complete the tests because the system ran out of disk 
> space due to a large number of unused binary values. When debugging the 
> system, on a 400 GB (full) disk, the segments containing nodes and property 
> values occupied approx. 3 GB. The rest of the space was occupied by binary 
> values in the form of bulk segments.
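
A minimal sketch of the content generation pattern described above, assuming a
plain JCR session against the Oak repository (the node names, property names,
credentials, and repository handle are hypothetical, for illustration only):

    import javax.jcr.Node;
    import javax.jcr.Repository;
    import javax.jcr.Session;
    import javax.jcr.SimpleCredentials;

    import java.util.UUID;

    public class ContentGenerator {

        private static final int TOTAL_NODES = 6_000_000;
        private static final int BATCH_SIZE = 1_000;
        private static final int PROPERTIES_PER_NODE = 5;

        public static void generate(Repository repository) throws Exception {
            Session session = repository.login(
                    new SimpleCredentials("admin", "admin".toCharArray()));
            try {
                // Hypothetical parent node for the generated content.
                Node parent = session.getRootNode().addNode("generated");
                for (int i = 0; i < TOTAL_NODES; i++) {
                    Node node = parent.addNode("node-" + i);
                    for (int p = 0; p < PROPERTIES_PER_NODE; p++) {
                        // Small, random string values.
                        node.setProperty("property-" + p,
                                UUID.randomUUID().toString());
                    }
                    // Persist in batches of 1000 nodes, as described above.
                    if ((i + 1) % BATCH_SIZE == 0) {
                        session.save();
                    }
                }
                session.save();
            } finally {
                session.logout();
            }
        }
    }

On this workload the repeated index updates lead to the index file copies
described above, each of which leaves the previously stored binary behind as
garbage until the binary garbage collection runs.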



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
