[ https://issues.apache.org/jira/browse/OAK-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316407#comment-15316407 ]

Amit Jain commented on OAK-4200:
--------------------------------

[~tmueller], [~chetanm] Could you please review the changes in https://github.com/amit-jain/jackrabbit-oak/commit/a0a743467629b6695d1ed1616cf6fc85e2f6610b

> [BlobGC] Improve collection times of blobs available
> ----------------------------------------------------
>
>                 Key: OAK-4200
>                 URL: https://issues.apache.org/jira/browse/OAK-4200
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>            Reporter: Amit Jain
>            Assignee: Amit Jain
>             Fix For: 1.5.4
>
>
> The blob collection phase (identifying all the blobs available in the data
> store) is quite an expensive part of the whole GC process, sometimes taking
> a few hours on large repositories, due to iteration of the sub-folders in
> the data store.
> In an offline discussion with [~tmueller] and [~chetanm], the idea came up
> that this phase can be faster if
> * Blob ids are tracked when the blobs are added, e.g. in a simple file
> in the datastore per cluster node.
> * GC then consolidates this file from all the cluster nodes and uses it to
> get the candidates for GC.
> * This variant of the MarkSweepGC can be triggered more frequently. It would
> be ok to miss blob id additions to this file during a crash etc., as these
> blobs can be cleaned up in the *regular* MarkSweepGC cycles triggered
> occasionally.
> We may also be able to track other metadata along with the blob ids, like
> paths, timestamps etc., for auditing/analytics, in conjunction with OAK-3140.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
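
The tracking scheme proposed in the quoted description (one id file per cluster node, consolidated when GC runs) could be sketched roughly as below. This is only an illustration in Python, not the code in the linked commit or any Oak API: the file name pattern `blob-ids-<node>.txt`, the function names, and the in-memory `referenced` set are all assumptions made for the example.

```python
import tempfile
from pathlib import Path


def track_blob_id(store_dir, node_id, blob_id):
    """Append a newly added blob id to this cluster node's tracking file."""
    # Hypothetical naming scheme: one plain-text file per cluster node.
    tracking_file = Path(store_dir) / f"blob-ids-{node_id}.txt"
    with tracking_file.open("a", encoding="utf-8") as f:
        f.write(blob_id + "\n")


def consolidate(store_dir):
    """Merge the tracking files of all cluster nodes into one set of ids."""
    ids = set()
    for tracking_file in sorted(Path(store_dir).glob("blob-ids-*.txt")):
        with tracking_file.open(encoding="utf-8") as f:
            ids.update(line.strip() for line in f if line.strip())
    return ids


def gc_candidates(store_dir, referenced):
    """Tracked blobs that are no longer referenced are candidates for GC.

    Ids missing from the tracking files (e.g. lost in a crash) are simply
    not seen here; a regular full GC cycle would have to pick those up.
    """
    return consolidate(store_dir) - set(referenced)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        track_blob_id(tmp, "node1", "a1")
        track_blob_id(tmp, "node1", "b2")
        track_blob_id(tmp, "node2", "c3")
        # Assume the mark phase found that only "a1" is still referenced.
        return sorted(gc_candidates(tmp, referenced={"a1"}))
```

Calling `demo()` tracks three ids across two nodes and returns the two that are not referenced. The point of the design is that the candidate list comes from reading a few small files instead of iterating every sub-folder of the data store.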