[ 
https://issues.apache.org/jira/browse/OAK-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316407#comment-15316407
 ] 

Amit Jain commented on OAK-4200:
--------------------------------

[~tmueller], [~chetanm] Could you please review the changes done @ 
https://github.com/amit-jain/jackrabbit-oak/commit/a0a743467629b6695d1ed1616cf6fc85e2f6610b

> [BlobGC] Improve collection times of blobs available
> ----------------------------------------------------
>
>                 Key: OAK-4200
>                 URL: https://issues.apache.org/jira/browse/OAK-4200
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>            Reporter: Amit Jain
>            Assignee: Amit Jain
>             Fix For: 1.5.4
>
>
> The blob collection phase (Identifying all the blobs available in the data 
> store) is quite an expensive part of the whole GC process, taking up a few 
> hours sometimes on large repositories, due to iteration of the sub-folders in 
> the data store.
> In an offline discussion with [~tmueller] and [~chetanm], the idea came up 
> that this phase can be faster if
> *  Blobs ids are tracked when the blobs are added for e.g. in a simple file 
> in the datastore per cluster node.
> * GC then consolidates this file from all the cluster nodes and uses it to 
> get the candidates for GC.
> * This variant of the MarkSweepGC can be triggered  more frequently. It would 
> be ok to miss blob id additions to this file during a crash etc., as these 
> blobs can be cleaned up in the *regular* MarkSweepGC cycles triggered 
> occasionally.
> We also may be able to track other metadata along with the blob ids like 
> paths, timestamps etc. for auditing/analytics, in-conjunction with OAK-3140.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to