[jira] [Commented] (OAK-4200) [BlobGC] Improve collection times of blobs available

Thomas Mueller (JIRA) Thu, 16 Jun 2016 02:34:05 -0700

    [ 
https://issues.apache.org/jira/browse/OAK-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333451#comment-15333451
 ]


Thomas Mueller commented on OAK-4200:
-------------------------------------

* There are lots of "import static". I would prefer not to do that, and use 
regular imports instead.
* FileLineDifferenceIterator and so on don't seem to support strings with 
newline (and possibly other characters such as \r). This needs to be either (a) 
preferred: implemented and tested, or (b) documented, and the code needs to be 
reviewed to ensure such characters can never occur.
* String instanceId = String.valueOf(currentTimeMillis()) - it looks like this 
id needs to be globally unique. But using the current time millis is not enough 
to ensure that. A UUID needs to be used instead, or in addition, to the time.
* DataRecord input; input.getIdentifier().toString().contains(instanceId): that 
assumes instanceId is more than just a counter, otherwise it can give false 
positives. But nowhere is it obvious that instanceId _can't_ be a counter. So 
this is a hack in my view, and far too dangerous.
* "The iterator returned ia a Closeable instance". I would prefer using a new 
interface instead of relying on "instanceof" and casting.
* globalMerge deletes metadata records. I'm not sure what those metdata records 
are (documentation is not clear), and probably because of that don't understand 
why they need to be removed in a method called "merge".
* Snapshot interval: I guess if datastore GC is very fast, we would like to 
call it often (maybe every hour or so), and expect binaries are removed 
quickly. That in mind, DEFAULT_BLOB_SNAPSHOT_INTERVAL every 12 hours seems to 
be too long to me. I would probably use 1 hour or so.
* I'm not sure if snapshot is needed. I would probably just switch to the next 
generation once per hour. Merging should be able to merge in-process files as 
well. Only old generation files can be deleted. Merging I would do into a new 
temp file, not to the current file, so that instanceId comparisons are not 
needed.
* static class Store: we have many "Store" classes now, I would prefer a 
clearer name (BlobReferenceStore for example).
* merge(List<File> refFiles, boolean doSort) clears the list is a side effect. 
Why? In any case this is not obvious.
* fileTreeTraverser().breadthFirstTraversal - not sure where and why it would 
process subdirectories. Couldn't this include all binaries, which would be slow 
for the FileDataStore?
* removeRecords: forceDelete(recs) is done in a finally block, meaning even if 
there were exceptions before that. Probably a bug. In general, the code doesn't 
seem to be tested for cases where exceptions occur (disk full, out of memory, 
file systems that sometimes throw timeout exceptions like NFS). This is not 
robust. We need to test for those cases.
* Typo: "Deleting blcloob"

> [BlobGC] Improve collection times of blobs available
> ----------------------------------------------------
>
>                 Key: OAK-4200
>                 URL: https://issues.apache.org/jira/browse/OAK-4200
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>            Reporter: Amit Jain
>            Assignee: Amit Jain
>             Fix For: 1.6
>
>
> The blob collection phase (Identifying all the blobs available in the data 
> store) is quite an expensive part of the whole GC process, taking up a few 
> hours sometimes on large repositories, due to iteration of the sub-folders in 
> the data store.
> In an offline discussion with [~tmueller] and [~chetanm], the idea came up 
> that this phase can be faster if
> *  Blobs ids are tracked when the blobs are added for e.g. in a simple file 
> in the datastore per cluster node.
> * GC then consolidates this file from all the cluster nodes and uses it to 
> get the candidates for GC.
> * This variant of the MarkSweepGC can be triggered  more frequently. It would 
> be ok to miss blob id additions to this file during a crash etc., as these 
> blobs can be cleaned up in the *regular* MarkSweepGC cycles triggered 
> occasionally.
> We also may be able to track other metadata along with the blob ids like 
> paths, timestamps etc. for auditing/analytics, in-conjunction with OAK-3140.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (OAK-4200) [BlobGC] Improve collection times of blobs available

Reply via email to