[ https://issues.apache.org/jira/browse/OAK-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333451#comment-15333451 ]
Thomas Mueller commented on OAK-4200: ------------------------------------- * There are lots of "import static". I would prefer not to do that, and use regular imports instead. * FileLineDifferenceIterator and so on don't seem to support strings with newline (and possibly other characters such as \r). This needs to be either (a) preferred: implemented and tested, or (b) documented, and the code needs to be reviewed to ensure such characters can never occur. * String instanceId = String.valueOf(currentTimeMillis()) - it looks like this id needs to be globally unique. But using the current time millis is not enough to ensure that. A UUID needs to be used instead, or in addition, to the time. * DataRecord input; input.getIdentifier().toString().contains(instanceId): that assumes instanceId is more than just a counter, otherwise it can give false positives. But nowhere is it obvious that instanceId _can't_ be a counter. So this is a hack in my view, and far too dangerous. * "The iterator returned ia a Closeable instance". I would prefer using a new interface instead of relying on "instanceof" and casting. * globalMerge deletes metadata records. I'm not sure what those metdata records are (documentation is not clear), and probably because of that don't understand why they need to be removed in a method called "merge". * Snapshot interval: I guess if datastore GC is very fast, we would like to call it often (maybe every hour or so), and expect binaries are removed quickly. That in mind, DEFAULT_BLOB_SNAPSHOT_INTERVAL every 12 hours seems to be too long to me. I would probably use 1 hour or so. * I'm not sure if snapshot is needed. I would probably just switch to the next generation once per hour. Merging should be able to merge in-process files as well. Only old generation files can be deleted. Merging I would do into a new temp file, not to the current file, so that instanceId comparisons are not needed. * static class Store: we have many "Store" classes now, I would prefer a clearer name (BlobReferenceStore for example). * merge(List<File> refFiles, boolean doSort) clears the list is a side effect. Why? In any case this is not obvious. * fileTreeTraverser().breadthFirstTraversal - not sure where and why it would process subdirectories. Couldn't this include all binaries, which would be slow for the FileDataStore? * removeRecords: forceDelete(recs) is done in a finally block, meaning even if there were exceptions before that. Probably a bug. In general, the code doesn't seem to be tested for cases where exceptions occur (disk full, out of memory, file systems that sometimes throw timeout exceptions like NFS). This is not robust. We need to test for those cases. * Typo: "Deleting blcloob" > [BlobGC] Improve collection times of blobs available > ---------------------------------------------------- > > Key: OAK-4200 > URL: https://issues.apache.org/jira/browse/OAK-4200 > Project: Jackrabbit Oak > Issue Type: Improvement > Reporter: Amit Jain > Assignee: Amit Jain > Fix For: 1.6 > > > The blob collection phase (Identifying all the blobs available in the data > store) is quite an expensive part of the whole GC process, taking up a few > hours sometimes on large repositories, due to iteration of the sub-folders in > the data store. > In an offline discussion with [~tmueller] and [~chetanm], the idea came up > that this phase can be faster if > * Blobs ids are tracked when the blobs are added for e.g. in a simple file > in the datastore per cluster node. > * GC then consolidates this file from all the cluster nodes and uses it to > get the candidates for GC. > * This variant of the MarkSweepGC can be triggered more frequently. It would > be ok to miss blob id additions to this file during a crash etc., as these > blobs can be cleaned up in the *regular* MarkSweepGC cycles triggered > occasionally. > We also may be able to track other metadata along with the blob ids like > paths, timestamps etc. for auditing/analytics, in-conjunction with OAK-3140. -- This message was sent by Atlassian JIRA (v6.3.4#6332)