[ https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15904841#comment-15904841 ]
Stefan Eissing commented on OAK-4780:
-------------------------------------

Result from a run of the current version against a large customer database:

h4. Run Timings
{code}
Started:  2017-03-09 18:39:22.787
Ended:    2017-03-10 06:56:01.721
Duration: 12.28 hours
Stats: VersionGCStats{ignoredGCDueToCheckPoint=true, canceled=true, deletedDocGCCount=2011642 (of which leaf: 687880), updateResurrectedGCCount=25229600, splitDocGCCount=92, intermediateSplitDocGCCount=0, iterationCount=60, timeActive=12,28 h, timeToCollectDeletedDocs=0,000 ns, timeToCheckDeletedDocs=0,000 ns, timeToSortDocIds=0,000 ns, timeTakenToUpdateResurrectedDocs=0,000 ns, timeTakenToDeleteDeletedDocs=0,000 ns, timeTakenToCollectAndDeleteSplitDocs=0,000 ns}
{code}
_(stopwatches are not accumulative over runs at the moment, so they show 0 for the total)_

The RGC run was initiated by:
{code}
java -jar target/oak-run-1.8-SNAPSHOT.jar revisions mongodb://localhost/<dbname> collect
{code}

It performed 60 iterations over almost 12 hours (the database has ~250 million nodes, of which 11% had {{_deletedOnce}} set), starting with:
{code}
18:39:22.789 INFO start 1. run (avg duration 0.0 sec)
18:39:22.792 DEBUG No lastOldestTimestamp found, querying for the oldest deletedOnce candidate
18:39:37.861 DEBUG lastOldestTimestamp found: 2016-10-20 17:47:49.999
18:39:50.491 DEBUG deletedOnce candidates: 27241631 found, 95000 preferred, scope now [2016-10-20 17:47:49.999, 2016-10-21 05:31:15.835]
...
18:39:54.469 DEBUG successful run using 0.0% of limit, raising recommended interval to 63308 seconds
18:39:54.471 INFO Revision garbage collection finished in 3,973 s.
VersionGCStats{ignoredGCDueToCheckPoint=false, canceled=false, deletedDocGCCount=0 (of which leaf: 0), updateResurrectedGCCount=9, splitDocGCCount=0, intermediateSplitDocGCCount=0, iterationCount=0, timeActive=31,68 s, timeToCollectDeletedDocs=3,935 s, timeToCheckDeletedDocs=2,953 ms, timeToSortDocIds=0,000 ns, timeTakenToUpdateResurrectedDocs=19,46 ms, timeTakenToDeleteDeletedDocs=0,000 ns, timeTakenToCollectAndDeleteSplitDocs=13,27 ms}
{code}
until the last one:
{code}
06:56:01.717 INFO start 60. run (avg duration 749.118 sec)
06:56:01.717 DEBUG previous runs recommend a 68783 sec duration, scope now [2017-01-29 21:09:00.734, 2017-01-30 16:15:24.416]
06:56:01.717 WARN Ignoring RGC run because a valid checkpoint [revision: "r159ebd85640-0-4", clusterId: 4, time: "2017-01-29 21:09:00.736"] exists inside minimal scope [2017-01-29 21:09:00.734, 2017-01-29 21:10:00.734].
{code}

h4. Time Interval Handling

The runs enforced the default collection limit (100000), so no on-disk sorting of candidate ids was necessary. Because the delete candidates were not evenly distributed over time, 15 runs were aborted and the time interval halved. (Those runs still deleted leaf nodes and reset {{_deletedOnce}} on visible documents they encountered, so they were not "lost".) The time interval was raised again by a factor of 1.5 whenever the number of collected (non-leaf) nodes stayed below 66% of the limit. This happened on every one of the initial runs, as they only encountered nodes where {{_deletedOnce}} needed to be reset. Intervals grew from ~9 hours to 13 days, then had to shrink again to clean up the lump of actually deleted non-leaf documents, with 13.25 minutes as the shortest scope of any run. In detail:
{code}
18:39:54.469 used 0.0% of limit, rec. interval of 63308 seconds
18:39:54.485 used 0.0% of limit, rec. interval of 94963 seconds
18:39:54.497 used 0.0% of limit, rec. interval of 142444 seconds
18:39:54.509 used 0.0% of limit, rec. interval of 213667 seconds
18:44:01.693 used 0.0% of limit, rec. interval of 320500 seconds
18:44:01.924 used 0.0% of limit, rec. interval of 480750 seconds
18:52:36.998 used 0.0% of limit, rec. interval of 721126 seconds
19:11:03.765 used 0.0% of limit, rec. interval of 1081689 seconds
00:27:51.562 used 0.0% of limit, rec. interval of 1622534 seconds
01:08:25.855 used 0.0% of limit, rec. interval of 2433801 seconds
03:41:52.568 used 0.0% of limit, rec. interval of 3650701 seconds
06:37:03.949 limit exceeded, reducing interval to 762539 seconds
06:41:30.360 used 0.0% of limit, rec. interval of 1143809 seconds
06:42:27.120 limit exceeded, reducing interval to 381269 seconds
06:42:48.380 used 0.0% of limit, rec. interval of 571904 seconds
06:42:51.860 limit exceeded, reducing interval to 190634 seconds
06:43:06.386 limit exceeded, reducing interval to 95317 seconds
06:43:21.085 used 0.0% of limit, rec. interval of 142976 seconds
06:43:23.816 limit exceeded, reducing interval to 71488 seconds
06:43:40.364 used 7.8% of limit, rec. interval of 107232 seconds
06:43:43.166 limit exceeded, reducing interval to 53616 seconds
06:43:55.608 limit exceeded, reducing interval to 26808 seconds
06:43:59.157 limit exceeded, reducing interval to 13404 seconds
06:44:03.167 limit exceeded, reducing interval to 6702 seconds
06:44:20.790 used 22.4% of limit, rec. interval of 10053 seconds
06:44:26.467 limit exceeded, reducing interval to 5026 seconds
06:46:24.131 used 53.4% of limit, rec. interval of 7539 seconds
06:48:16.552 limit exceeded, reducing interval to 3769 seconds
06:50:30.220 used 18.1% of limit, rec. interval of 5654 seconds
06:50:49.457 limit exceeded, reducing interval to 2827 seconds
06:51:02.985 used 36.8% of limit, rec. interval of 4241 seconds
06:51:38.116 limit exceeded, reducing interval to 2120 seconds
06:51:56.500 limit exceeded, reducing interval to 1060 seconds
06:52:43.217 limit exceeded, reducing interval to 530 seconds
06:54:56.950 used 58.4% of limit, rec. interval of 795 seconds
06:55:08.149 used 38.4% of limit, rec. interval of 1192 seconds
06:55:14.209 used 11.4% of limit, rec. interval of 1789 seconds
06:55:16.615 used 4.0% of limit, rec. interval of 2683 seconds
06:55:19.418 used 4.6% of limit, rec. interval of 4025 seconds
06:55:25.049 used 10.4% of limit, rec. interval of 6038 seconds
06:55:50.735 used 45.6% of limit, rec. interval of 9057 seconds
06:55:51.774 used 1.4% of limit, rec. interval of 13586 seconds
06:55:56.915 used 5.3% of limit, rec. interval of 20380 seconds
06:55:57.854 used 1.0% of limit, rec. interval of 30570 seconds
06:55:58.011 used 0.0% of limit, rec. interval of 45855 seconds
06:55:58.160 used 0.0% of limit, rec. interval of 68783 seconds
{code}

h4. Index Performance

Previous versions gave MongoDB an explicit hint to use the {{_deletedOnce}} index. The version used for this run had yesterday's oak/trunk integrated, where those hints have been removed, which increased performance drastically. This shows, for example, in finding the oldest {{_deletedOnce}} document on the first run:
{code}
18:39:22.792 getOldestDeletedOnceTimestamp() <- start
18:39:37.861 getOldestDeletedOnceTimestamp() -> 2016-10-20 17:47:50.000
{code}
This used to take ~3 minutes.

> VersionGarbageCollector should be able to run incrementally
> -----------------------------------------------------------
>
>         Key: OAK-4780
>         URL: https://issues.apache.org/jira/browse/OAK-4780
>     Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: core, documentmk
>    Reporter: Julian Reschke
> Attachments: leafnodes.diff, leafnodes-v2.diff, leafnodes-v3.diff
>
> Right now, the documentmk's version garbage collection runs in several phases.
> It first collects the paths of candidate nodes, and only once this has been
> successfully finished, starts actually deleting nodes.
> This can be a problem when the regularly scheduled garbage collection is
> interrupted during the path collection phase, maybe due to other maintenance
> tasks.
> On the next run, the number of paths to be collected will be even
> bigger, thus making it even more likely to fail.
> We should think about a change in the logic that would allow the GC to run in
> chunks; maybe by partitioning the path space by top level directory.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
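A minimal sketch of the interval adaptation described under *Time Interval Handling* above: halve the scope when the collection limit is exceeded, grow it by a factor of 1.5 when fewer than 66% of the limit was used. The class and method names below are hypothetical illustrations, not Oak's actual implementation (which also persists the recommendation between runs, as the "previous runs recommend a ... duration" log line shows):

```java
/**
 * Hypothetical sketch of the adaptive GC scope logic; names are
 * made up and do not correspond to Oak's real version GC classes.
 */
public class GcIntervalSketch {

    static final long COLLECT_LIMIT = 100000; // default collection limit
    static final double GROW_BELOW = 0.66;    // grow scope if usage < 66%
    static final double GROW_FACTOR = 1.5;

    /**
     * Recommend the scope duration (in seconds) for the next GC run.
     *
     * @param lastIntervalSec scope duration of the run that just finished
     * @param collected       number of non-leaf delete candidates collected
     * @param limitExceeded   whether the run was aborted at the limit
     */
    static long nextIntervalSeconds(long lastIntervalSec, long collected,
                                    boolean limitExceeded) {
        if (limitExceeded) {
            // aborted run: halve the interval and retry with a smaller scope
            return lastIntervalSec / 2;
        }
        if (collected < GROW_BELOW * COLLECT_LIMIT) {
            // successful run well below the limit: raise by factor 1.5
            return (long) (lastIntervalSec * GROW_FACTOR);
        }
        // successful run close to the limit: keep the current scope
        return lastIntervalSec;
    }

    public static void main(String[] args) {
        // roughly mirrors the log: 0.0% of limit used -> interval grows 1.5x
        System.out.println(nextIntervalSeconds(63308, 0, false));
        // limit exceeded -> interval halves (cf. 107232 -> 53616 in the log)
        System.out.println(nextIntervalSeconds(107232, 0, true));
    }
}
```

This kind of multiplicative-increase/halving scheme explains the sawtooth pattern in the log: long stretches of growth while only {{_deletedOnce}} resets are found, then rapid halving once a dense cluster of really deleted documents is hit.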