Hello, the saga continues... I've written a small robot to traverse the
versions storage and find the unreferenced version histories (actually, it
finds all nt:versionHistory nodes whose jcr:versionableUuid points to a
node that is no longer stored).
Using an old Java GUI tool (which still works, at least with JR 2.6) to
browse and manipulate the repository, I've deleted some versionable nodes to
create the orphans.

The full code (minus the copyright) for the robot is this:

package com.calenco.core.robot;

import com.calenco.storage.PooledRepoManager;
import java.util.TimerTask;
import javax.jcr.ItemNotFoundException;
import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import org.apache.jackrabbit.JcrConstants;
import org.apache.log4j.Logger;

/**
 *
 * Task that will periodically check orphaned revision history nodes
 * (nt:versionHistory) on the jcr:versionStorage tree and delete the orphans
 * found. Ideally, it should not find any orphan... but V3.1 and previous
 * Calenco versions left some orphaned revision history nodes on certain
 * Eraser extension operations.
 *
 * @author fabman
 */
public class JcrHistoryCleanerTimerTask extends TimerTask {

    public static final Logger LOG =
            Logger.getLogger(JcrHistoryCleanerTimerTask.class);
    private static final PooledRepoManager REPOMGR =
            PooledRepoManager.getInstance();

    public JcrHistoryCleanerTimerTask() {
        super();
    }

    @Override
    public void run() {
        LOG.info(String.format("%s running...", getClass()));
        Session nsession = null;
        try {
            nsession = REPOMGR.getSessionRW();
            processNode(nsession,
                    nsession.getNode("/jcr:system/jcr:versionStorage"));
            if (nsession.hasPendingChanges()) {
                nsession.save();
            }
        } catch (RepositoryException ex) {
            LOG.warn(String.format(
                    "Got %s while trying to clean revision history tree", ex), ex);
        } finally {
            // getSessionRW() may have thrown before a session was assigned
            if (nsession != null) {
                REPOMGR.releaseSessionRW(nsession);
            }
        }
    }

    private void processNode(Session nsession, Node node)
            throws RepositoryException {
        String npath = node.getPath();
        String ntype = node.getPrimaryNodeType().getName();
        LOG.info(String.format("Processing node %s [%s]...", npath, ntype));
        if (null == ntype) {
            LOG.warn(String.format("!!! Skipping node %s of type %s",
                    npath, ntype));
        } else {
            switch (ntype) {
                case "rep:versionStorage":
                    // recurse...
                    NodeIterator nit = node.getNodes();
                    while (nit.hasNext()) {
                        processNode(nsession, nit.nextNode());
                    }
                    break;
                case JcrConstants.NT_VERSIONHISTORY:
                    String vuuid = node
                            .getProperty(JcrConstants.JCR_VERSIONABLEUUID)
                            .getString();
                    if (!nodeExists(nsession, vuuid)) {
                        node.remove();
                        LOG.info(String.format(">>> Node with UUID %s is "
                                + "not stored anymore... its revision history "
                                + "has been deleted...", vuuid));
                    }
                    break;
                default:
                    LOG.warn(String.format("!!! Skipping node %s of type %s",
                            npath, ntype));
                    break;
            }
        }
    }

    private boolean nodeExists(Session nsession, String uuid) {
        try {
            nsession.getNodeByIdentifier(uuid);
            return true;
        } catch (ItemNotFoundException ignoredReturningFalse) {
            return false;
        } catch (RepositoryException ex) {
            LOG.warn(String.format("!!! Got unexpected %s while checking if "
                    + "node with UUID %s exists", ex.getMessage(), uuid));
            return false;
        }
    }
}

When the robot is run, I get this on the log:

2018-03-16 08:43:06,444  INFO com.calenco.core.robot.JcrHistoryCleanerTimerTask:42 - class com.calenco.core.robot.JcrHistoryCleanerTimerTask running...
2018-03-16 08:43:06,445  INFO com.calenco.core.robot.JcrHistoryCleanerTimerTask:60 - Processing node /jcr:system/jcr:versionStorage [rep:versionStorage]...
2018-03-16 08:43:06,448  INFO com.calenco.core.robot.JcrHistoryCleanerTimerTask:60 - Processing node /jcr:system/jcr:versionStorage/78 [rep:versionStorage]...
2018-03-16 08:43:06,449  INFO com.calenco.core.robot.JcrHistoryCleanerTimerTask:60 - Processing node /jcr:system/jcr:versionStorage/78/11 [rep:versionStorage]...
2018-03-16 08:43:06,451  INFO com.calenco.core.robot.JcrHistoryCleanerTimerTask:60 - Processing node /jcr:system/jcr:versionStorage/78/11/fe [rep:versionStorage]...
2018-03-16 08:43:06,454  INFO com.calenco.core.robot.JcrHistoryCleanerTimerTask:60 - Processing node /jcr:system/jcr:versionStorage/78/11/fe/7811fe34-0aae-432a-9015-2626ee77bced [nt:versionHistory]...
2018-03-16 08:43:06,457  INFO com.calenco.core.robot.JcrHistoryCleanerTimerTask:60 - Processing node /jcr:system/jcr:versionStorage/78/da [rep:versionStorage]...
2018-03-16 08:43:06,459  INFO com.calenco.core.robot.JcrHistoryCleanerTimerTask:60 - Processing node /jcr:system/jcr:versionStorage/78/da/be [rep:versionStorage]...
2018-03-16 08:43:06,460  INFO com.calenco.core.robot.JcrHistoryCleanerTimerTask:60 - Processing node /jcr:system/jcr:versionStorage/78/da/be/78dabe1c-ead8-414d-8455-4ece9412c1a6 [nt:versionHistory]...


[LOTS_OF_RECURSIVE_TRAVERSAL_RELATED_STUFF_REMOVED_FROM_HERE]


2018-03-16 08:43:06,737  INFO com.calenco.core.robot.JcrHistoryCleanerTimerTask:60 - Processing node /jcr:system/jcr:versionStorage/e9 [rep:versionStorage]...
2018-03-16 08:43:06,739  INFO com.calenco.core.robot.JcrHistoryCleanerTimerTask:60 - Processing node /jcr:system/jcr:versionStorage/e9/07 [rep:versionStorage]...
2018-03-16 08:43:06,739  INFO com.calenco.core.robot.JcrHistoryCleanerTimerTask:60 - Processing node /jcr:system/jcr:versionStorage/e9/07/ed [rep:versionStorage]...
2018-03-16 08:43:06,740  INFO com.calenco.core.robot.JcrHistoryCleanerTimerTask:60 - Processing node /jcr:system/jcr:versionStorage/e9/07/ed/e907edad-ad75-4572-ba44-56a9f5741fb9 [nt:versionHistory]...

2018-03-16 08:43:06,743  WARN com.calenco.core.robot.JcrHistoryCleanerTimerTask:51 - Got javax.jcr.nodetype.ConstraintViolationException: Unable to perform operation. Node is protected. while trying to clean revision history tree
javax.jcr.nodetype.ConstraintViolationException: Unable to perform operation. Node is protected.
    at org.apache.jackrabbit.core.ItemValidator.checkCondition(ItemValidator.java:276)
    at org.apache.jackrabbit.core.ItemValidator.checkRemove(ItemValidator.java:254)
    at org.apache.jackrabbit.core.ItemRemoveOperation.perform(ItemRemoveOperation.java:63)
    at org.apache.jackrabbit.core.session.SessionState.perform(SessionState.java:200)
    at org.apache.jackrabbit.core.ItemImpl.perform(ItemImpl.java:91)
    at org.apache.jackrabbit.core.ItemImpl.remove(ItemImpl.java:322)
    at com.calenco.core.robot.JcrHistoryCleanerTimerTask.processNode(JcrHistoryCleanerTimerTask.java:75)
    at com.calenco.core.robot.JcrHistoryCleanerTimerTask.processNode(JcrHistoryCleanerTimerTask.java:69)
    at com.calenco.core.robot.JcrHistoryCleanerTimerTask.processNode(JcrHistoryCleanerTimerTask.java:69)
    at com.calenco.core.robot.JcrHistoryCleanerTimerTask.processNode(JcrHistoryCleanerTimerTask.java:69)
    at com.calenco.core.robot.JcrHistoryCleanerTimerTask.processNode(JcrHistoryCleanerTimerTask.java:69)
    at com.calenco.core.robot.JcrHistoryCleanerTimerTask.run(JcrHistoryCleanerTimerTask.java:46)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)

I've removed lots of log entries for nodes that don't need processing so
you don't have to scroll down that much to get to the interesting part. The
removed entries are basically the recursive traversal of the revision
history tree nodes. The version history for the node with UUID
e907edad-ad75-4572-ba44-56a9f5741fb9 is an orphan: no such node is stored
anymore.
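
As an aside, the paths in the log suggest how the storage is laid out: each
nt:versionHistory node sits under three levels of rep:versionStorage nodes
named after the first three pairs of hex digits of the UUID in its own name.
Here is a minimal sketch of that mapping, assuming the layout holds for the
whole repository (I've only inferred it from the log entries above). The
class and method names are made up for illustration; they are not part of
the JCR or Jackrabbit API:

```java
import java.util.UUID;

/** Hypothetical helper; not part of the JCR or Jackrabbit API. */
final class VersionStoragePaths {

    static final String ROOT = "/jcr:system/jcr:versionStorage";

    private VersionStoragePaths() {
    }

    /**
     * Builds the path at which the log shows the nt:versionHistory node
     * named after the given UUID.
     * Assumption: three two-character levels taken from the start of the
     * UUID, as seen in the logged paths.
     */
    static String historyPath(String uuid) {
        // Rejects strings that are not UUIDs and yields the lowercase form
        String id = UUID.fromString(uuid).toString();
        return ROOT
                + "/" + id.substring(0, 2)
                + "/" + id.substring(2, 4)
                + "/" + id.substring(4, 6)
                + "/" + id;
    }
}
```

For the orphan above, historyPath("e907edad-ad75-4572-ba44-56a9f5741fb9")
gives the same path that appears in the last "Processing node" log entry.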

So it seems that the tree under /jcr:system/jcr:versionStorage is protected
(sensibly so, but it keeps me from removing what I need), in a way that the
"regular" node manipulation API (node.remove()) cannot be used on those
nodes.

Our application has a JCR garbage collector scheduled to run. I let it run:

2018-03-16 09:00:00,205  INFO com.calenco.storage.PooledRepoManager:757 - Starting repository garbage collection...
2018-03-16 09:00:00,456  INFO com.calenco.storage.PooledRepoManager:770 - Finished repository garbage collection.

Then I stopped the application and used the tool to inspect the repository...
I still find the orphaned version history nodes, which waste disk space and
whose references point nowhere (to no stored node, which is what defines
them as orphans in the first place...). So the orphaned version history
nodes are not eligible for garbage collection (one of my original
questions, answered now).

Even though disk space is cheap these days, I don't want JCR to store
useless garbage for my application's data (our current repository is 60GB
and growing...).

Is there a way to manipulate those nodes under
/jcr:system/jcr:versionStorage with the JCR API so I can remove the orphans?

Thanks once more for your help, and sorry for the long, detailed message.

Best regards.


On Thu, Mar 15, 2018 at 7:54 AM, Julian Reschke <[email protected]>
wrote:

> On 2018-03-15 11:47, Fabián Mandelbaum wrote:
>
>> I'll try to provide the details for dependency hell with 2.6 and other
>> frameworks/libs later, I'm a bit overloaded with other things now, sorry.
>>
>
> Thanks.
>
>> Nothing happens with the tree, OK, but is it eligible to be
>> garbage-collected or not?
>>
>
> I wouldn't call it "garbage" :-) But yes, it should be possible to delete
> those.
>
>> Indeed, if you try to delete a referenced node you get a
>> ReferentialIntegrityException, which is fine. Now, I'll be fine with
>> writing a robot (periodically-run task) to find all such orphans and wipe
>> them. How can I detect those orphans SAFELY (in a way that attempting to
>> delete them will not throw ReferentialIntegrityException, and will not
>> break my content repository in any way). Is there a ready-made utility
>> method in the API or a query, or set of queries, I can run to detect those
>> orphans?
>>
>> Thanks again for your help.
>>
>
> I'm not sure there's a way to write a query for this.
>
> You should be able to traverse the versions storage and find the
> unreferenced version histories. That said, why don't you simply collect the
> paths while deleting the version-controlled nodes?
>
> Best regards, Julian
>



-- 
Fabián Mandelbaum
IS Engineer
