Joerg Hoh created JCRVLT-810:
--------------------------------

             Summary: Checking workspacefilter slowing down packaging import
                 Key: JCRVLT-810
                 URL: https://issues.apache.org/jira/browse/JCRVLT-810
             Project: Jackrabbit FileVault
          Issue Type: Task
    Affects Versions: 4.0.0
            Reporter: Joerg Hoh


I am investigating how to improve the performance of importing content packages 
in AEM. Each package contains about 60 individual pages with ~ 1500 nodes each 
(and ~ 14k properties in these 1500 nodes). These content packages are created 
and imported by FileVault. The filter.xml looks like this:

{noformat}
    <filter root="/content/foo/bar">
        <include pattern="/\Qcontent/foo/bar\E"/>
        <include pattern="\Q/content/foo/bar\E/.*"/>
        <exclude pattern=".*rep:policy"/>
        <exclude pattern=".*rep:repoPolicy"/>
        <exclude pattern="^.*/cq:lastReplicated.*" matchProperties="true"/>
        <exclude pattern="^.*/cq:lastReplicatedBy.*" matchProperties="true"/>
        <exclude pattern="^.*/cq:lastReplicationAction.*" matchProperties="true"/>
        <exclude pattern="^.*/cq:isDelivered.*" matchProperties="true"/>
        <exclude pattern="^.*/jcr:isCheckedOut.*" matchProperties="true"/>
        <exclude pattern="^.*/jcr:baseVersion.*" matchProperties="true"/>
        <exclude pattern="^.*/jcr:predecessors.*" matchProperties="true"/>
        <exclude pattern="^.*/jcr:versionHistory.*" matchProperties="true"/>
        <exclude pattern="^.*/jcr:activity.*" matchProperties="true"/>
        <exclude pattern="^.*/jcr:configuration.*" matchProperties="true"/>
    </filter>
    [the same structure for the other 59 pages in this package]
{noformat}

During this investigation I already applied a series of improvements (not all 
yet reported, and not all yet committed), and I was able to bring the time down 
to 76 seconds.

Now I see a lot of situations where the stack looks like this:

{noformat}
        at java.util.regex.Pattern$CharPropertyGreedy.match(java.base@21.0.2/Pattern.java:4461)
        at java.util.regex.Pattern$Begin.match(java.base@21.0.2/Pattern.java:3851)
        at java.util.regex.Matcher.match(java.base@21.0.2/Matcher.java:1794)
        at java.util.regex.Matcher.matches(java.base@21.0.2/Matcher.java:754)
        at org.apache.jackrabbit.vault.fs.filter.DefaultPathFilter.matches(DefaultPathFilter.java:92)
        at org.apache.jackrabbit.vault.fs.api.PathFilterSet.contains(PathFilterSet.java:103)
        at org.apache.jackrabbit.vault.fs.config.DefaultWorkspaceFilter.includesProperty(DefaultWorkspaceFilter.java:273)
        at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.setUnprotectedProperties(DocViewImporter.java:1280)
        at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.createNewNode(DocViewImporter.java:1182)
        at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.addNode(DocViewImporter.java:931)
        at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.startDocViewNode(DocViewImporter.java:410)
        at org.apache.jackrabbit.vault.fs.impl.io.DocViewSAXHandler.startElement(DocViewSAXHandler.java:353)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(java.xml@21.0.2/AbstractSAXParser.java:518)
{noformat}

Here it seems that a lot of time is spent checking the WorkspaceFilter to 
determine whether the nodes in the content packages are actually covered by the 
filters and are therefore allowed to be imported. Given the specific 
circumstances in this case, this is always true.

To assess the potential impact of an improvement in this case, I 
short-circuited the logic in {{DefaultWorkspaceFilter.includesProperty}} and 
let it consistently return {{true}}. With this change my test import took 52s 
instead of 76s, an improvement of more than 30%. This shows that improving this 
logic has a large potential impact, but I don't see how it can be improved 
significantly (the regexes are already compiled): during this package 
installation we do 14 regex checks per property * 15k properties per page * 60 
pages = 12.6M regex matches. Getting rid of these checks in the general case is 
also not possible.
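
The estimate above can be reproduced with a small standalone sketch. It assumes 
that every rule of a filter set is evaluated for each property path and that 
the last matching rule decides. The names {{FilterCostSketch}}, {{Rule}}, 
{{contains}} and {{estimateMatches}} are made up for illustration and are not 
FileVault code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class FilterCostSketch {

    /** One filter rule: a precompiled pattern and whether a match includes or excludes. */
    public static final class Rule {
        final Pattern pattern;
        final boolean include;

        public Rule(String regex, boolean include) {
            this.pattern = Pattern.compile(regex); // compiled once, reused for every path
            this.include = include;
        }
    }

    /** Upper bound on regex evaluations if every rule is tried for every property. */
    public static long estimateMatches(int rulesPerFilter, int propertiesPerPage, int pages) {
        return (long) rulesPerFilter * propertiesPerPage * pages;
    }

    /**
     * Assumed semantics (not taken from the FileVault sources): all rules are
     * evaluated in order and the last matching rule wins.
     */
    public static boolean contains(List<Rule> rules, String path) {
        boolean result = false;
        for (Rule rule : rules) {
            if (rule.pattern.matcher(path).matches()) {
                result = rule.include;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two of the rules from the filter.xml shown earlier in this issue.
        List<Rule> rules = Arrays.asList(
                new Rule("\\Q/content/foo/bar\\E/.*", true),
                new Rule("^.*/cq:lastReplicated.*", false));
        System.out.println(contains(rules, "/content/foo/bar/jcr:content/jcr:title"));
        System.out.println(contains(rules, "/content/foo/bar/jcr:content/cq:lastReplicated"));
        System.out.println(estimateMatches(14, 15_000, 60));
    }
}
```

Under these assumptions, precompiling the patterns removes only the compile 
cost. The per-property matching cost stays proportional to the number of rules.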

For that reason I am thinking about introducing a new flag in the 
ImportOptions, with which the code invoking the installation of a package can 
request that these checks are skipped. This information is then passed down to 
the WorkspaceFilter, which can take it into consideration and skip the checks.

In my specific case this is possible, as I control the creation of the 
packages, the transport to the consumer's side, and the code triggering the 
import of the package.
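
A minimal sketch of how such a flag could short-circuit the check. Everything 
here is hypothetical: {{Options}}, {{Filter}} and {{skipFilterChecks}} are 
stand-ins invented for illustration and do not exist in FileVault's 
ImportOptions or WorkspaceFilter API.

```java
import java.util.function.Predicate;

public class SkipFilterSketch {

    /** Hypothetical stand-in for import options; not the real ImportOptions. */
    public static final class Options {
        private boolean skipFilterChecks; // hypothetical flag, off by default

        public boolean isSkipFilterChecks() {
            return skipFilterChecks;
        }

        public void setSkipFilterChecks(boolean skipFilterChecks) {
            this.skipFilterChecks = skipFilterChecks;
        }
    }

    /** Hypothetical stand-in for a workspace filter that wraps the regex-based check. */
    public static final class Filter {
        private final Predicate<String> regexCheck;
        private final Options options;

        public Filter(Predicate<String> regexCheck, Options options) {
            this.regexCheck = regexCheck;
            this.options = options;
        }

        public boolean includesProperty(String propertyPath) {
            if (options.isSkipFilterChecks()) {
                return true; // caller vouches for the package content, no regex work
            }
            return regexCheck.test(propertyPath);
        }
    }

    public static void main(String[] args) {
        Options options = new Options();
        // Simplified check standing in for the regex rules of the filter.
        Filter filter = new Filter(p -> !p.endsWith("/cq:lastReplicated"), options);
        String path = "/content/foo/bar/jcr:content/cq:lastReplicated";
        System.out.println(filter.includesProperty(path)); // flag off: check runs
        options.setSkipFilterChecks(true);
        System.out.println(filter.includesProperty(path)); // flag on: check skipped
    }
}
```

With the flag off the behavior is unchanged, so only callers that explicitly 
opt in would bypass the filter.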

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
