Joerg Hoh created JCRVLT-810:
--------------------------------
Summary: Checking workspacefilter slowing down packaging import
Key: JCRVLT-810
URL: https://issues.apache.org/jira/browse/JCRVLT-810
Project: Jackrabbit FileVault
Issue Type: Task
Affects Versions: 4.0.0
Reporter: Joerg Hoh
I am investigating how to improve the performance of importing content packages
in AEM. Each package contains about 60 individual pages with ~1500 nodes each
(and ~14k properties in these 1500 nodes). These content packages are created
and imported by FileVault. The filter.xml looks like this:
{noformat}
<filter root="/content/foo/bar">
<include pattern="/\Qcontent/foo/bar\E"/>
<include pattern="\Q/content/foo/bar\E/.*"/>
<exclude pattern=".*rep:policy"/>
<exclude pattern=".*rep:repoPolicy"/>
<exclude pattern="^.*/cq:lastReplicated.*" matchProperties="true"/>
<exclude pattern="^.*/cq:lastReplicatedBy.*" matchProperties="true"/>
<exclude pattern="^.*/cq:lastReplicationAction.*" matchProperties="true"/>
<exclude pattern="^.*/cq:isDelivered.*" matchProperties="true"/>
<exclude pattern="^.*/jcr:isCheckedOut.*" matchProperties="true"/>
<exclude pattern="^.*/jcr:baseVersion.*" matchProperties="true"/>
<exclude pattern="^.*/jcr:predecessors.*" matchProperties="true"/>
<exclude pattern="^.*/jcr:versionHistory.*" matchProperties="true"/>
<exclude pattern="^.*/jcr:activity.*" matchProperties="true"/>
<exclude pattern="^.*/jcr:configuration.*" matchProperties="true"/>
</filter>
[the same structure for the other 59 pages in this package]
{noformat}
During this investigation I already applied a series of improvements (not all
of them reported or committed yet) and was able to bring the import time down
to 76 seconds.
Now I see a lot of situations where the stack looks like this:
{noformat}
at java.util.regex.Pattern$CharPropertyGreedy.match([email protected]/Pattern.java:4461)
at java.util.regex.Pattern$Begin.match([email protected]/Pattern.java:3851)
at java.util.regex.Matcher.match([email protected]/Matcher.java:1794)
at java.util.regex.Matcher.matches([email protected]/Matcher.java:754)
at org.apache.jackrabbit.vault.fs.filter.DefaultPathFilter.matches(DefaultPathFilter.java:92)
at org.apache.jackrabbit.vault.fs.api.PathFilterSet.contains(PathFilterSet.java:103)
at org.apache.jackrabbit.vault.fs.config.DefaultWorkspaceFilter.includesProperty(DefaultWorkspaceFilter.java:273)
at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.setUnprotectedProperties(DocViewImporter.java:1280)
at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.createNewNode(DocViewImporter.java:1182)
at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.addNode(DocViewImporter.java:931)
at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.startDocViewNode(DocViewImporter.java:410)
at org.apache.jackrabbit.vault.fs.impl.io.DocViewSAXHandler.startElement(DocViewSAXHandler.java:353)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement([email protected]/AbstractSAXParser.java:518)
{noformat}
Here it seems that a lot of time is spent checking the WorkspaceFilter to
determine whether the nodes in the content packages are actually covered by the
filters and are therefore allowed to be imported. Given the specific
circumstances in this case, this is always true.
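The check described above can be illustrated with a minimal sketch. It is not
the FileVault code: the class and method names are made up, the distinction
between node and property rules (matchProperties) is ignored, and it assumes
"last matching rule wins" semantics in which every rule is tried for every
path, which is how I understand PathFilterSet.contains to behave.

```java
import java.util.List;
import java.util.regex.Pattern;

// Simplified stand-in for a filter set; NOT the actual FileVault classes.
final class FilterCheckSketch {

    /** One include/exclude rule with a precompiled pattern. */
    static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(boolean include, String regex) {
            this.include = include;
            this.pattern = Pattern.compile(regex);
        }
    }

    // Root as used by the include patterns of the filter.xml above.
    static final String ROOT = "/content/foo/bar";

    // The 14 rules of one <filter> element from the filter.xml above.
    static final List<Rule> RULES = List.of(
        new Rule(true, "/\\Qcontent/foo/bar\\E"),
        new Rule(true, "\\Q/content/foo/bar\\E/.*"),
        new Rule(false, ".*rep:policy"),
        new Rule(false, ".*rep:repoPolicy"),
        new Rule(false, "^.*/cq:lastReplicated.*"),
        new Rule(false, "^.*/cq:lastReplicatedBy.*"),
        new Rule(false, "^.*/cq:lastReplicationAction.*"),
        new Rule(false, "^.*/cq:isDelivered.*"),
        new Rule(false, "^.*/jcr:isCheckedOut.*"),
        new Rule(false, "^.*/jcr:baseVersion.*"),
        new Rule(false, "^.*/jcr:predecessors.*"),
        new Rule(false, "^.*/jcr:versionHistory.*"),
        new Rule(false, "^.*/jcr:activity.*"),
        new Rule(false, "^.*/jcr:configuration.*"));

    /** Assumed semantics: the last matching rule decides; every rule is tried. */
    static boolean contains(String path) {
        if (!path.equals(ROOT) && !path.startsWith(ROOT + "/")) {
            return false; // outside the filter root
        }
        boolean result = !RULES.get(0).include;
        for (Rule rule : RULES) {
            if (rule.pattern.matcher(path).matches()) {
                result = rule.include;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(contains("/content/foo/bar/jcr:content/jcr:title"));         // true
        System.out.println(contains("/content/foo/bar/jcr:content/cq:lastReplicated")); // false
    }
}
```

Under these assumptions a property path inside the root always costs 14 calls
to Matcher.matches, even when an early include rule has already matched.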
To assess the potential impact of an improvement in this case, I short-circuited
the logic in {{DefaultWorkspaceFilter.includesProperty}} so that it always
returns {{true}}. With this change I was able to bring the import time of the
package in my test down to 52s, which is an improvement of more than 30%!
This shows that there is a large potential gain in improving this logic, but
I don't see how it can be improved significantly (the regexes are already
compiled): during this package installation we do 14 regex checks per property
* 15k properties per page * 60 pages = 12.6M regex matches.
But getting rid of these checks in the general case is not possible either.
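The numbers above can be reproduced with a small back-of-the-envelope
calculation. The class and method names are only illustrative, and the
per-match figure assumes that the entire 24s saving is regex time, which is an
upper bound rather than a measurement.

```java
// Back-of-the-envelope numbers taken from the description above.
final class RegexMatchEstimate {

    /** Upper bound on regex evaluations if every rule is tried for every property. */
    static long totalMatches(int rulesPerFilter, long propertiesPerPage, int pages) {
        return rulesPerFilter * propertiesPerPage * pages;
    }

    /** Relative saving when going from beforeSeconds to afterSeconds. */
    static double relativeSaving(double beforeSeconds, double afterSeconds) {
        return (beforeSeconds - afterSeconds) / beforeSeconds;
    }

    public static void main(String[] args) {
        long matches = totalMatches(14, 15_000, 60);
        double saving = relativeSaving(76, 52);
        // Rough time budget per match, assuming the whole saving is regex time.
        double nanosPerMatch = (76 - 52) * 1e9 / matches;
        System.out.println(matches); // 12600000
        System.out.printf("%.1f%% saved, ~%.0f ns per match%n", saving * 100, nanosPerMatch);
    }
}
```

At this volume even a check that costs only a few microseconds adds up to tens
of seconds over the whole import.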
For that reason I am thinking about introducing a new flag in the
ImportOptions, with which the code invoking the installation of a package can
request that these checks are skipped. This information is then passed down to
the WorkspaceFilter, which can take it into consideration and skip the checks.
In my specific case this is possible, as I control the creation of the
packages, the transport to the consumer's side, and the code triggering the
import of the package.
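A minimal sketch of what such an opt-out could look like. Everything here is
hypothetical: the Options class only stands in for ImportOptions, the flag name
skipFilterChecks is invented for illustration, and the filter is modelled as a
plain Predicate rather than the real WorkspaceFilter.

```java
import java.util.function.Predicate;

// Hypothetical sketch: these types only imitate the idea, they are NOT FileVault API.
final class TrustedImportSketch {

    /** Stand-in for ImportOptions with a hypothetical opt-out flag. */
    static final class Options {
        private boolean skipFilterChecks; // hypothetical name, defaults to false

        boolean isSkipFilterChecks() { return skipFilterChecks; }
        void setSkipFilterChecks(boolean skip) { this.skipFilterChecks = skip; }
    }

    /**
     * Returns the check to use during import: the regular filter by default,
     * or an always-true check when the caller vouches for the package content.
     */
    static Predicate<String> propertyCheck(Options options, Predicate<String> regularFilter) {
        return options.isSkipFilterChecks() ? path -> true : regularFilter;
    }

    public static void main(String[] args) {
        Predicate<String> regular = path -> path.startsWith("/content/foo/bar/");

        Options trusted = new Options();
        trusted.setSkipFilterChecks(true);

        System.out.println(propertyCheck(new Options(), regular).test("/etc/x")); // false
        System.out.println(propertyCheck(trusted, regular).test("/etc/x"));       // true
    }
}
```

Keeping the default at "checks enabled" means existing callers see no change in
behavior; only code that explicitly sets the flag bypasses the regex matching.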
--
This message was sent by Atlassian Jira
(v8.20.10#820010)