Joerg Hoh created JCRVLT-810:
--------------------------------

             Summary: Checking workspacefilter slowing down packaging import
                 Key: JCRVLT-810
                 URL: https://issues.apache.org/jira/browse/JCRVLT-810
             Project: Jackrabbit FileVault
          Issue Type: Task
    Affects Versions: 4.0.0
            Reporter: Joerg Hoh

I am investigating how to improve the performance of importing content packages in AEM. Each package contains about 60 individual pages with ~ 1500 nodes each (and ~ 14k properties in these 1500 nodes). These content packages are both created and imported by FileVault. The filter.xml looks like this:

{noformat}
<filter root="/content/fooo/bar">
    <include pattern="/\Qcontent/foo/bar\E"/>
    <include pattern="\Q/content/foo/bar\E/.*"/>
    <exclude pattern=".*rep:policy"/>
    <exclude pattern=".*rep:repoPolicy"/>
    <exclude pattern="^.*/cq:lastReplicated.*" matchProperties="true"/>
    <exclude pattern="^.*/cq:lastReplicatedBy.*" matchProperties="true"/>
    <exclude pattern="^.*/cq:lastReplicationAction.*" matchProperties="true"/>
    <exclude pattern="^.*/cq:isDelivered.*" matchProperties="true"/>
    <exclude pattern="^.*/jcr:isCheckedOut.*" matchProperties="true"/>
    <exclude pattern="^.*/jcr:baseVersion.*" matchProperties="true"/>
    <exclude pattern="^.*/jcr:predecessors.*" matchProperties="true"/>
    <exclude pattern="^.*/jcr:versionHistory.*" matchProperties="true"/>
    <exclude pattern="^.*/jcr:activity.*" matchProperties="true"/>
    <exclude pattern="^.*/jcr:configuration.*" matchProperties="true"/>
</filter>
[the same structure for the other 59 pages in this package]
{noformat}

During this investigation I have already applied a series of improvements (not all reported yet, and not all committed yet), which brought the import time down to 76 seconds.

Now I frequently see stacks that look like this:

{noformat}
at java.util.regex.Pattern$CharPropertyGreedy.match(java.base@21.0.2/Pattern.java:4461)
at java.util.regex.Pattern$Begin.match(java.base@21.0.2/Pattern.java:3851)
at java.util.regex.Matcher.match(java.base@21.0.2/Matcher.java:1794)
at java.util.regex.Matcher.matches(java.base@21.0.2/Matcher.java:754)
at org.apache.jackrabbit.vault.fs.filter.DefaultPathFilter.matches(DefaultPathFilter.java:92)
at org.apache.jackrabbit.vault.fs.api.PathFilterSet.contains(PathFilterSet.java:103)
at org.apache.jackrabbit.vault.fs.config.DefaultWorkspaceFilter.includesProperty(DefaultWorkspaceFilter.java:273)
at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.setUnprotectedProperties(DocViewImporter.java:1280)
at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.createNewNode(DocViewImporter.java:1182)
at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.addNode(DocViewImporter.java:931)
at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.startDocViewNode(DocViewImporter.java:410)
at org.apache.jackrabbit.vault.fs.impl.io.DocViewSAXHandler.startElement(DocViewSAXHandler.java:353)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(java.xml@21.0.2/AbstractSAXParser.java:518)
{noformat}

It seems that a lot of time is spent checking against the WorkspaceFilter whether the nodes and properties in the content package are actually covered by the filters and are therefore allowed to be imported. In this specific case that is always true.

To assess the potential impact of an improvement here, I short-circuited the logic in {{DefaultWorkspaceFilter.includesProperty}} so that it always returns {{true}}. With this change, the import time of the package in my test dropped from 76s to 52s, an improvement of more than 30%!

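
To make the per-property cost concrete, here is a minimal Java sketch of the kind of check the stack trace points at. It is not the FileVault code: the class FilterCostSketch, its Rule type and the matchCalls counter are made up for illustration, all 14 patterns from the filter.xml above are lumped into a single list, and a last-match-wins evaluation over every rule is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative stand-in for a filter set; NOT the FileVault implementation. */
class FilterCostSketch {

    /** One include/exclude rule with its precompiled pattern (hypothetical type). */
    static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(boolean include, String regex) {
            this.include = include;
            this.pattern = Pattern.compile(regex);
        }
    }

    static final List<Rule> RULES = new ArrayList<>();

    /** Counts how often a regex is evaluated (instrumentation for the sketch only). */
    static long matchCalls = 0;

    static {
        // The 14 patterns from the filter.xml quoted in this issue.
        RULES.add(new Rule(true, "/\\Qcontent/foo/bar\\E"));
        RULES.add(new Rule(true, "\\Q/content/foo/bar\\E/.*"));
        RULES.add(new Rule(false, ".*rep:policy"));
        RULES.add(new Rule(false, ".*rep:repoPolicy"));
        String[] excludedProperties = {
            "cq:lastReplicated", "cq:lastReplicatedBy", "cq:lastReplicationAction",
            "cq:isDelivered", "jcr:isCheckedOut", "jcr:baseVersion",
            "jcr:predecessors", "jcr:versionHistory", "jcr:activity",
            "jcr:configuration"
        };
        for (String name : excludedProperties) {
            RULES.add(new Rule(false, "^.*/" + name + ".*"));
        }
    }

    /**
     * Assumed semantics: every rule is evaluated and the last matching rule
     * decides, so an accepted path still pays for all patterns.
     */
    static boolean includes(String path) {
        boolean result = !RULES.get(0).include;
        for (Rule rule : RULES) {
            matchCalls++;
            if (rule.pattern.matcher(path).matches()) {
                result = rule.include;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String property = "/content/foo/bar/jcr:content/jcr:title";
        System.out.println("included: " + includes(property));
        System.out.println("regex evaluations for one property: " + matchCalls);
    }
}
```

Under these assumptions an ordinary property is accepted only after all 14 patterns have been tried, and most of them begin with a greedy {{.*}} that has to scan the full path before it can fail.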
This experiment shows that improving this logic has a large potential impact, but I don't see how the matching itself can be improved significantly (the regex is already compiled): during this package installation we do 14 regex checks per property × ~15k properties per page × 60 pages ≈ 12.6M regex matches. Getting rid of these checks in the general case is not possible either.

For that reason I am thinking about introducing a new flag in the ImportOptions, with which the code invoking the installation of a package can request that these checks be skipped. This information is then passed down to the WorkspaceFilter, which can take it into consideration and skip the checks.

In my specific case this is possible, as I control the creation of the packages, the transport to the consumer's side, and the code triggering the import of the package.
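
A rough sketch of what such an opt-out could look like, in plain Java. Everything here is hypothetical: the Options class, the skipFilterChecks flag and the effectiveFilter helper are invented names for illustration and are not part of the FileVault API. How the flag would actually be wired into ImportOptions and the WorkspaceFilter is left open.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Sketch of the proposed opt-out; all names are hypothetical, not FileVault API. */
class SkipFilterSketch {

    /** Stand-in for the import options carrying the proposed flag. */
    static final class Options {
        // Defaults to false so the regular filter checks stay in place.
        boolean skipFilterChecks = false;
    }

    /**
     * Picks the property check used during import: the regular regex-based
     * filter, or an accept-all check when the caller explicitly vouches for
     * the package content.
     */
    static Predicate<String> effectiveFilter(Predicate<String> regexFilter, Options options) {
        if (options.skipFilterChecks) {
            return path -> true;
        }
        return regexFilter;
    }

    public static void main(String[] args) {
        Pattern include = Pattern.compile("\\Q/content/foo/bar\\E/.*");
        Predicate<String> regexFilter = path -> include.matcher(path).matches();

        Options trusted = new Options();
        trusted.skipFilterChecks = true;

        String property = "/content/foo/bar/jcr:content/jcr:title";
        System.out.println(effectiveFilter(regexFilter, new Options()).test(property));
        System.out.println(effectiveFilter(regexFilter, trusted).test(property));
    }
}
```

The decision is made once per import rather than once per property, so the accept-all path costs no regex evaluation at all. The trade-off is that a package with content outside its filter would be imported unchecked, which is why the flag would have to be an explicit opt-in by the caller.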