[ https://issues.apache.org/jira/browse/JCRVLT-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18008650#comment-18008650 ]
Konrad Windszus commented on JCRVLT-810:
----------------------------------------

I think the
bq. 14 regex checks per property
is the point which can probably be generally improved with either caching the result and/or removing redundant checks. One check per property should be sufficient.

> Checking workspacefilter slowing down packaging import
> ------------------------------------------------------
>
>                 Key: JCRVLT-810
>                 URL: https://issues.apache.org/jira/browse/JCRVLT-810
>             Project: Jackrabbit FileVault
>          Issue Type: Task
>    Affects Versions: 4.0.0
>            Reporter: Joerg Hoh
>            Priority: Major
>
> I am investigating how to improve the performance of importing content
> packages in AEM, which contain about 60 individual pages with ~ 1500 nodes
> each (and ~ 14k properties in these 1500 nodes). These content packages are
> created by filevault and imported by filevault. The filter.xml looks like
> this:
> {noformat}
> <filter root="/content/fooo/bar">
>     <include pattern="/\Qcontent/foo/bar\E"/>
>     <include pattern="\Q/content/foo/bar\E/.*"/>
>     <exclude pattern=".*rep:policy"/>
>     <exclude pattern=".*rep:repoPolicy"/>
>     <exclude pattern="^.*/cq:lastReplicated.*" matchProperties="true"/>
>     <exclude pattern="^.*/cq:lastReplicatedBy.*" matchProperties="true"/>
>     <exclude pattern="^.*/cq:lastReplicationAction.*" matchProperties="true"/>
>     <exclude pattern="^.*/cq:isDelivered.*" matchProperties="true"/>
>     <exclude pattern="^.*/jcr:isCheckedOut.*" matchProperties="true"/>
>     <exclude pattern="^.*/jcr:baseVersion.*" matchProperties="true"/>
>     <exclude pattern="^.*/jcr:predecessors.*" matchProperties="true"/>
>     <exclude pattern="^.*/jcr:versionHistory.*" matchProperties="true"/>
>     <exclude pattern="^.*/jcr:activity.*" matchProperties="true"/>
>     <exclude pattern="^.*/jcr:configuration.*" matchProperties="true"/>
> </filter>
> [the same structure for the other 59 pages in this package]
> {noformat}
> During this investigation I already applied a series of improvements (not all
> yet reported, and not all yet committed), and I was able to bring the time
> down to 76 seconds.
> Now I see a lot of situations where the stack looks like this:
> {noformat}
>     at java.util.regex.Pattern$CharPropertyGreedy.match(java.base@21.0.2/Pattern.java:4461)
>     at java.util.regex.Pattern$Begin.match(java.base@21.0.2/Pattern.java:3851)
>     at java.util.regex.Matcher.match(java.base@21.0.2/Matcher.java:1794)
>     at java.util.regex.Matcher.matches(java.base@21.0.2/Matcher.java:754)
>     at org.apache.jackrabbit.vault.fs.filter.DefaultPathFilter.matches(DefaultPathFilter.java:92)
>     at org.apache.jackrabbit.vault.fs.api.PathFilterSet.contains(PathFilterSet.java:103)
>     at org.apache.jackrabbit.vault.fs.config.DefaultWorkspaceFilter.includesProperty(DefaultWorkspaceFilter.java:273)
>     at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.setUnprotectedProperties(DocViewImporter.java:1280)
>     at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.createNewNode(DocViewImporter.java:1182)
>     at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.addNode(DocViewImporter.java:931)
>     at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.startDocViewNode(DocViewImporter.java:410)
>     at org.apache.jackrabbit.vault.fs.impl.io.DocViewSAXHandler.startElement(DocViewSAXHandler.java:353)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(java.xml@21.0.2/AbstractSAXParser.java:518)
> {noformat}
> Here it seems that a lot of time is spent checking the WorkspaceFilter to
> determine whether the nodes in the content packages are actually covered by
> the filters and are therefore allowed to be imported. Given the specific
> circumstances in this case, this is consistently true.
> To assess the potential impact of an improvement in this case, I shortcut the
> logic in {{DefaultWorkspaceFilter.includesProperty}} and let it return
> consistently {{true}}.
> With this change in my test I was able to bring down
> the import time of the package to 52s, which is an improvement of more than
> 30%! This shows that there is a large potential impact in improving this
> logic, but I don't see how this can be improved in a significant way (the
> regex is already compiled), as this means that during this package
> installation we do 14 regex checks per property * 15k properties per page *
> 60 pages = 12.6M regex matches. But getting rid of these checks in the
> general case is also not possible.
> For that reason I am thinking about introducing a new flag to the
> ImportOptions, with which the code invoking the installation of a package
> can request that these checks are ignored. This information is then passed
> down to the WorkspaceFilter, which can take that into consideration and ignore
> them. In my specific case this is possible, as I control the creation of the
> packages, the transport to the consumer's side and also the code triggering
> the import of the package.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
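
To illustrate the "one check per property" idea from the comment, here is a minimal sketch that merges the ten property-level exclude patterns from the filter.xml above into a single alternation, so that each property path needs one Matcher.matches() call instead of ten. It uses only java.util.regex and is not FileVault code: the class name CombinedExcludeSketch and both helper methods are hypothetical, and it assumes the merged entries are consecutive excludes with the same matchProperties setting, so that merging them does not change the outcome. Whether the combined pattern is actually faster would have to be measured.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical standalone sketch; not part of the FileVault API.
class CombinedExcludeSketch {

    // Property-level exclude patterns copied from the filter.xml in the issue.
    static final List<String> EXCLUDES = List.of(
            "^.*/cq:lastReplicated.*",
            "^.*/cq:lastReplicatedBy.*",
            "^.*/cq:lastReplicationAction.*",
            "^.*/cq:isDelivered.*",
            "^.*/jcr:isCheckedOut.*",
            "^.*/jcr:baseVersion.*",
            "^.*/jcr:predecessors.*",
            "^.*/jcr:versionHistory.*",
            "^.*/jcr:activity.*",
            "^.*/jcr:configuration.*");

    // Baseline: every pattern compiled and checked on its own.
    static final List<Pattern> SEPARATE = EXCLUDES.stream()
            .map(Pattern::compile)
            .collect(Collectors.toList());

    // Sketch: one alternation of non-capturing groups, compiled once.
    static final Pattern COMBINED = Pattern.compile(
            EXCLUDES.stream()
                    .map(p -> "(?:" + p + ")")
                    .collect(Collectors.joining("|")));

    // Up to ten matches() calls per property path.
    public static boolean excludedSeparately(String path) {
        for (Pattern p : SEPARATE) {
            if (p.matcher(path).matches()) {
                return true;
            }
        }
        return false;
    }

    // Exactly one matches() call per property path.
    public static boolean excludedCombined(String path) {
        return COMBINED.matcher(path).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
                "/content/foo/bar/jcr:content/cq:lastReplicatedBy",
                "/content/foo/bar/jcr:content/jcr:title"
        };
        for (String path : samples) {
            System.out.println(path
                    + " separate=" + excludedSeparately(path)
                    + " combined=" + excludedCombined(path));
        }
    }
}
```

As a side note on redundant checks: in this particular filter the pattern for cq:lastReplicatedBy is already covered by the one for cq:lastReplicated, because the latter ends in ".*".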