[ 
https://issues.apache.org/jira/browse/JCRVLT-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18008650#comment-18008650
 ] 

Konrad Windszus commented on JCRVLT-810:
----------------------------------------

I think the 
bq. 14 regex checks per property
is the point that can probably be improved in general, by caching the 
result and/or removing redundant checks. One check per property should be 
sufficient.
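
A rough sketch of the "caching the result" option, to make the idea concrete. Everything in it is hypothetical: {{CachingPathCheck}} and {{excludeAny}} are not FileVault classes or methods, and the stand-in predicate ("included unless any exclude pattern matches") is a simplification, not the actual {{PathFilterSet}} logic. It only pays off if the same path is checked more than once, and it is not thread-safe as written.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical helper (not FileVault API): memoizes the outcome of a path check.
class CachingPathCheck implements Predicate<String> {
    private final Predicate<String> delegate;
    private final Map<String, Boolean> cache;
    private int delegateCalls = 0;

    CachingPathCheck(Predicate<String> delegate, final int maxEntries) {
        this.delegate = delegate;
        // An access-ordered LinkedHashMap that evicts its eldest entry
        // behaves like a small LRU cache.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    @Override
    public boolean test(String path) {
        Boolean cached = cache.get(path);
        if (cached == null) {
            // Cache miss: run the (expensive) regex-based check once.
            delegateCalls++;
            cached = delegate.test(path);
            cache.put(path, cached);
        }
        return cached;
    }

    // Number of times the underlying check actually ran.
    int delegateCalls() {
        return delegateCalls;
    }

    // Simplified stand-in for the filter: a path is included unless
    // any of the precompiled exclude patterns matches it entirely.
    static Predicate<String> excludeAny(List<Pattern> excludes) {
        return path -> {
            for (Pattern p : excludes) {
                if (p.matcher(path).matches()) {
                    return false;
                }
            }
            return true;
        };
    }

    public static void main(String[] args) {
        List<Pattern> excludes = List.of(
                Pattern.compile("^.*/cq:lastReplicated.*"),
                Pattern.compile("^.*/jcr:baseVersion.*"));
        CachingPathCheck check = new CachingPathCheck(excludeAny(excludes), 1000);
        String path = "/content/foo/bar/jcr:title";
        check.test(path);
        check.test(path); // second call is answered from the cache
        System.out.println("underlying checks run: " + check.delegateCalls());
    }
}
```

Whether a cache like this or a reduction of the redundant checks is the better fit depends on how often the same path really reaches the filter during an import, which would need to be measured.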

> Checking workspacefilter slowing down packaging import
> ------------------------------------------------------
>
>                 Key: JCRVLT-810
>                 URL: https://issues.apache.org/jira/browse/JCRVLT-810
>             Project: Jackrabbit FileVault
>          Issue Type: Task
>    Affects Versions: 4.0.0
>            Reporter: Joerg Hoh
>            Priority: Major
>
> I am investigating how to improve the performance of importing content 
> packages in AEM, which contain about 60 individual pages with ~ 1500 nodes 
> each (and ~ 14k properties in these 1500 nodes). These content packages are 
> created by filevault and imported by filevault. The filter.xml looks like 
> this:
> {noformat}
>     <filter root="/content/foo/bar">
>         <include pattern="/\Qcontent/foo/bar\E"/>
>         <include pattern="\Q/content/foo/bar\E/.*"/>
>         <exclude pattern=".*rep:policy"/>
>         <exclude pattern=".*rep:repoPolicy"/>
>         <exclude pattern="^.*/cq:lastReplicated.*" matchProperties="true"/>
>         <exclude pattern="^.*/cq:lastReplicatedBy.*" matchProperties="true"/>
>         <exclude pattern="^.*/cq:lastReplicationAction.*" matchProperties="true"/>
>         <exclude pattern="^.*/cq:isDelivered.*" matchProperties="true"/>
>         <exclude pattern="^.*/jcr:isCheckedOut.*" matchProperties="true"/>
>         <exclude pattern="^.*/jcr:baseVersion.*" matchProperties="true"/>
>         <exclude pattern="^.*/jcr:predecessors.*" matchProperties="true"/>
>         <exclude pattern="^.*/jcr:versionHistory.*" matchProperties="true"/>
>         <exclude pattern="^.*/jcr:activity.*" matchProperties="true"/>
>         <exclude pattern="^.*/jcr:configuration.*" matchProperties="true"/>
>     </filter>
>     [the same structure for the other 59 pages in this package]
> {noformat}
> During this investigation I already applied a series of improvements (not all 
> yet reported, and not all yet committed), and I was able to bring the time 
> down to 76 seconds.
> Now I see a lot of situations where the stack looks like this:
> {noformat}
>         at java.util.regex.Pattern$CharPropertyGreedy.match(java.base@21.0.2/Pattern.java:4461)
>         at java.util.regex.Pattern$Begin.match(java.base@21.0.2/Pattern.java:3851)
>         at java.util.regex.Matcher.match(java.base@21.0.2/Matcher.java:1794)
>         at java.util.regex.Matcher.matches(java.base@21.0.2/Matcher.java:754)
>         at org.apache.jackrabbit.vault.fs.filter.DefaultPathFilter.matches(DefaultPathFilter.java:92)
>         at org.apache.jackrabbit.vault.fs.api.PathFilterSet.contains(PathFilterSet.java:103)
>         at org.apache.jackrabbit.vault.fs.config.DefaultWorkspaceFilter.includesProperty(DefaultWorkspaceFilter.java:273)
>         at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.setUnprotectedProperties(DocViewImporter.java:1280)
>         at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.createNewNode(DocViewImporter.java:1182)
>         at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.addNode(DocViewImporter.java:931)
>         at org.apache.jackrabbit.vault.fs.impl.io.DocViewImporter.startDocViewNode(DocViewImporter.java:410)
>         at org.apache.jackrabbit.vault.fs.impl.io.DocViewSAXHandler.startElement(DocViewSAXHandler.java:353)
>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(java.xml@21.0.2/AbstractSAXParser.java:518)
> {noformat}
> Here it seems that a lot of time is spent checking the WorkspaceFilter to 
> see whether the nodes in the content packages are actually covered by the 
> filters and are therefore allowed to be imported. Given the specific 
> circumstances in this case, this is consistently true.
> To assess the potential impact of an improvement in this case, I 
> short-circuited the logic in {{DefaultWorkspaceFilter.includesProperty}} and 
> let it consistently return {{true}}. With this change in my test I was able 
> to bring the import time of the package down to 52s, an improvement of more 
> than 30%. This shows that there is a large potential impact in improving 
> this logic, but I don't see how it can be improved in a significant way (the 
> regex is already compiled), as during this package installation we do 
> 14 regex checks per property * 15k properties per page * 60 pages = 12.6M 
> regex matches. But getting rid of these checks in the general case is also 
> not possible.
> For that reason I am thinking about introducing a new flag to the 
> ImportOptions, with which the code invoking the installation of a package 
> can request that these checks are skipped. This information is then passed 
> down to the WorkspaceFilter, which can take that into consideration and 
> ignore the checks.
> In my specific case this is possible, as I control the creation of the 
> packages, the transport to the consumer's side and also the code triggering 
> the import of the package.
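
For illustration only, the flag proposed in the quoted description could be sketched roughly as follows. All names ({{FilterCheckSketch}}, {{Options}}, {{skipFilterChecks}}, {{Filter}}) are hypothetical stand-ins and do not exist in FileVault; the real {{ImportOptions}} and {{WorkspaceFilter}} interfaces are not modelled here.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these types are FileVault API.
class FilterCheckSketch {

    // Stand-in for the options object handed over by the code that
    // triggers the package installation.
    static final class Options {
        final boolean skipFilterChecks;

        Options(boolean skipFilterChecks) {
            this.skipFilterChecks = skipFilterChecks;
        }
    }

    // Stand-in for a workspace filter that honours the flag.
    static final class Filter {
        private final Predicate<String> regexCheck;
        private final Options options;

        Filter(Predicate<String> regexCheck, Options options) {
            this.regexCheck = regexCheck;
            this.options = options;
        }

        boolean includesProperty(String propertyPath) {
            if (options.skipFilterChecks) {
                // The caller vouches that the package content matches its
                // filter, so the regex evaluation is bypassed entirely.
                return true;
            }
            return regexCheck.test(propertyPath);
        }
    }

    public static void main(String[] args) {
        Pattern exclude = Pattern.compile("^.*/jcr:baseVersion.*");
        // Simplified check: included unless the single exclude matches.
        Predicate<String> check = p -> !exclude.matcher(p).matches();

        Filter strict = new Filter(check, new Options(false));
        Filter trusting = new Filter(check, new Options(true));
        String path = "/content/foo/bar/jcr:baseVersion";
        System.out.println("strict:   " + strict.includesProperty(path));
        System.out.println("trusting: " + trusting.includesProperty(path));
    }
}
```

The trade-off is visible in the sketch: with the flag set, a property that the filter would exclude is let through, so such a switch is only safe when the installing code can guarantee that package content and filter agree.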



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
