[ https://issues.apache.org/jira/browse/OAK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860355#comment-17860355 ]
Nuno Santos commented on OAK-10914: ----------------------------------- There are two different issues here: - Use of the index - Applying the filter condition using only the index or having to retrieve the full document from the column store The queries below all use the _modified_id index. But the query with the negation of the regex must retrieve all documents that pass the test on _modified to apply the regex filter on {{/content/dam}}. The use of the index is effective to restrict the number of documents retrieved from the column store to those that match the condition on {{_modified}}. But the query with the positive regex or the one using negative lookahead will only retrieve the documents that match the modified and the regex conditions, which can be a much smaller number, depending on the selectivity of the regex filter. Below I show the query plans for all the 3 options. h3. Positive regex {noformat} db.nodes.explain().find( { $and: [ {"_modified": {"$gte": 0}}, { _id: { $regex: /^[0-9]{1,3}:\/content\/dam\/.*$/ }}]} ).sort({ "_modified":1 }).hint( "_modified_1__id_1" ) winningPlan: { stage: 'FETCH', inputStage: { stage: 'IXSCAN', filter: { _id: { '$regex': '^[0-9]{1,3}:\\/content\\/dam\\/.*$' } }, keyPattern: { _modified: 1, _id: 1 }, indexName: '_modified_1__id_1', isMultiKey: false, multiKeyPaths: { _modified: [], _id: [] }, isUnique: false, isSparse: false, isPartial: false, indexVersion: 2, direction: 'forward', indexBounds: { _modified: [ '[0, inf.0]' ], _id: [ '["", {})', '[/^[0-9]{1,3}:\\/content\\/dam\\/.*$/, /^[0-9]{1,3}:\\/content\\/dam\\/.*$/]' ] } {noformat} Both filter on modified and the regex filter are applied during the IXSCAN. Mongo will only fetch the documents that passed both tests. h3. Negation of regex on Mongo query {noformat} db.nodes.find( { $and: [ {"_modified": {"$gte": 0}}, { _id: { $not: { $regex: /^[0-9]{1,3}:\/content\/dam\/.*$/ }}}]} ).sort({ "_modified":1 }).hint( "_modified_1__id_1" ).count() winningPlan: { stage: 'FETCH', filter: { _id: { '$not': { '$regex': '^[0-9]{1,3}:\\/content\\/dam\\/.*$' } } }, inputStage: { stage: 'IXSCAN', keyPattern: { _modified: 1, _id: 1 }, indexName: '_modified_1__id_1', isMultiKey: false, multiKeyPaths: { _modified: [], _id: [] }, isUnique: false, isSparse: false, isPartial: false, indexVersion: 2, direction: 'forward', indexBounds: { _modified: [ '[0, inf.0]' ], _id: [ '[MinKey, MaxKey]' ] } } }, {noformat} The IXSCAN phase only filters on _modified. Mongo will fetch all documents that match the _modified condition, and only then apply the regex. h3. Negative lookahead {noformat} db.nodes.explain().find( { $and: [ {"_modified": {"$gte": 0}}, { _id: { $regex: /^(?![0-9]{1,3}:\/content\/dam\/).*$/ } }]} ).sort({ "_modified":1 }).hint( "_modified_1__id_1" ) winningPlan: { stage: 'FETCH', inputStage: { stage: 'IXSCAN', filter: { _id: { '$regex': '^(?![0-9]{1,3}:\\/content\\/dam\\/).*$' } }, keyPattern: { _modified: 1, _id: 1 }, indexName: '_modified_1__id_1', isMultiKey: false, multiKeyPaths: { _modified: [], _id: [] }, isUnique: false, isSparse: false, isPartial: false, indexVersion: 2, direction: 'forward', indexBounds: { _modified: [ '[0, inf.0]' ], _id: [ '["", {})', '[/^(?![0-9]{1,3}:\\/content\\/dam\\/).*$/, /^(?![0-9]{1,3}:\\/content\\/dam\\/).*$/]' ] } {noformat} Like with the positive regex, Mongo applies the regex filter during the IXSCAN phase. I have tested all these queries and the one with negative lookahead has a comparable performance to the one with positive regex, and both of these perform much better than the one with the negated regex. > fullGC exclude paths should use _modified_id index > -------------------------------------------------- > > Key: OAK-10914 > URL: https://issues.apache.org/jira/browse/OAK-10914 > Project: Jackrabbit Oak > Issue Type: Task > Components: documentmk > Reporter: Stefan Egli > Assignee: Stefan Egli > Priority: Major > > As > [noticed|https://github.com/apache/jackrabbit-oak/pull/1547#discussion_r1648725361] > by [~nuno.santos] the current state of how exclude paths are implemented > would result in {{_modified_id}} index not being used and thus be slow. This > needs to be improved. -- This message was sent by Atlassian Jira (v8.20.10#820010)