[ 
https://issues.apache.org/jira/browse/OAK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860355#comment-17860355
 ] 

Nuno Santos commented on OAK-10914:
-----------------------------------

There are two different issues here:
- Use of the index
- Applying the filter condition using only the index or having to retrieve the 
full document from the column store

The queries below all use the _modified_id index. But the query with the 
negation of the regex must retrieve all documents that pass the test on 
_modified to apply the regex filter on {{/content/dam}}. The use of the index 
is effective to restrict the number of documents retrieved from the column 
store to those that match the condition on {{_modified}}. But the query with 
the positive regex or the one using negative lookahead will only retrieve the 
documents that match the modified and the regex conditions, which can be a much 
smaller number, depending on the selectivity of the regex filter. 

Below I show the query plans for all the 3 options.

h3. Positive regex
{noformat}
db.nodes.explain().find( { $and: [ {"_modified": {"$gte": 0}}, { _id: { $regex: 
/^[0-9]{1,3}:\/content\/dam\/.*$/ }}]} ).sort({ "_modified":1 }).hint( 
"_modified_1__id_1"  )

   winningPlan: {
      stage: 'FETCH',
      inputStage: {
        stage: 'IXSCAN',
        filter: { _id: { '$regex': '^[0-9]{1,3}:\\/content\\/dam\\/.*$' } },
        keyPattern: { _modified: 1, _id: 1 },
        indexName: '_modified_1__id_1',
        isMultiKey: false,
        multiKeyPaths: { _modified: [], _id: [] },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: {
          _modified: [ '[0, inf.0]' ],
          _id: [
            '["", {})',
            '[/^[0-9]{1,3}:\\/content\\/dam\\/.*$/, 
/^[0-9]{1,3}:\\/content\\/dam\\/.*$/]'
          ]
        }
{noformat}
Both filter on modified and the regex filter are applied during the IXSCAN. 
Mongo will only fetch the documents that passed both tests.

h3. Negation of regex on Mongo query
{noformat}
db.nodes.find( { $and: [ {"_modified": {"$gte": 0}}, { _id:  { $not: { $regex: 
/^[0-9]{1,3}:\/content\/dam\/.*$/ }}}]} ).sort({ "_modified":1 }).hint( 
"_modified_1__id_1"  ).count()

    winningPlan: {
      stage: 'FETCH',
      filter: {
        _id: { '$not': { '$regex': '^[0-9]{1,3}:\\/content\\/dam\\/.*$' } }
      },
      inputStage: {
        stage: 'IXSCAN',
        keyPattern: { _modified: 1, _id: 1 },
        indexName: '_modified_1__id_1',
        isMultiKey: false,
        multiKeyPaths: { _modified: [], _id: [] },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: { _modified: [ '[0, inf.0]' ], _id: [ '[MinKey, MaxKey]' ] 
}
      }
    },
{noformat}
The IXSCAN phase only filters on _modified. Mongo will fetch all documents that 
match the _modified condition, and only then apply the regex.

h3. Negative lookahead

{noformat}
db.nodes.explain().find( { $and: [ {"_modified": {"$gte": 0}}, { _id:  { 
$regex: /^(?![0-9]{1,3}:\/content\/dam\/).*$/ } }]} ).sort({ "_modified":1 
}).hint( "_modified_1__id_1"  )

   winningPlan: {
      stage: 'FETCH',
      inputStage: {
        stage: 'IXSCAN',
        filter: { _id: { '$regex': '^(?![0-9]{1,3}:\\/content\\/dam\\/).*$' } },
        keyPattern: { _modified: 1, _id: 1 },
        indexName: '_modified_1__id_1',
        isMultiKey: false,
        multiKeyPaths: { _modified: [], _id: [] },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: {
          _modified: [ '[0, inf.0]' ],
          _id: [
            '["", {})',
            '[/^(?![0-9]{1,3}:\\/content\\/dam\\/).*$/, 
/^(?![0-9]{1,3}:\\/content\\/dam\\/).*$/]'
          ]
        }
{noformat}
Like with the positive regex, Mongo applies the regex filter during the IXSCAN 
phase.

I have tested all these queries and the one with negative lookahead has a 
comparable performance to the one with positive regex, and both of these 
perform much better than the one with the negated regex.
 

> fullGC exclude paths should use _modified_id index
> --------------------------------------------------
>
>                 Key: OAK-10914
>                 URL: https://issues.apache.org/jira/browse/OAK-10914
>             Project: Jackrabbit Oak
>          Issue Type: Task
>          Components: documentmk
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>            Priority: Major
>
> As 
> [noticed|https://github.com/apache/jackrabbit-oak/pull/1547#discussion_r1648725361]
>  by [~nuno.santos] the current state of how exclude paths are implemented 
> would result in {{_modified_id}} index not being used and thus be slow. This 
> needs to be improved.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to