[PR] OAK-10682 - [Indexing job] Improve Mongo regex filter to only use positive conditions (no negations) [jackrabbit-oak]

via GitHub Tue, 02 Apr 2024 05:20:37 -0700


nfsantos opened a new pull request, #1394:
URL: https://github.com/apache/jackrabbit-oak/pull/1394


   The current implementation filters excluded paths and custom regexes with a 
condition like
   
   `_id: { $nin: [ /^[0-9]{1,3}:\/content\/dam\/.*$/ ] }`
   
   Mongo cannot evaluate this condition without retrieving the full document, 
because a `null` value would also match it and the index does not contain null 
values. Therefore, when the index contains excluded paths, the download is much 
slower because Mongo has to retrieve every single document to evaluate the 
condition.
   
   As a workaround, we can transform the regex into one that matches the 
complement of the original regex using 
[negative lookahead](https://stackoverflow.com/questions/1240275/how-to-negate-specific-word-in-regex). 
This allows rewriting the filter using only positive conditions, which Mongo 
can evaluate using only the index.
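As a rough illustration of the rewrite, the sketch below builds the negative-lookahead form for one excluded path and checks that it accepts exactly the ids the original regex rejects. `complementOfPrefix` and the sample ids are hypothetical and are not taken from the Oak code. The equivalence also assumes ids contain no line terminators, since `.` in the original pattern does not match them.

```javascript
// Hypothetical helper (not Oak code): builds a regex that matches every string
// that does NOT start with the given prefix pattern, using a negative lookahead.
function complementOfPrefix(prefixSource) {
  return new RegExp("^(?!" + prefixSource + ")");
}

// Original exclusion regex, as used with $not / $nin.
const excluded = /^[0-9]{1,3}:\/content\/dam\/.*$/;
// Positive-only equivalent: matches ids that are outside /content/dam/.
const included = complementOfPrefix("[0-9]{1,3}:\\/content\\/dam\\/");

// Made-up sample ids in the "<depth>:<path>" shape seen in the queries.
const sampleIds = [
  "3:/content/dam/image.png",
  "3:/content/dam/",
  "2:/content/site",
  "3:/oak:index/lucene",
  "1234:/content/dam/x",
  "0:/",
];

// An id disagrees if "not excluded" and "included" give different answers.
const disagreements = sampleIds.filter(
  (id) => (!excluded.test(id)) !== included.test(id)
);

console.log(disagreements); // prints []
```

Because the lookahead only inspects the prefix, the trailing `.*$` of the original pattern is not needed in the rewritten form.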
   
   
   ## Performance
   
   As an example of the difference between the two approaches, the two 
following queries count the number of documents that match the regex filter. 
Using `$not` in the regex:
   ```
   > db.nodes.find({ $and: [
       { "_modified": { "$gte": 0 } },
       { _id: { $not: { $regex: /^[0-9]{1,3}:\/content\/dam\/.*$/ } } },
       { _id: { $not: { $regex: /^[0-9]{1,3}:\/oak:index\/.*$/ } } }
     ] }).sort({ "_modified": 1 }).count()
   15338210
   ```
   Elapsed time: 4m18s
   
   Using a negated regex to turn the Mongo filter into a positive filter:
   ```
   > db.nodes.find({ $and: [
       { "_modified": { "$gte": 0 } },
       { _id: { $regex: /^(?![0-9]{1,3}:\/content\/dam\/)/ } },
       { _id: { $regex: /^(?![0-9]{1,3}:\/oak:index\/)/ } }
     ] }).sort({ "_modified": 1 }).count()
   15338210
   ```
   Elapsed time: 39s
   
   
   The plan for the query with `$not` fetches every document and then applies 
the regex filters:
   ```
     stage: 'FETCH',
     filter: {
       '$and': [
         {
           _id: {
             '$not': { '$regex': '^[0-9]{1,3}:\\/content\\/dam\\/.*$' }
           }
         },
         {
           _id: { '$not': { '$regex': '^[0-9]{1,3}:\\/oak:index\\/.*$' } }
         }
       ]
     },
     inputStage: {
       stage: 'IXSCAN',
       keyPattern: { _modified: 1, _id: 1 },
       indexName: '_modified_1__id_1',
       isMultiKey: false,
       multiKeyPaths: { _modified: [], _id: [] },
       isUnique: false,
       isSparse: false,
       isPartial: false,
       indexVersion: 2,
       direction: 'forward',
       indexBounds: { _modified: [ '[0, inf.0]' ], _id: [ '[MinKey, MaxKey]' ] }
     }
   ```
   
   The plan for the query with the negated regex, by contrast, applies the 
filter during the index scan and fetches only the documents that match:
   
   ```
     stage: 'FETCH',
     inputStage: {
       stage: 'IXSCAN',
       filter: {
         '$and': [
           {
             _id: { '$regex': '^(?![0-9]{1,3}:\\/content\\/dam\\/)' }
           },
           { _id: { '$regex': '^(?![0-9]{1,3}:\\/oak:index\\/)' } }
         ]
       },
       keyPattern: { _modified: 1, _id: 1 },
       indexName: '_modified_1__id_1',
       isMultiKey: false,
       multiKeyPaths: { _modified: [], _id: [] },
       isUnique: false,
       isSparse: false,
       isPartial: false,
       indexVersion: 2,
       direction: 'forward',
       indexBounds: { _modified: [ '[0, inf.0]' ], _id: [ '["", {})' ] }
     }
   ```
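One way to check where the filter ends up is to walk the plan and report which stages carry a `filter`. The sketch below does this on trimmed copies of the two plans shown above. `stagesWithFilter` is a hypothetical helper, and it only follows single `inputStage` links, which is enough for these plans but not for plans that branch.

```javascript
// Hypothetical helper: returns the names of the stages that carry a `filter`,
// outermost stage first. Only follows `inputStage`, so branching plans are
// not covered by this sketch.
function stagesWithFilter(plan) {
  const stages = [];
  for (let node = plan; node; node = node.inputStage) {
    if (node.filter) stages.push(node.stage);
  }
  return stages;
}

// Trimmed copy of the plan for the query with $not.
const notPlan = {
  stage: 'FETCH',
  filter: { _id: { '$not': { '$regex': '^[0-9]{1,3}:\\/content\\/dam\\/.*$' } } },
  inputStage: { stage: 'IXSCAN', indexName: '_modified_1__id_1' },
};

// Trimmed copy of the plan for the query with the negated regex.
const lookaheadPlan = {
  stage: 'FETCH',
  inputStage: {
    stage: 'IXSCAN',
    filter: { _id: { '$regex': '^(?![0-9]{1,3}:\\/content\\/dam\\/)' } },
    indexName: '_modified_1__id_1',
  },
};

console.log(stagesWithFilter(notPlan));       // prints [ 'FETCH' ]
console.log(stagesWithFilter(lookaheadPlan)); // prints [ 'IXSCAN' ]
```

A filter on `IXSCAN` means documents are rejected from index keys alone, while a filter on `FETCH` means every document in the index bounds is loaded first.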


