On 2026-03-13, David Bremner wrote:
> At this point I haven't looked at how to support regex and extended
> globs in the same field. I do wonder how much of peoples regex use
> could be replaced by extended globs.

FWIW, it looks like almost all the regexp search uses in the notmuch git
repo could be translated exactly into a glob format.

It'd be possible to handle regex in a similar way to wildcards in
Xapian.  I implemented glob-style as it built upon the existing support
for a trailing `*` wildcard, and seems more widely useful since it is
easier to explain the syntax to less technical users.

The actual regex matching could be left to a regex library, but
it helps efficiency to know any fixed prefix on the pattern (because
then we can skip to the first term with that prefix, and stop when
we pass the last term with that prefix - this avoids even considering a
lot of terms which are never going to match).
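As a rough illustration of that range scan, here is a minimal Python sketch.
It is not Xapian code: the sorted list stands in for the term list, the
function name `scan_terms` and the `literal` argument are hypothetical, and
it assumes ASCII terms so that Python's string ordering agrees with byte
ordering.

```python
import bisect
import re

FIELD_PREFIX = "XFROM"  # term prefix for from:, as in the example in this thread


def scan_terms(sorted_terms, literal, pattern):
    """Return (matches, tested): the terms whose value matches `pattern`,
    and how many terms were actually handed to the regex engine.

    `literal` is the fixed text every match must start with (a hypothetical
    input here; it would have to be derived from the pattern somehow).
    """
    regex = re.compile(pattern)
    fixed = FIELD_PREFIX + literal
    matches, tested = [], 0
    # Skip directly to the first term that sorts at or after the fixed prefix.
    start = bisect.bisect_left(sorted_terms, fixed)
    for term in sorted_terms[start:]:
        if not term.startswith(fixed):
            break  # sorted order: no later term can carry the prefix
        tested += 1
        if regex.search(term[len(FIELD_PREFIX):]):
            matches.append(term)
    return matches, tested


terms = sorted([
    "XFROMalice", "XFROMbremner", "XFROMcarl", "XFROMcarol",
    "XFROMcworth", "XFROMdavid", "XTObremner",
])
print(scan_terms(terms, "c", r"^c(arl|worth)"))
# (['XFROMcarl', 'XFROMcworth'], 3)
```

Only the three terms starting with "XFROMc" reach the regex engine; the
other four are never looked at.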

Less importantly, knowing any fixed suffix and/or the minimum and/or
maximum length that can match also allows quickly eliminating more
terms.

In some cases these pre-checks would mean we never actually need to do
a regex check - e.g. for from:/^bremner/ we only need to check that the
term is at least 12 bytes and starts with "XFROMbremner" (which is what
happens for from:bremner*).
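The pre-checks described above could be bundled roughly as follows. This is
only a sketch: the `TermFilter` class and its fields are hypothetical names,
the values in the three examples are worked out by hand rather than derived
from the patterns, and lengths are counted in characters on the assumption
that terms are ASCII.

```python
import re
from dataclasses import dataclass
from typing import Optional

FIELD_PREFIX = "XFROM"


@dataclass
class TermFilter:
    """Cheap checks to run on a term before (or instead of) a regex."""
    prefix: str = ""                    # fixed start, including the field prefix
    suffix: str = ""                    # fixed end
    min_len: int = 0                    # shortest term that can match
    max_len: Optional[int] = None       # longest term that can match, if bounded
    regex: Optional[re.Pattern] = None  # None: the cheap checks decide alone

    def matches(self, term: str) -> bool:
        if len(term) < self.min_len:
            return False
        if self.max_len is not None and len(term) > self.max_len:
            return False
        if not (term.startswith(self.prefix) and term.endswith(self.suffix)):
            return False
        if self.regex is None:
            return True
        return self.regex.search(term[len(FIELD_PREFIX):]) is not None


# from:/^bremner/ : a prefix test plus a length test is the whole match.
anchored = TermFilter(prefix="XFROMbremner", min_len=12)

# from:/@example\.org$/ : a fixed 12-byte suffix, again with no regex needed.
by_domain = TermFilter(prefix="XFROM", suffix="@example.org", min_len=17)

# from:/^c(arl|worth)$/ : values are 4 to 6 bytes, so terms are 9 to 11 bytes;
# the regex only runs on terms that survive the cheap checks.
bounded = TermFilter(prefix="XFROMc", min_len=9, max_len=11,
                     regex=re.compile(r"^c(arl|worth)$"))
```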

Looking at the PCRE API, it seems you can get the min length (via
pcre_study() then pcre_fullinfo() with PCRE_INFO_MINLENGTH), but I don't
see a way to get the other things.  We can mostly rely on a regex
library to be optimised for checking strings, but not being able to
skip to the first term for a front-anchored regex is a shame.  Some
common cases could be handled with a very simple parser (e.g. just
checking for `^` followed by 1 or more non-special characters would
optimise from:/^bremner/ and from:/^c(arl|worth)/ though would miss the
chance to optimise from:/^(carl|cworth)/).
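A very simple parser of that kind might look like the Python sketch below.
The function names are hypothetical and it deliberately errs on the side of
returning an empty prefix. Two details go beyond the description above and
are assumptions about what a safe version would need: a literal followed by
`*`, `?` or `{` is dropped because it may match zero times, and any
alternation outside parentheses disables the optimisation, since
/^abc|def/ can match without the "abc" prefix.

```python
SPECIAL = set(".^$*+?()[]{}|\\")


def has_top_level_alternation(rest):
    """True if `rest` contains a `|` outside any group or character class."""
    depth, in_class, i = 0, False, 0
    while i < len(rest):
        ch = rest[i]
        if ch == "\\":
            i += 2  # skip the escaped character
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
            # A "]" right after "[" or "[^" is a literal member of the class.
            if rest[i + 1:i + 2] == "^":
                i += 1
            if rest[i + 1:i + 2] == "]":
                i += 1
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "|" and depth == 0:
            return True
        i += 1
    return False


def literal_prefix(pattern):
    """Return the fixed text every match must start with, or "" if unknown."""
    if not pattern.startswith("^"):
        return ""
    i = 1
    while i < len(pattern) and pattern[i] not in SPECIAL:
        i += 1
    literal, rest = pattern[1:i], pattern[i:]
    if rest[:1] in ("*", "?", "{"):
        literal = literal[:-1]  # the last character may match zero times
    if has_top_level_alternation(rest):
        return ""
    return literal


for p in ["^bremner", "^c(arl|worth)", "^(carl|cworth)", "^ab*c", "^abc|def"]:
    print(p, "->", repr(literal_prefix(p)))
```

For the three patterns mentioned above this yields "bremner", "c" and ""
respectively, so the last one falls back to checking every term in the field.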

There's also a potential for denial of service attacks via crafted regex
queries which backtrack excessively.  This is not really a concern in
notmuch, but it is for a web search.  We're matching terms which can't
be more than 245 bytes, which may be too short to cause a problem, but
that needs some investigation.  There may also be ways to limit
backtracking to a sensible amount without affecting non-malicious
queries (e.g. PCRE's match_limit or match_limit_recursion).

If someone's interested in working on this, I can point out the relevant
parts of the code.

Cheers,
    Olly
