Mark Harwood created LUCENE-9370:
------------------------------------
Summary: RegExpQuery should error for inappropriate use of \
character in input
Key: LUCENE-9370
URL: https://issues.apache.org/jira/browse/LUCENE-9370
Project: Lucene - Core
Issue Type: Bug
Components: core/search
Affects Versions: master (9.0)
Reporter: Mark Harwood
The RegExp class is too lenient in parsing user input, which can confuse or
mislead users and cause backwards-compatibility issues as we enhance regex
support.
In normal regular expression syntax the backslash is used to:
* escape a reserved character like \.
* use certain unreserved characters in a shorthand context e.g. \d means
digits [0-9]
The leniency bug in RegExp is that it adds an extra rule to this list: any
backslashed character that doesn't satisfy the above rules is taken literally.
For example, there's no reason to put a backslash in front of the letter "p"
but we accept \p as the letter p.
Java's Pattern class will throw a parse exception given a meaningless backslash
like \p.
We should too.
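The JDK behaviour described above can be checked with a minimal sketch like the one below. It uses \y rather than \p as the meaningless escape, because in java.util.regex \p is the prefix for character property classes such as \p{Lower}. The class and method names are illustrative only and are not part of Lucene or the JDK.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Illustrative demo class; the names here are assumptions made for this example.
class StrictBackslashDemo {

    // Returns true if java.util.regex.Pattern accepts the expression,
    // false if it rejects it with a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("\\."));  // escaped reserved character
        System.out.println(compiles("\\d"));  // shorthand for digits
        System.out.println(compiles("\\y"));  // backslash with no defined meaning
    }
}
```

The JDK documentation states that a backslash before an alphabetic character that does not denote an escaped construct is an error, with such sequences reserved for future extensions. That is the same reasoning that applies to RegExp here.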
In [LUCENE-9336|https://issues.apache.org/jira/browse/LUCENE-9336] we added
support for commonly supported regex shorthands like `\d`. Sadly this is a
breaking change, because the leniency previously allowed `\d` to be accepted as
the letter d without an exception. Users who wrote `\d` were likely silently
missing the results they were hoping for, and we created a
backwards-compatibility problem for ourselves by filling in the gaps.
I propose we follow other regex parsers and throw an error on inappropriate
use of backslashes.
This would be another breaking change, so it should target 9.0.
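As a rough illustration of the proposed strictness, the sketch below rejects any backslash that is not followed by a reserved character or a known shorthand letter. It is a hypothetical standalone check, not the actual RegExp parser: the class name, the RESERVED and SHORTHAND sets, and the choice of IllegalArgumentException are all assumptions made for the example, and it ignores details such as quoted strings and supplementary code points.

```java
// Hypothetical sketch of strict backslash handling; not Lucene's RegExp code.
class StrictEscapeCheck {

    // Assumed set of reserved characters that may legitimately be escaped.
    private static final String RESERVED = ".?+*|{}[]()\"\\#@&<>~-^";

    // Assumed set of shorthand letters (e.g. \d for digits).
    private static final String SHORTHAND = "dDsSwW";

    // Throws IllegalArgumentException on a backslash with no defined meaning.
    static void validate(String regex) {
        for (int i = 0; i < regex.length(); i++) {
            if (regex.charAt(i) != '\\') {
                continue;
            }
            if (i + 1 >= regex.length()) {
                throw new IllegalArgumentException(
                    "trailing backslash at position " + i);
            }
            char next = regex.charAt(i + 1);
            if (RESERVED.indexOf(next) < 0 && SHORTHAND.indexOf(next) < 0) {
                throw new IllegalArgumentException(
                    "invalid escape \\" + next + " at position " + i);
            }
            i++; // skip the escaped character so "\\\\p" is read as '\' then 'p'
        }
    }

    public static void main(String[] args) {
        validate("a\\.b");  // accepted: escaped reserved character
        validate("\\d+");   // accepted: shorthand
        try {
            validate("\\p"); // rejected: no reason to escape the letter p
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing fast in this way keeps unassigned escapes free for future shorthand additions without silently changing the meaning of existing queries.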