[
https://issues.apache.org/jira/browse/LUCENE-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563380#comment-17563380
]
Andriy Redko edited comment on LUCENE-10642 at 7/6/22 6:33 PM:
---------------------------------------------------------------
Thanks for checking it [~uschindler], the common replacements \t \n \r do
work. Indeed, the error was not thrown before but now it is (so the impact of
using escape sequences is more apparent). The error is also confusing because
the implementation explicitly references the javadoc with character classes and
escape sequences but does not detect the latter properly. From the user
perspective, it is non-intuitive why character classes should be denoted
with two backslashes \\ but escape sequences with one \. I think we could make
it more convenient for users by allowing escape sequences to be used the same
way as character classes (at least, this is the way the javadoc describes it).
Anyway, the fix seems to be simple, but please feel free to close the issue if
there is no interest in supporting that. Thank you!
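
To illustrate the one-versus-two-backslash distinction, here is a minimal sketch using plain java.util.regex (not Lucene's RegExp, whose behavior is what this issue is about). The class name EscapeVsClassDemo and the helper fullMatch are made up for this example.

```java
import java.util.regex.Pattern;

public class EscapeVsClassDemo {
    // Hypothetical helper: true when `regex` matches the whole of `input`.
    public static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // "\\n" reaches the regex engine as backslash + 'n'; the engine
        // resolves the escape sequence itself and matches a newline.
        System.out.println(fullMatch("\\n", "\n")); // true
        // "\n" is turned into a newline character by the Java compiler,
        // so the regex engine only ever sees a literal newline.
        System.out.println(fullMatch("\n", "\n"));  // true
        // "\\d" is a predefined character class resolved by the regex engine.
        System.out.println(fullMatch("\\d", "7"));  // true
    }
}
```

With java.util.regex both spellings of the newline match, which is the behavior the referenced javadoc describes for escape sequences and character classes alike.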
> Regexp query: escape sequences are treated as character classes
> ---------------------------------------------------------------
>
> Key: LUCENE-10642
> URL: https://issues.apache.org/jira/browse/LUCENE-10642
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 9.0, 9.1, 9.2, 9.3
> Reporter: Andriy Redko
> Priority: Major
>
> An interesting issue has been reported to the OpenSearch project [1], which
> has been caused by [2], [3]. In a nutshell, the regression is causing escape
> sequences (like \n, \r, \t, ...) to be treated as character classes
> (specifically,
> [https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#bs]).
> The problematic function is RegExp::matchPredefinedCharacterClass, which does
> not consider characters that denote an escaped construct. A simple test to
> reproduce, which fails with
> IllegalArgumentException("invalid character class"):
>
> {noformat}
> public class TestRegexpQuery extends LuceneTestCase {
>   public void testEscapeSequences() throws IOException {
>     assertEquals(1, regexQueryNrHits("\\n"));
>     assertEquals(1, regexQueryNrHits("[\\n]"));
>   }
> }
> {noformat}
>
> [1] [https://github.com/opensearch-project/OpenSearch/issues/3781]
> [2]
> [https://github.com/apache/lucene/commit/1efce5444dd40142c55c5a3a30eeebc7b86796c3]
> [3]
> [https://github.com/apache/lucene/commit/819e668ce2fcfcf86b652a191cdbe0fad0a8ffce]
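
As a rough illustration of the distinction described in the issue, the sketch below classifies the character that follows a backslash as a predefined character class, an escape sequence, or a literal. It is not Lucene's code: the class EscapeResolverSketch, the Kind enum and both methods are hypothetical, only the letters mentioned in the issue and a few common classes are covered, and treating everything else as a literal is an assumption.

```java
public class EscapeResolverSketch {
    /** What the character after a backslash denotes in this sketch. */
    public enum Kind { CHARACTER_CLASS, ESCAPE_SEQUENCE, LITERAL }

    /** Hypothetical helper: classifies the character following a backslash. */
    public static Kind classify(char next) {
        switch (next) {
            case 'd': case 'D': case 's': case 'S': case 'w': case 'W':
                return Kind.CHARACTER_CLASS;  // predefined classes such as \d, \s, \w
            case 'n': case 'r': case 't':
                return Kind.ESCAPE_SEQUENCE;  // the escapes named in the issue
            default:
                return Kind.LITERAL;          // assumption: anything else is literal
        }
    }

    /** Hypothetical helper: maps an escape letter to the code point it denotes. */
    public static int resolveEscape(char next) {
        switch (next) {
            case 'n': return '\n';
            case 'r': return '\r';
            case 't': return '\t';
            default:
                throw new IllegalArgumentException("not an escape sequence: \\" + next);
        }
    }

    public static void main(String[] args) {
        System.out.println(classify('n')); // ESCAPE_SEQUENCE
        System.out.println(classify('d')); // CHARACTER_CLASS
        System.out.println(classify('x')); // LITERAL
    }
}
```

Checking for escape sequences separately from character classes would let a parser resolve \n to a newline instead of rejecting it as an invalid character class.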