[jira] Issue Comment Edited: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

Steven Rowe (JIRA) Thu, 16 Jul 2009 11:14:45 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732060#action_12732060
 ]


Steven Rowe edited comment on LUCENE-1683 at 7/16/09 11:12 AM:
---------------------------------------------------------------

bq. ... why is RegexQuery treating the trailing "." as a ".*" instead? 

JavaUtilRegexCapabilities.match() is implemented as 
j.u.regex.Matcher.lookingAt(), which is equivalent to adding a trailing ".*", 
unless you explicitly append a "$" to the pattern.

By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), 
which does not imply the trailing ".*".

The difference in the two implementations implies this is a kind of bug, 
especially since the javadoc "contract" on RegexCapabilities.match() just says 
"@return true if string matches the pattern last passed to compile".

The fix is to switch JavaUtilRegexCapabilities.match() to use 
j.u.regex.Matcher.matches() instead of lookingAt().
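
For illustration, here is a standalone sketch of how the two Matcher calls 
differ on the terms from the issue description. It uses plain java.util.regex 
only, no Lucene classes; the class and method names are made up for this 
example and are not part of the contrib code.

```java
import java.util.regex.Pattern;

class AnchoringDemo {
    // Prefix match: the pattern must match starting at index 0,
    // but it does not have to consume the whole term.
    static boolean prefixMatch(String regex, String term) {
        return Pattern.compile(regex).matcher(term).lookingAt();
    }

    // Whole-string match: the pattern must consume the entire term.
    static boolean fullMatch(String regex, String term) {
        return Pattern.compile(regex).matcher(term).matches();
    }

    public static void main(String[] args) {
        for (String term : new String[] {"cat", "cats", "cathy"}) {
            System.out.println(term
                    + ": lookingAt=" + prefixMatch("cat.", term)
                    + ", matches=" + fullMatch("cat.", term));
        }
    }
}
```

With "cat.", both calls reject "cat" and accept "cats", but only lookingAt() 
accepts "cathy", which is the behaviour reported in the issue.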

> RegexQuery matches terms the input regex doesn't actually match
> ---------------------------------------------------------------
>
>                 Key: LUCENE-1683
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1683
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.3.2
>            Reporter: Trejkaz
>
> I was writing some unit tests for our own wrapper around the Lucene regex 
> classes, and got tripped up by something interesting.
> The regex "cat." will match "cats" but also anything with "cat" and 1+ 
> following letters (e.g. "cathy", "catcher", ...)  It is as if there is an 
> implicit .* always added to the end of the regex.
> Here's a unit test for the behaviour I would expect myself:
>     @Test
>     public void testNecessity() throws Exception {
>         File dir = new File(
>                 new File(System.getProperty("java.io.tmpdir")), "index");
>         IndexWriter writer =
>                 new IndexWriter(dir, new StandardAnalyzer(), true);
>         try {
>             Document doc = new Document();
>             doc.add(new Field("field", "cat cats cathy",
>                     Field.Store.YES, Field.Index.TOKENIZED));
>             writer.addDocument(doc);
>         } finally {
>             writer.close();
>         }
>         IndexReader reader = IndexReader.open(dir);
>         try {
>             TermEnum terms =
>                     new RegexQuery(new Term("field", "cat.")).getEnum(reader);
>             // Compare the term text; term() itself returns a Term object.
>             assertEquals("Wrong term", "cats", terms.term().text());
>             assertFalse("Should have only been one term", terms.next());
>         } finally {
>             reader.close();
>         }
>     }
> This test fails on the term check with terms.term() equal to "cathy".
> Our workaround is to mangle the query like this:
>     String fixed = String.format("(?:%s)$", original);
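
The workaround quoted above can be checked in isolation. This is a minimal 
sketch, assuming the term check boils down to Matcher.lookingAt() as described 
in the comment; the class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

class WorkaroundDemo {
    // Wraps the user's regex in a non-capturing group and anchors it at the
    // end, mirroring String.format("(?:%s)$", original) from the issue.
    static String anchor(String original) {
        return String.format("(?:%s)$", original);
    }

    // Stand-in for the lookingAt()-based term check (an assumption here).
    static boolean lookingAt(String regex, String term) {
        return Pattern.compile(regex).matcher(term).lookingAt();
    }

    public static void main(String[] args) {
        String fixed = anchor("cat.");
        System.out.println(fixed + " vs cats:  " + lookingAt(fixed, "cats"));
        System.out.println(fixed + " vs cathy: " + lookingAt(fixed, "cathy"));
        // Without the group, "$" would bind only to the last alternative.
        System.out.println("cat|dog$ vs cathy: " + lookingAt("cat|dog$", "cathy"));
        System.out.println(anchor("cat|dog") + " vs cathy: "
                + lookingAt(anchor("cat|dog"), "cathy"));
    }
}
```

The non-capturing group is what keeps the anchor correct for alternations: a 
bare "cat|dog$" anchors only the "dog" branch, so it would still accept "cathy".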

