[
https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378612#comment-16378612
]
Uwe Schindler commented on LUCENE-8186:
---------------------------------------
I think the main problem is that "normalizing" is defined to apply only the
token filters to a single term, without doing any tokenization!
The problem in your example is LowerCaseTokenizer, which does two things at the
same time. IMHO, LowerCaseTokenizer should be deprecated and removed. It is
always replaceable by LetterTokenizer combined with LowerCaseFilter. There
should be no performance difference anymore, as both do the same work; it's
just an additional method call, and the loop is executed twice instead of once.
But the expensive work is the same.
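For illustration, a minimal sketch of that replacement (not taken from this
issue; it assumes the stock LetterTokenizerFactory and LowerCaseFilterFactory
and the CustomAnalyzer builder API). Because the lowercasing now lives in a
token filter, Analyzer#normalize applies it to the single term, so multiterm
queries get lowercased as well:
{noformat}
// Sketch only: LetterTokenizer + LowerCaseFilter instead of LowerCaseTokenizer.
// Class and method names assume Lucene's stock analysis factories.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LetterTokenizerFactory;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.util.BytesRef;

public class LetterPlusLowerCaseSketch {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer(LetterTokenizerFactory.class)   // tokenization only
        .addTokenFilter(LowerCaseFilterFactory.class)  // lowercasing as a filter
        .build();
    // normalize() runs the token filter chain on the single term, so the
    // filter-based lowercasing also reaches prefix/wildcard query terms:
    BytesRef normalized = analyzer.normalize("f", "Hello");
    System.out.println(normalized.utf8ToString());     // expected: hello
  }
}
{noformat}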
> CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
> ------------------------------------------------------------------------------
>
> Key: LUCENE-8186
> URL: https://issues.apache.org/jira/browse/LUCENE-8186
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Minor
>
> While I was working on SOLR-12034, a unit test that relied on the
> LowerCaseTokenizerFactory failed.
> After some digging, I was able to replicate this at the Lucene level.
> Unit test:
> {noformat}
>   @Test
>   public void testLCTokenizerFactoryNormalize() throws Exception {
>     Analyzer analyzer = CustomAnalyzer.builder()
>         .withTokenizer(LowerCaseTokenizerFactory.class).build();
>     //fails
>     assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));
>
>     //now try an integration test with the classic query parser
>     QueryParser p = new QueryParser("f", analyzer);
>     Query q = p.parse("Hello");
>     //passes
>     assertEquals(new TermQuery(new Term("f", "hello")), q);
>     q = p.parse("Hello*");
>     //fails
>     assertEquals(new PrefixQuery(new Term("f", "hello")), q);
>     q = p.parse("Hel*o");
>     //fails
>     assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
>   }
> {noformat}
> The problem is that CustomAnalyzer's normalize() iterates through the token
> filters but does not invoke the tokenizer, which, in the case of
> LowerCaseTokenizer, is where the lowercasing actually happens.