[
https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378612#comment-16378612
]
Uwe Schindler commented on LUCENE-8186:
---------------------------------------
I think the main problem is that "normalizing" is defined to apply only the
token filters to a single term, without doing any tokenization!
The problem in your example is LowerCaseTokenizer, which does two things at the
same time. IMHO, LowerCaseTokenizer should be deprecated and removed. It is
always replaceable by LetterTokenizer combined with LowerCaseFilter. There
should be no performance difference anymore, as both do the same work; it's
just an additional method call, and the loop is executed twice instead of once.
But the expensive work is the same.
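For illustration, a minimal sketch of that replacement (not taken from this
issue; it assumes the stock LetterTokenizerFactory and LowerCaseFilterFactory
and the CustomAnalyzer builder API). Because the lowercasing now lives in a
token filter, Analyzer#normalize applies it to the single term, so multiterm
queries get lowercased as well:
{noformat}
// Sketch only: LetterTokenizer + LowerCaseFilter instead of LowerCaseTokenizer.
// Class and method names assume Lucene's stock analysis factories.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LetterTokenizerFactory;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.util.BytesRef;

public class LetterPlusLowerCaseSketch {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer(LetterTokenizerFactory.class)   // tokenization only
        .addTokenFilter(LowerCaseFilterFactory.class)  // lowercasing as a filter
        .build();
    // normalize() runs the token filter chain on the single term, so the
    // filter-based lowercasing also reaches prefix/wildcard query terms:
    BytesRef normalized = analyzer.normalize("f", "Hello");
    System.out.println(normalized.utf8ToString());     // expected: hello
  }
}
{noformat}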
> CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
> ------------------------------------------------------------------------------
>
> Key: LUCENE-8186
> URL: https://issues.apache.org/jira/browse/LUCENE-8186
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Minor
>
> While I was working on SOLR-12034, a unit test that relied on the
> LowerCaseTokenizerFactory failed.
> After some digging, I was able to replicate this at the Lucene level.
> Unit test:
> {noformat}
>   @Test
>   public void testLCTokenizerFactoryNormalize() throws Exception {
>     Analyzer analyzer = CustomAnalyzer.builder()
>         .withTokenizer(LowerCaseTokenizerFactory.class).build();
>     //fails
>     assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));
>
>     //now try an integration test with the classic query parser
>     QueryParser p = new QueryParser("f", analyzer);
>     Query q = p.parse("Hello");
>     //passes
>     assertEquals(new TermQuery(new Term("f", "hello")), q);
>     q = p.parse("Hello*");
>     //fails
>     assertEquals(new PrefixQuery(new Term("f", "hello")), q);
>     q = p.parse("Hel*o");
>     //fails
>     assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
>   }
> {noformat}
> The problem is that CustomAnalyzer's normalize() iterates through the token
> filters but does not invoke the tokenizer, which, in the case of
> LowerCaseTokenizer, is where the lowercasing actually happens.