[ https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385323#comment-16385323 ]

Robert Muir commented on LUCENE-8186:
-------------------------------------

Yeah, the biggest issue I see is the lack of type safety. Currently the method 
is declared on an interface like this:

{code}
public AbstractAnalysisFactory getMultiTermComponent();
{code}

This means a CharFilterFactory can return a TokenizerFactory, or other crazy 
possibilities. Users will get a ClassCastException in such cases. This is all 
unrelated to this issue, but it's horrible.
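
To make the hazard concrete, here is a minimal sketch of caller code under the 
current API (a hypothetical helper, not the actual CustomAnalyzer source); the 
blind cast is unavoidable:

{code}
import org.apache.lucene.analysis.util.AbstractAnalysisFactory;
import org.apache.lucene.analysis.util.MultiTermAwareComponent;
import org.apache.lucene.analysis.util.TokenFilterFactory;

// Hypothetical caller, for illustration only: the untyped return value
// forces a blind cast on every consumer of getMultiTermComponent().
static TokenFilterFactory multiTermVariant(TokenFilterFactory factory) {
  if (factory instanceof MultiTermAwareComponent) {
    AbstractAnalysisFactory mtc =
        ((MultiTermAwareComponent) factory).getMultiTermComponent();
    // nothing stops a factory from returning, say, a TokenizerFactory here,
    // in which case this cast throws ClassCastException at runtime
    return (TokenFilterFactory) mtc;
  }
  return factory;
}
{code}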

IMO it would be better if the API worked differently, e.g. three methods that 
enforce the correct return type. This would remove the casts and prevent stupid 
stuff from happening in the factories themselves.

{code}
TokenizerFactory:
  public TokenFilterFactory getMultiTermComponent() { return null; }
TokenFilterFactory:
  public TokenFilterFactory getMultiTermComponent() { return null; }
CharFilterFactory:
  public CharFilterFactory getMultiTermComponent() { return null; }
{code}
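
With typed overrides like these, the normalization code needs no casts at all. 
A sketch of a consumer, assuming the three methods above exist and that a null 
return means "no multi-term processing":

{code}
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

// Sketch only: relies on the proposed typed getMultiTermComponent(),
// which does not exist in the current API.
static TokenStream applyMultiTermFilters(TokenStream ts, TokenFilterFactory[] filters) {
  for (TokenFilterFactory factory : filters) {
    TokenFilterFactory mtc = factory.getMultiTermComponent(); // typed, no cast
    if (mtc != null) {
      ts = mtc.create(ts);
    }
  }
  return ts;
}
{code}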


> CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms 
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-8186
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8186
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: LUCENE-8186.patch
>
>
> While I was working on SOLR-12034, a unit test that relied on 
> LowerCaseTokenizerFactory failed.
> After some digging, I was able to replicate the problem at the Lucene level.
> Unit test:
> {noformat}
>   @Test
>   public void testLCTokenizerFactoryNormalize() throws Exception {
>     Analyzer analyzer = CustomAnalyzer.builder()
>         .withTokenizer(LowerCaseTokenizerFactory.class)
>         .build();
>     //fails
>     assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));
>     
>     //now try an integration test with the classic query parser
>     QueryParser p = new QueryParser("f", analyzer);
>     Query q = p.parse("Hello");
>     //passes
>     assertEquals(new TermQuery(new Term("f", "hello")), q);
>     q = p.parse("Hello*");
>     //fails
>     assertEquals(new PrefixQuery(new Term("f", "hello")), q);
>     q = p.parse("Hel*o");
>     //fails
>     assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
>   }
> {noformat}
> The problem is that CustomAnalyzer's normalization iterates through the token 
> filters but never calls the tokenizer's multi-term component, and in the case 
> of LowerCaseTokenizer it is the tokenizer that does the lowercasing.
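> A rough sketch of the normalization wrapping in question (simplified; not the 
> exact CustomAnalyzer source):
> {noformat}
>   TokenStream normalize(TokenStream in) {
>     TokenStream result = in;
>     // each token filter's multi-term component is applied ...
>     for (TokenFilterFactory factory : tokenFilters) {
>       if (factory instanceof MultiTermAwareComponent) {
>         result = ((TokenFilterFactory)
>             ((MultiTermAwareComponent) factory).getMultiTermComponent()).create(result);
>       }
>     }
>     // ... but the tokenizer factory (here LowerCaseTokenizerFactory) is never
>     // asked for its multi-term component, so its lowercasing never runs
>     return result;
>   }
> {noformat}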


