Offset bug in WordDelimiterFilter?

Markus Jelsma Tue, 06 Dec 2016 03:28:17 -0800

Hello, I noticed something peculiar running Lucene/Solr 6.3.0.

The plural vaccinatieprogramma's should have a startOffset of 0 and an endOffset 
of 21 when passed through WordDelimiterFilter and/or stemmers, but it doesn't, 
which slightly messes up highlighted terms.


    wdf = new WordDelimiterFilter(
        new CannedTokenStream(new Token("vaccinatieprogramma's", 0, 21)),
        DEFAULT_WORD_DELIM_TABLE, flags, null);

    assertTokenStreamContents(wdf,
        new String[] { "vaccinatieprogramma"},
        new int[] { 0 },
        new int[] { 21 });

   [junit4] Suite: org.apache.lucene.analysis.miscellaneous.TestWordDelimiterFilter
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestWordDelimiterFilter -Dtests.method=testOffsets -Dtests.seed=21AB10650E10CEB9 -Dtests.slow=true -Dtests.locale=bg-BG -Dtests.timezone=Etc/GMT+10 -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
   [junit4] FAILURE 0.06s | TestWordDelimiterFilter.testOffsets <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: endOffset 0 expected:<21> but was:<19>

I would expect the same behaviour as with stemmers: the offsets of the emitted 
term always span the original term. So if a user queries for a singular term, 
the whole plural (original) is highlighted.
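
To make the arithmetic concrete, here is a minimal standalone sketch of the two offset behaviours. It does not use Lucene at all: OffsetSketch, Tok, narrowOffsets and keepOffsets are hypothetical names made up for illustration, and the "strip a trailing 's" rule is only an assumption about what produces the observed endOffset of 19.

```java
/** Illustration only: a hypothetical model of token offsets, not Lucene code. */
final class OffsetSketch {

    /** Hypothetical stand-in for a token: term text plus offsets into the input. */
    static final class Tok {
        final String term;
        final int startOffset;
        final int endOffset;

        Tok(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        /** The slice of the original input that a highlighter would mark. */
        String highlighted(String input) {
            return input.substring(startOffset, endOffset);
        }
    }

    /** Strips a trailing "'s" and shrinks endOffset with it (matches the observed 19). */
    static Tok narrowOffsets(Tok in) {
        if (!in.term.endsWith("'s")) {
            return in;
        }
        String stripped = in.term.substring(0, in.term.length() - 2);
        return new Tok(stripped, in.startOffset, in.endOffset - 2);
    }

    /** Strips a trailing "'s" but keeps the original offsets (the expected 21). */
    static Tok keepOffsets(Tok in) {
        if (!in.term.endsWith("'s")) {
            return in;
        }
        String stripped = in.term.substring(0, in.term.length() - 2);
        return new Tok(stripped, in.startOffset, in.endOffset);
    }

    public static void main(String[] args) {
        String input = "vaccinatieprogramma's";
        Tok original = new Tok(input, 0, input.length()); // offsets 0..21

        Tok narrowed = narrowOffsets(original);
        Tok kept = keepOffsets(original);

        // Both emit the same term; only the highlighted span differs.
        System.out.println(narrowed.endOffset + " " + narrowed.highlighted(input));
        System.out.println(kept.endOffset + " " + kept.highlighted(input));
    }
}
```

With the narrowed offsets a highlighter marks only the first 19 characters and leaves the trailing 's outside the highlight; with the kept offsets it marks all 21.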

Am I missing something? Is this a bug?

Thanks,
Markus
