Hello - I noticed something peculiar when running Lucene/Solr 6.3.0.
The plural vaccinatieprogramma's should have a startOffset of 0 and an endOffset
of 21 when passed through WordDelimiterFilter and/or stemmers, but it doesn't,
which slightly messes up highlighted terms.
    wdf = new WordDelimiterFilter(
        new CannedTokenStream(new Token("vaccinatieprogramma's", 0, 21)),
        DEFAULT_WORD_DELIM_TABLE, flags, null);
    assertTokenStreamContents(wdf,
        new String[] { "vaccinatieprogramma" },
        new int[] { 0 },
        new int[] { 21 });
[junit4] Suite: org.apache.lucene.analysis.miscellaneous.TestWordDelimiterFilter
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestWordDelimiterFilter -Dtests.method=testOffsets -Dtests.seed=21AB10650E10CEB9 -Dtests.slow=true -Dtests.locale=bg-BG -Dtests.timezone=Etc/GMT+10 -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
[junit4] FAILURE 0.06s | TestWordDelimiterFilter.testOffsets <<<
[junit4] > Throwable #1: java.lang.AssertionError: endOffset 0 expected:<21> but was:<19>
I would expect the same behaviour as with stemmers: the offsets of the term
always span the length of the original term. So if a user queries for the
singular term, the whole plural (the original) is highlighted.
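To make the difference concrete, here is a minimal plain-Java sketch with no Lucene dependency. The OffsetSketch class and its highlight helper are hypothetical and only assume that a highlighter wraps the [startOffset, endOffset) span of the original text; they are not Lucene's highlighter API.

```java
class OffsetSketch {
    // Hypothetical helper: wraps text[start, end) in <em> tags, the way a
    // highlighter would use a token's startOffset and endOffset.
    static String highlight(String text, int start, int end) {
        return text.substring(0, start)
            + "<em>" + text.substring(start, end) + "</em>"
            + text.substring(end);
    }

    public static void main(String[] args) {
        String original = "vaccinatieprogramma's";

        // Length of the original token versus the token without the trailing 's.
        System.out.println(original.length());              // 21
        System.out.println("vaccinatieprogramma".length()); // 19

        // Expected: offsets 0-21 cover the whole original term.
        System.out.println(highlight(original, 0, 21));
        // Observed: offsets 0-19 leave the trailing 's outside the highlight.
        System.out.println(highlight(original, 0, 19));
    }
}
```

With an endOffset of 21 the whole plural is wrapped; with 19 the trailing 's falls outside the highlighted span, which is the effect I am seeing.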
Am I missing something? Is this a bug?
Thanks,
Markus