NGramTokenFilter may generate offsets that exceed the length of original text
-----------------------------------------------------------------------------
Key: LUCENE-2909
URL: https://issues.apache.org/jira/browse/LUCENE-2909
Project: Lucene - Java
Issue Type: Bug
Components: contrib/analyzers
Affects Versions: 2.9.4
Reporter: Shinya Kasatani
Priority: Minor
When using NGramTokenFilter combined with CharFilters that lengthen the
original text (such as "ß" -> "ss"), the generated offsets exceed the length of
the original text.
This causes InvalidTokenOffsetsException when you try to highlight the text in
Solr.
While it is not possible to know the accurate offset of each character once you
tokenize the whole text with tokenizers like KeywordTokenizer, NGramTokenFilter
should at least avoid generating invalid offsets.
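
The following is a minimal, Lucene-free sketch of the problem and of one
possible mitigation, not the actual filter code or a proposed patch. It
assumes the filter derives each gram's offsets by adding the gram's position
inside the term to the token's start offset, and it clamps the result to the
token's (already corrected) end offset. The class and method names
(NGramOffsetSketch, naiveOffsets, clampedOffsets) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; no Lucene classes are used.
class NGramOffsetSketch {

    // Offsets computed purely from positions in the (possibly lengthened)
    // term text. This is the assumed behavior that produces invalid offsets.
    static List<int[]> naiveOffsets(String term, int tokStart, int n) {
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            out.add(new int[] { tokStart + i, tokStart + i + n });
        }
        return out;
    }

    // Same offsets, but never allowed to pass the token's end offset in the
    // original text. The offsets are not accurate, only valid.
    static List<int[]> clampedOffsets(String term, int tokStart, int tokEnd, int n) {
        List<int[]> out = new ArrayList<>();
        for (int[] o : naiveOffsets(term, tokStart, n)) {
            out.add(new int[] { Math.min(o[0], tokEnd), Math.min(o[1], tokEnd) });
        }
        return out;
    }

    public static void main(String[] args) {
        // Original text "ß" (length 1) is rewritten to "ss" (length 2);
        // the single keyword token therefore spans offsets 0..1.
        String term = "ss";
        for (int[] o : naiveOffsets(term, 0, 1)) {
            System.out.println("naive   [" + o[0] + "," + o[1] + "]");
        }
        for (int[] o : clampedOffsets(term, 0, 1, 1)) {
            System.out.println("clamped [" + o[0] + "," + o[1] + "]");
        }
    }
}
```

With 1-grams over "ss", the naive calculation yields [0,1] and [1,2]; the
second end offset (2) lies past the end of the one-character original text.
The clamped variant yields [0,1] and [1,1], which stay within bounds.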