NGramTokenFilter may generate offsets that exceed the length of original text
-----------------------------------------------------------------------------

                 Key: LUCENE-2909
                 URL: https://issues.apache.org/jira/browse/LUCENE-2909
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/analyzers
    Affects Versions: 2.9.4
            Reporter: Shinya Kasatani
            Priority: Minor


When using NGramTokenFilter combined with CharFilters that lengthen the
original text (such as "ß" -> "ss"), the generated offsets exceed the length of
the original text.
This causes an InvalidTokenOffsetsException when you try to highlight the text
in Solr.
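
For illustration, here is a minimal reproduction sketch against the 2.9-era
analysis API (the constructor signatures, CharReader.get, and TermAttribute
are my recollection of that branch and should be treated as assumptions):

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class NGramOffsetRepro {
  public static void main(String[] args) throws Exception {
    String original = "Straße"; // 6 characters in the original text

    // "ß" -> "ss" lengthens the text seen by the tokenizer to 7 chars.
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("\u00DF", "ss");

    TokenStream ts = new NGramTokenFilter(
        new KeywordTokenizer(
            new MappingCharFilter(map, CharReader.get(new StringReader(original)))),
        2, 2);

    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // On affected versions the trailing bigram "se" is reported with
      // endOffset() == 7, past the end of the 6-character original text.
      System.out.println(termAtt.term() + " [" + offsetAtt.startOffset()
          + "," + offsetAtt.endOffset() + ")");
    }
  }
}
{code}

The highlighter then tries to extract characters 5..7 from the 6-character
stored field, which is what surfaces as the InvalidTokenOffsetsException.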

While a token filter cannot recover the exact original offset of each
character once a tokenizer such as KeywordTokenizer has consumed the whole
char-filtered text, NGramTokenFilter should at least avoid generating offsets
that point outside the original text. A clamping workaround is sketched below.
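
Until the filter itself is fixed, one conceivable stopgap (not an official
API; ClampOffsetsFilter is a made-up name for this sketch) is to clamp offsets
at the end of the chain to the length of the original text, which the
application knows even when the filter chain does not:

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

/** Hypothetical workaround: never let offsets point past the original text. */
public final class ClampOffsetsFilter extends TokenFilter {
  private final int maxOffset;
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  /** @param maxOffset length of the original (pre-CharFilter) text */
  public ClampOffsetsFilter(TokenStream in, int maxOffset) {
    super(in);
    this.maxOffset = maxOffset;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Clamp both ends so the highlighter never sees an offset beyond the text.
    offsetAtt.setOffset(Math.min(offsetAtt.startOffset(), maxOffset),
                        Math.min(offsetAtt.endOffset(), maxOffset));
    return true;
  }
}
{code}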

