[ https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991316#comment-12991316 ]
Robert Muir commented on LUCENE-2909:
-------------------------------------

Is the bug really in NGramTokenFilter? This seems to be a larger problem that would affect all token filters that break larger tokens into smaller ones and recalculate offsets, right? For example: EdgeNGramTokenFilter, ThaiWordFilter, SmartChineseAnalyzer's WordTokenFilter, etc.? I think WordDelimiterFilter has special code that might avoid the problem (line 352), so it might be OK.

Is there any better way we could solve this: for example, maybe instead of the tokenizer calling correctOffset(), it gets called somewhere else? This seems to be what is causing the problem.

> NGramTokenFilter may generate offsets that exceed the length of original text
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-2909
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2909
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>    Affects Versions: 2.9.4
>            Reporter: Shinya Kasatani
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>         Attachments: TokenFilterOffset.patch
>
>
> When using NGramTokenFilter combined with CharFilters that lengthen the
> original text (such as "ß" -> "ss"), the generated offsets exceed the length
> of the original text.
> This causes InvalidTokenOffsetsException when you try to highlight the text
> in Solr.
> While it is not possible to know the accurate offset of each character once
> you tokenize the whole text with tokenizers like KeywordTokenizer,
> NGramTokenFilter should at least avoid generating invalid offsets.

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
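[Editor's note] The offset arithmetic being discussed can be illustrated with a minimal plain-Java sketch (not Lucene code; the variable names and the simplified model of correctOffset() are assumptions for illustration). A char filter expands "ß" to "ss", the tokenizer's corrected token end points back into the 1-character original, but an n-gram filter that recomputes offsets from positions inside the *filtered* token text produces an end offset past the original text:

```java
// Sketch of the LUCENE-2909 offset problem, using plain Java.
public class OffsetSketch {
    public static void main(String[] args) {
        String original = "ß";   // original text, length 1
        String filtered = "ss";  // after a char filter that expands "ß" -> "ss", length 2

        // KeywordTokenizer emits one token covering the filtered text, and
        // correctOffset() maps its end back into the original text: end = 1.
        int tokenStart = 0;

        // An n-gram filter that derives offsets as tokenStart + position in the
        // filtered token text yields end offsets that can exceed the original length.
        int gramSize = 2;
        for (int i = 0; i + gramSize <= filtered.length(); i++) {
            int gramStart = tokenStart + i;
            int gramEnd = tokenStart + i + gramSize; // here: 2, but original.length() == 1
            System.out.println("gram \"" + filtered.substring(i, i + gramSize)
                    + "\" offsets (" + gramStart + "," + gramEnd + ")"
                    + (gramEnd > original.length() ? " <-- exceeds original length" : ""));
        }
    }
}
```

The bigram "ss" gets offsets (0,2) while the original text is only 1 character long, which is exactly the condition that trips InvalidTokenOffsetsException in the Solr highlighter.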