[jira] Updated: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text
[ https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2909:
--------------------------------

    Attachment: LUCENE-2909_assert.patch

Here's a check we can add to BaseTokenStreamTestCase for this condition.

> NGramTokenFilter may generate offsets that exceed the length of original text
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-2909
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2909
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>    Affects Versions: 2.9.4
>            Reporter: Shinya Kasatani
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>         Attachments: LUCENE-2909_assert.patch, TokenFilterOffset.patch
>
> When using NGramTokenFilter combined with CharFilters that lengthen the original text (such as ß -> ss), the generated offsets exceed the length of the original text. This causes InvalidTokenOffsetsException when you try to highlight the text in Solr.
> While it is not possible to know the accurate offset of each character once you tokenize the whole text with tokenizers like KeywordTokenizer, NGramTokenFilter should at least avoid generating invalid offsets.

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
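The attached patch itself is not shown in this mail, but the invariant such a BaseTokenStreamTestCase check would enforce can be sketched as follows (a hypothetical illustration, not the actual contents of LUCENE-2909_assert.patch): every emitted token's offsets must satisfy 0 <= startOffset <= endOffset <= length of the original text.

```java
// Hypothetical sketch of an offset sanity check of the kind
// BaseTokenStreamTestCase can apply to every emitted token; this is an
// illustration, not the code in LUCENE-2909_assert.patch.
public class OffsetCheckSketch {
    // Fails if a token's offsets fall outside the original text:
    // we require 0 <= startOffset <= endOffset <= textLength.
    static void checkOffsets(int startOffset, int endOffset, int textLength) {
        if (startOffset < 0 || endOffset < startOffset || endOffset > textLength) {
            throw new IllegalStateException("invalid offsets: start="
                    + startOffset + " end=" + endOffset + " len=" + textLength);
        }
    }

    public static void main(String[] args) {
        checkOffsets(0, 2, 5); // fine: within the original text
        try {
            // An n-gram whose end offset exceeds the original text length,
            // as in the ß -> ss case, trips the check.
            checkOffsets(0, 6, 5);
            System.out.println("missed the bad offset");
        } catch (IllegalStateException expected) {
            System.out.println("invalid offset detected");
        }
    }
}
```

Run against an analyzer's full token stream, a check like this surfaces the bug deterministically instead of waiting for a highlighter to throw InvalidTokenOffsetsException.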
[jira] Updated: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text
[ https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shinya Kasatani updated LUCENE-2909:
------------------------------------

    Attachment: TokenFilterOffset.patch

The patch that fixes the problem, including tests.
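To illustrate the reported bug (a hypothetical sketch, not the attached TokenFilterOffset.patch; the `clampedEnd` helper and the "fuß"/"fuss" example are invented for illustration): when a CharFilter expands ß to ss, the token text becomes longer than the original input, so an n-gram filter that derives end offsets from character positions within the token can walk past the end of the original text. Clamping to the parent token's end offset is one way to keep offsets valid.

```java
// Hypothetical sketch of the bug and a clamp-based guard; the names and
// data here are illustrative, not taken from the patch.
public class NGramOffsetSketch {
    // End offset for the i-th n-gram, clamped so it never exceeds the
    // parent token's end offset in the ORIGINAL (pre-CharFilter) text.
    static int clampedEnd(int origStart, int origEnd, int i, int n) {
        return Math.min(origStart + i + n, origEnd);
    }

    public static void main(String[] args) {
        // Original text "fuß" (length 3); a CharFilter expanded ß -> ss,
        // so KeywordTokenizer emits the token "fuss" (length 4) with
        // offsets 0..3 into the original text.
        String token = "fuss";
        int origStart = 0, origEnd = 3;
        int n = 2; // bigrams
        for (int i = 0; i + n <= token.length(); i++) {
            // The naive end offset origStart + i + n reaches 4 for "ss",
            // past the original text; the clamp keeps it at 3.
            System.out.println(token.substring(i, i + n)
                    + " end=" + clampedEnd(origStart, origEnd, i, n));
        }
    }
}
```

With the clamp, the final bigram "ss" reports end offset 3 instead of 4, so a highlighter working against the original 3-character text never sees an out-of-range offset.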