[jira] Updated: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text

2011-02-07 Thread Robert Muir (JIRA)

 [ https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2909:


Attachment: LUCENE-2909_assert.patch

Here's a check we can add to BaseTokenStreamTestCase for this condition.
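
For illustration, a minimal sketch of what such a check could look like (the class name, method name, and details here are illustrative, not the actual contents of LUCENE-2909_assert.patch):

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import static org.junit.Assert.assertTrue;

// Illustrative only: walk a TokenStream and verify that every token's
// offsets stay within the bounds of the original (pre-CharFilter) text.
public class OffsetSanityCheck {
  public static void checkOffsets(TokenStream ts, int inputLength) throws IOException {
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      int start = offsetAtt.startOffset();
      int end = offsetAtt.endOffset();
      assertTrue("startOffset " + start + " is negative", start >= 0);
      assertTrue("startOffset " + start + " > endOffset " + end, start <= end);
      // This is the condition NGramTokenFilter violates in this issue.
      assertTrue("endOffset " + end + " exceeds input length " + inputLength,
          end <= inputLength);
    }
    ts.end();
    ts.close();
  }
}
{code}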


 NGramTokenFilter may generate offsets that exceed the length of original text
 -

 Key: LUCENE-2909
 URL: https://issues.apache.org/jira/browse/LUCENE-2909
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.9.4
Reporter: Shinya Kasatani
Assignee: Koji Sekiguchi
Priority: Minor
 Attachments: LUCENE-2909_assert.patch, TokenFilterOffset.patch


 When using NGramTokenFilter combined with CharFilters that lengthen the 
 original text (such as ß -> ss), the generated offsets exceed the length 
 of the original text.
 This causes InvalidTokenOffsetsException when you try to highlight the text 
 in Solr.
 While it is not possible to know the accurate offset of each character once 
 you tokenize the whole text with tokenizers like KeywordTokenizer, 
 NGramTokenFilter should at least avoid generating invalid offsets.
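
A hedged reproduction sketch of the failure mode described above, written against the Lucene 3.x-era analysis API (class names and constructors differ between versions, so treat this as an outline rather than a verbatim test):

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class NGramOffsetRepro {
  public static void main(String[] args) throws Exception {
    String input = "Straße"; // 6 characters of original text
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("ß", "ss"); // lengthening mapping, as in the report

    // KeywordTokenizer emits the whole (expanded) text as one token;
    // NGramTokenFilter then cuts grams out of that expanded term text.
    TokenStream ts = new NGramTokenFilter(
        new KeywordTokenizer(
            new MappingCharFilter(map, CharReader.get(new StringReader(input)))),
        1, 7);
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // Without the fix, endOffset() can exceed input.length() (6) here,
      // which later triggers InvalidTokenOffsetsException when highlighting.
      System.out.println(offsetAtt.startOffset() + "-" + offsetAtt.endOffset());
    }
    ts.end();
    ts.close();
  }
}
{code}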


[jira] Updated: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text

2011-02-06 Thread Shinya Kasatani (JIRA)

 [ https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shinya Kasatani updated LUCENE-2909:


Attachment: TokenFilterOffset.patch

Attached is a patch that fixes the problem, including tests.
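
For context, here is a sketch of the general clamping idea (a hypothetical helper, not the literal contents of TokenFilterOffset.patch): since a gram's offsets are derived from positions in the expanded term text, they must be capped at the enclosing token's corrected offsets before being emitted.

{code:java}
// Hypothetical helper, not the literal TokenFilterOffset.patch: a gram's
// offsets are computed from positions in the (possibly expanded) term text,
// so cap them at the parent token's corrected start/end offsets.
public class GramOffsets {
  static int[] clampGramOffsets(int tokenStart, int tokenEnd, int gramPos, int gramLen) {
    int end = Math.min(tokenStart + gramPos + gramLen, tokenEnd);
    int start = Math.min(tokenStart + gramPos, end);
    return new int[] { start, end };
  }
}
{code}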

 NGramTokenFilter may generate offsets that exceed the length of original text
 -

 Key: LUCENE-2909
 URL: https://issues.apache.org/jira/browse/LUCENE-2909
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.9.4
Reporter: Shinya Kasatani
Priority: Minor
 Attachments: TokenFilterOffset.patch


 When using NGramTokenFilter combined with CharFilters that lengthen the 
 original text (such as ß -> ss), the generated offsets exceed the length 
 of the original text.
 This causes InvalidTokenOffsetsException when you try to highlight the text 
 in Solr.
 While it is not possible to know the accurate offset of each character once 
 you tokenize the whole text with tokenizers like KeywordTokenizer, 
 NGramTokenFilter should at least avoid generating invalid offsets.
