[ 
https://issues.apache.org/jira/browse/LUCENE-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720207#action_12720207
 ] 

Michael McCandless commented on LUCENE-973:
-------------------------------------------

Well, my question is: is there any input text that would cause an arbitrary 
number of such 0-length tokens in a row?

E.g., the original cause was the boundary between a two-byte character and a 
one-byte character... so if that's the only case that produces a 0-length 
token, then we are OK.  But if there are other cases, such that one could 
chain any number of such tokens in sequence, then we're not OK, and we have 
to convert the recursion into iteration.
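
To make the concern concrete, here is a minimal sketch of the two ways to skip 
0-length tokens. It is not the CJKTokenizer code or the attached patch: the 
class EmptyTokenSkipSketch, its methods, and the use of an Iterator<String> 
as a stand-in for the raw token stream are all hypothetical, and it assumes 
the fix skips an empty token by calling the token-fetching method again.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not Lucene's CJKTokenizer.
public class EmptyTokenSkipSketch {

    /** Recursive skip: uses one stack frame per consecutive empty token. */
    static String nextRecursive(Iterator<String> raw) {
        if (!raw.hasNext()) {
            return null; // end of stream
        }
        String token = raw.next();
        if (token.isEmpty()) {
            // A long enough run of empty tokens could exhaust the stack here.
            return nextRecursive(raw);
        }
        return token;
    }

    /** Iterative skip: same result, constant stack depth. */
    static String nextIterative(Iterator<String> raw) {
        while (raw.hasNext()) {
            String token = raw.next();
            if (!token.isEmpty()) {
                return token;
            }
        }
        return null; // end of stream
    }

    /** Collects every non-empty token using the iterative form. */
    static List<String> drain(Iterator<String> raw) {
        List<String> out = new ArrayList<>();
        for (String t = nextIterative(raw); t != null; t = nextIterative(raw)) {
            out.add(t);
        }
        return out;
    }
}
```

If empty tokens can only occur singly, both forms behave the same. If some 
input can chain them, only the recursive form's stack depth grows with the 
length of the run.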


> Token of  "" returns in CJKTokenizer + new TestCJKTokenizer
> -----------------------------------------------------------
>
>                 Key: LUCENE-973
>                 URL: https://issues.apache.org/jira/browse/LUCENE-973
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.3
>            Reporter: Toru Matsuzawa
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: CJKTokenizer20070807.patch, LUCENE-973.patch, 
> LUCENE-973.patch, with-patch.jpg, without-patch.jpg
>
>
> An empty string ("") is returned as a Token at the boundary between a 
> two-byte character and a one-byte character. 
> This is not a problem in CJKAnalyzer. 
> It becomes a problem when CJKTokenizer is used on its own (e.g., with 
> Solr).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


