[ https://issues.apache.org/jira/browse/LUCENE-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12569706#action_12569706 ]
Koji Sekiguchi commented on LUCENE-973:
---------------------------------------
The current CJKTokenizer returns a redundant empty string at the end of the
token stream when it tokenizes CJK characters.
// C1, C2 and C3 each stand for one CJK character
String str = "C1C2C3";
Tokenizer tokenizer = new CJKTokenizer( new StringReader( str ) );
for( Token token = tokenizer.next(); token != null; token = tokenizer.next() )
    System.out.println( "token = \"" + token.termText() + "\"" );
This should be:
token = "C1C2"
token = "C2C3"
but the current CJKTokenizer outputs:
token = "C1C2"
token = "C2C3"
token = ""
The attached test case reproduces this problem and the patch solves it.
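For readers without a Lucene checkout at hand, the expected behaviour can be sketched in plain Java. This is only an illustration of overlapping bigrams with no trailing empty token; it is not the CJKTokenizer implementation or the attached patch. The class name BigramSketch and the method bigrams are hypothetical, and the sketch assumes the input contains only CJK characters from the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not part of Lucene.
class BigramSketch {

    // Returns overlapping two-character tokens for the given text.
    // Assumes every char is a single BMP CJK character.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        if (text.length() == 1) {
            // A lone character is emitted as a single-character token.
            tokens.add(text);
            return tokens;
        }
        // Stop one position before the end, so no empty or partial
        // token is produced after the last full bigram.
        for (int i = 0; i + 2 <= text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Three CJK characters play the role of C1, C2 and C3.
        for (String token : bigrams("\u4e00\u4e8c\u4e09")) {
            System.out.println("token = \"" + token + "\"");
        }
    }
}
```

The loop bound is the point of the sketch: once the final bigram has been emitted, nothing is left to return, so the stream should simply end.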
> Token of "" returns in CJK
> ---------------------------
>
> Key: LUCENE-973
> URL: https://issues.apache.org/jira/browse/LUCENE-973
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 2.3
> Reporter: Toru Matsuzawa
> Attachments: CJKTokenizer20070807.patch
>
>
> An empty string ("") is returned as a Token at the boundary between a
> two-byte character and a one-byte character.
> This is not a problem in CJKAnalyzer.
> It becomes a problem when CJKTokenizer is used on its own (for example,
> with Solr).