CJKTokenizer convert HALFWIDTH_AND_FULLWIDTH_FORMS wrong
----------------------------------------------------------
Key: LUCENE-1490
URL: https://issues.apache.org/jira/browse/LUCENE-1490
Project: Lucene - Java
Issue Type: Bug
Reporter: Daniel Cheng
Fix For: 2.4
CJKTokenizer has these lines:
if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS)
{
    /** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
    int i = (int) c;
    i = i - 65248;
    c = (char) i;
}
This is wrong. Some characters in the block (e.g. U+FF68) have no BASIC_LATIN
counterparts.
Only 65281-65374 (U+FF01 to U+FF5E) can be converted this way.
The fix is to restrict the conversion to that range:
if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS &&
    c >= 65281 && c <= 65374) {
    /** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
    int i = (int) c;
    i = i - 65248;
    c = (char) i;
}
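For illustration, the range check can be sketched as a standalone helper. This is only a minimal sketch: the class name FullwidthFolding and the method fold are hypothetical and are not part of CJKTokenizer or Lucene. It assumes the convertible range is 65281-65374 (U+FF01 to U+FF5E) and the offset is 65248, as stated in this report.

```java
// Hypothetical helper for illustration only; not part of Lucene.
final class FullwidthFolding {
    // U+FF01 (65281) .. U+FF5E (65374): fullwidth forms of U+0021 .. U+007E.
    private static final int FULLWIDTH_START = 0xFF01;
    private static final int FULLWIDTH_END = 0xFF5E;
    // 65248: constant distance between a fullwidth form and its BASIC_LATIN counterpart.
    private static final int OFFSET = 0xFEE0;

    /** Returns the BASIC_LATIN counterpart of c if it has one, otherwise c unchanged. */
    static char fold(char c) {
        if (c >= FULLWIDTH_START && c <= FULLWIDTH_END) {
            return (char) (c - OFFSET);
        }
        // Characters such as U+FF68 (halfwidth katakana) have no counterpart and pass through.
        return c;
    }
}
```

Checking the numeric range alone is sufficient in this sketch, because every code point in 65281-65374 lies inside the HALFWIDTH_AND_FULLWIDTH_FORMS block; the UnicodeBlock test in the tokenizer can stay as an extra guard.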