[ https://issues.apache.org/jira/browse/LUCENE-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless resolved LUCENE-1490.
----------------------------------------

    Resolution: Fixed

Thanks Daniel!

> CJKTokenizer convert HALFWIDTH_AND_FULLWIDTH_FORMS wrong
> ----------------------------------------------------------
>
>                 Key: LUCENE-1490
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1490
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Daniel Cheng
>            Assignee: Michael McCandless
>             Fix For: 2.9, 2.4
>
>
> CJKTokenizer has these lines:
>
>     if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {
>         /** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
>         int i = (int) c;
>         i = i - 65248;
>         c = (char) i;
>     }
>
> This is wrong. Some characters in the block (e.g. U+FF68) have no BASIC_LATIN
> counterparts.
> Only 65281-65374 (U+FF01-U+FF5E) can be converted this way.
>
> The fix is:
>
>     if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {
>         int i = (int) c;
>         if (i >= 65281 && i <= 65374) {
>             /** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
>             i = i - 65248;
>             c = (char) i;
>         }
>     }
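
For readers who want to try the range check outside the tokenizer, here is a minimal standalone sketch. The class `FullwidthFolding` and the method `foldFullwidth` are hypothetical names used only for illustration and are not part of CJKTokenizer; the constants are the ones given in the issue (offset 65248, convertible range 65281-65374).

```java
class FullwidthFolding {
    // Distance between a fullwidth ASCII variant and its BASIC_LATIN counterpart (0xFEE0).
    private static final int OFFSET = 65248;
    // Convertible range from the issue: U+FF01 (65281) through U+FF5E (65374).
    private static final int FIRST = 65281;
    private static final int LAST = 65374;

    /** Hypothetical helper: folds fullwidth ASCII variants, leaves everything else unchanged. */
    static char foldFullwidth(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                && c >= FIRST && c <= LAST) {
            return (char) (c - OFFSET);
        }
        return c;
    }

    public static void main(String[] args) {
        // U+FF21 is FULLWIDTH LATIN CAPITAL LETTER A and folds to 'A'.
        System.out.println(foldFullwidth('\uFF21'));
        // U+FF68 is a halfwidth katakana letter with no BASIC_LATIN counterpart; it is returned as-is.
        System.out.println(foldFullwidth('\uFF68') == '\uFF68');
    }
}
```

Without the range check, U+FF68 would be shifted to U+0088, a control character, which is the misbehavior the issue describes.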