[ https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794841#action_12794841 ]
Robert Muir commented on LUCENE-2183:
-------------------------------------

Hello, my first comment is that I really like how the IO handling is done in CharacterUtils. It solves a problem that reaches beyond CharTokenizer: other tokenizers in Lucene contrib that do NOT extend CharTokenizer have the same logic and also need to be fixed. So we could reuse this code in other places too, such as CJKTokenizer.

I think we could also reuse this code to fix some unrelated problems in the n-gram tokenizers (at a glance, I do not see how the n-gram tokenizer IO handling even works correctly at all).

> Supplementary Character Handling in CharTokenizer
> -------------------------------------------------
>
>                 Key: LUCENE-2183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2183
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2183.patch
>
>
> CharTokenizer is an abstract base class for all Tokenizers operating on a
> character level. Yet, those tokenizers still use char primitives instead of
> int codepoints. CharTokenizer should operate on codepoints and preserve bw
> compatibility.
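
For readers unfamiliar with the IO-handling problem mentioned in the comment, here is a minimal sketch of one way it can arise and be avoided. It is not the code from LUCENE-2183.patch and not the CharacterUtils API; the class name CodePointFillSketch and the method codePoints are hypothetical names chosen for illustration. The assumption is that a tokenizer reads from a Reader in fixed-size chunks, so a supplementary character (a surrogate pair) can be split across two reads unless the trailing high surrogate is held back and carried into the next chunk.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
class CodePointFillSketch {

    /**
     * Reads the whole Reader in chunks of at most bufferSize chars and
     * returns the decoded code points. A high surrogate at the end of a
     * chunk is carried over so that it can be paired with the low
     * surrogate arriving in the next read.
     */
    static List<Integer> codePoints(Reader reader, int bufferSize) throws IOException {
        if (bufferSize < 2) {
            throw new IllegalArgumentException("bufferSize must be at least 2");
        }
        char[] buffer = new char[bufferSize];
        List<Integer> result = new ArrayList<>();
        int carried = 0; // 0 or 1 chars kept from the previous chunk

        while (true) {
            int read = reader.read(buffer, carried, buffer.length - carried);
            if (read == -1) {
                // End of input: an unpaired carried surrogate is emitted as-is.
                if (carried == 1) {
                    result.add((int) buffer[0]);
                }
                return result;
            }
            if (read == 0) {
                continue; // nothing new arrived; try again
            }
            int length = carried + read;

            // Hold back a trailing high surrogate; its partner may follow.
            int usable = Character.isHighSurrogate(buffer[length - 1]) ? length - 1 : length;

            // Iterate by code point rather than by char.
            for (int i = 0; i < usable; ) {
                int cp = Character.codePointAt(buffer, i, usable);
                result.add(cp);
                i += Character.charCount(cp);
            }

            if (usable < length) {
                buffer[0] = buffer[length - 1];
                carried = 1;
            } else {
                carried = 0;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // "a", U+1D50B (a supplementary character), "b", read two chars at a time.
        String input = "a\uD835\uDD0Bb";
        for (int cp : codePoints(new StringReader(input), 2)) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```

With a buffer of two chars, the first read ends on the high surrogate of U+1D50B. A loop that decoded each chunk independently would see two unpaired surrogates, whereas the carry-over above yields a single code point. The same concern would apply to any tokenizer that manages its own read buffer, which is presumably why shared IO-handling code is attractive.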