[ 
https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794841#action_12794841
 ] 

Robert Muir commented on LUCENE-2183:
-------------------------------------

Hello, my first comment is that I really like how the I/O handling is done in 
CharacterUtils.

This solves a problem that goes beyond CharTokenizer: other tokenizers in 
Lucene contrib that do NOT extend CharTokenizer have the same logic and also 
need to be fixed.

So we could reuse this code in other places too, such as CJKTokenizer. I think 
we could also reuse it to fix some unrelated problems in the n-gram tokenizers 
(at a glance, I do not see how the n-gram tokenizers' I/O handling works 
correctly at all).
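
To illustrate the kind of I/O handling I mean, here is a rough sketch of a 
buffer fill that never hands the caller a chunk ending in the first half of a 
surrogate pair. This is NOT the code from the patch; the class name 
SurrogateSafeBuffer and its methods are made up for this example, and it only 
assumes the standard java.io.Reader and java.lang.Character APIs.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: fills a char[] from a Reader and
// holds back a trailing high surrogate until its low surrogate can be read.
class SurrogateSafeBuffer {
    private final char[] buffer;
    private int length;          // number of valid chars currently in buffer
    private char pendingHigh;    // high surrogate held back by the last fill
    private boolean hasPending;

    SurrogateSafeBuffer(int size) {
        if (size < 2) {
            throw new IllegalArgumentException("size must be >= 2");
        }
        buffer = new char[size];
    }

    // Returns false once the reader is exhausted and nothing is left to hand out.
    boolean fill(Reader reader) throws IOException {
        int offset = 0;
        if (hasPending) {
            // Put the held-back high surrogate first so the pair stays together.
            buffer[0] = pendingHigh;
            offset = 1;
            hasPending = false;
        }
        int read = reader.read(buffer, offset, buffer.length - offset);
        if (read == -1) {
            // End of input: an unpaired trailing high surrogate is passed through.
            length = offset;
            return length > 0;
        }
        length = offset + read;
        if (Character.isHighSurrogate(buffer[length - 1])) {
            // The matching low surrogate may arrive with the next read.
            pendingHigh = buffer[--length];
            hasPending = true;
        }
        return true;
    }

    String current() {
        return new String(buffer, 0, length);
    }

    // Convenience for experiments: split a string into the chunks fill() yields.
    static List<String> chunks(String input, int size) {
        SurrogateSafeBuffer buf = new SurrogateSafeBuffer(size);
        List<String> out = new ArrayList<>();
        try (Reader reader = new StringReader(input)) {
            while (buf.fill(reader)) {
                out.add(buf.current());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

With a buffer of 4 chars and an input where a supplementary character 
straddles the fourth and fifth char, the first chunk stops one char early and 
the pair arrives intact in the second chunk. Any tokenizer that reads into a 
fixed-size char[] could share one helper of this shape instead of repeating 
the logic.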


> Supplementary Character Handling in CharTokenizer
> -------------------------------------------------
>
>                 Key: LUCENE-2183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2183
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2183.patch
>
>
> CharTokenizer is an abstract base class for all Tokenizers operating on a 
> character level. Yet, those tokenizers still use char primitives instead of 
> int codepoints. CharTokenizer should operate on codepoints and preserve 
> backwards compatibility. 
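
For readers unfamiliar with the char vs. codepoint distinction described 
above, the following is a small sketch of why it matters for a letter-based 
tokenizer. It is not the CharTokenizer code; the class 
CodePointLetterSplitter and its two methods are invented for illustration and 
use only java.lang.Character and java.lang.String.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Lucene code: splits text into runs of letters,
// once by inspecting char primitives and once by inspecting int code points.
class CodePointLetterSplitter {

    // char-based: each half of a surrogate pair is tested on its own, and
    // Character.isLetter(char) is false for surrogates, so tokens get cut.
    static List<String> splitByChar(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetter(text.charAt(i))) {
                if (start < 0) start = i;
            } else if (start >= 0) {
                tokens.add(text.substring(start, i));
                start = -1;
            }
        }
        if (start >= 0) tokens.add(text.substring(start));
        return tokens;
    }

    // code-point-based: a surrogate pair is decoded to one int and tested once.
    static List<String> splitByCodePoint(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                if (start < 0) start = i;
            } else if (start >= 0) {
                tokens.add(text.substring(start, i));
                start = -1;
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        if (start >= 0) tokens.add(text.substring(start));
        return tokens;
    }
}
```

For an input such as "a\uD801\uDC00b c", where \uD801\uDC00 encodes the 
letter U+10400, the char-based loop breaks the first word apart at the 
surrogate pair, while the code-point loop keeps it as a single token.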

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
