[jira] Commented: (LUCENE-2183) Supplementary Character Handling in CharTokenizer

Robert Muir (JIRA) Mon, 28 Dec 2009 11:13:57 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794886#action_12794886
 ]


Robert Muir commented on LUCENE-2183:
-------------------------------------

Hello Simon, another option very similar to yours (I have not verified that 
it would work; I am just thinking out loud) could be:

{code}
/** This method will be declared abstract in Lucene 4.0. */
public boolean isTokenChar(int ch) {
  throw new UnsupportedOperationException();
}

/** @deprecated will be removed in Lucene 5.0 */
@Deprecated
public boolean isTokenChar(char ch) {
  return isTokenChar((int) ch);
}
{code}

and do the same for normalize(). The rest would be the same as your patch:
* Use CharacterUtils for io-buffering
* Use CharacterUtils for character/codepoint iteration.
* Use Version to decide which method to call instead of reflection. This 
should not be a conditional evaluated on each call to isTokenChar(); instead, 
the choice should be made once, for example by picking one of two private 
inner classes.
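
To make the last point concrete, here is a minimal sketch of choosing the 
reading strategy once at construction time. Every name in it (SketchVersion, 
CodePointSplitter, the two reader classes) is hypothetical and stands in for 
Version, CharTokenizer and CharacterUtils; it is not the code from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical stand-in for Lucene's Version; only the ordering matters here. */
enum SketchVersion { LUCENE_30, LUCENE_31 }

/** Hypothetical sketch, not the actual CharTokenizer or CharacterUtils API. */
final class CodePointSplitter {
  /** Reads one "character" at an offset; the implementation is picked once. */
  private interface Reader {
    int read(CharSequence text, int offset);
  }

  /** Pre-3.1 behaviour: every UTF-16 code unit is treated as a character. */
  private static final class CharReader implements Reader {
    public int read(CharSequence text, int offset) {
      return text.charAt(offset);
    }
  }

  /** 3.1 behaviour: a surrogate pair is read as a single code point. */
  private static final class CodePointReader implements Reader {
    public int read(CharSequence text, int offset) {
      return Character.codePointAt(text, offset);
    }
  }

  private final Reader reader;
  private final IntPredicate isTokenChar;

  CodePointSplitter(SketchVersion matchVersion, IntPredicate isTokenChar) {
    // Decided here, once, rather than on every isTokenChar() call.
    this.reader = matchVersion.compareTo(SketchVersion.LUCENE_31) >= 0
        ? new CodePointReader()
        : new CharReader();
    this.isTokenChar = isTokenChar;
  }

  /** Splits text into maximal runs of characters accepted by isTokenChar. */
  List<String> split(CharSequence text) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); ) {
      int c = reader.read(text, i);
      i += Character.charCount(c); // 1 for BMP values, 2 for supplementary
      if (isTokenChar.test(c)) {
        current.appendCodePoint(c);
      } else if (current.length() > 0) {
        tokens.add(current.toString());
        current.setLength(0);
      }
    }
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }
}
```

With a letter predicate, the char-based reader breaks a token at a 
supplementary letter such as U+10400, because each surrogate half is tested on 
its own, while the code-point reader keeps the token intact.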

The difference would be that the API would appear more natural, in my opinion, 
and once the deprecations are removed we would end up with an abstract class 
that is the int-based equivalent of what we have now.

If someone attempts to use a CharTokenizer that does *not* support the 
int-based methods (it only implements the char-based methods) with 
Version.LUCENE_31, then this would throw UnsupportedOperationException (UOE), 
which in my opinion is correct, as that tokenizer does not support the 
behavior of that version.
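
A rough sketch of that failure mode, assuming the shape proposed above. The 
class names and the acceptsCodePoint() helper are hypothetical; the helper 
only stands in for whatever code-point path Version.LUCENE_31 would select.

```java
/** Hypothetical base class mirroring the proposal; not Lucene's CharTokenizer. */
abstract class SketchCharTokenizer {
  /** Int-based hook; under the proposal it would later become abstract. */
  protected boolean isTokenChar(int c) {
    throw new UnsupportedOperationException(
        "isTokenChar(int) is not implemented by " + getClass().getName());
  }

  /** Char-based hook kept only for backward compatibility. */
  @Deprecated
  protected boolean isTokenChar(char c) {
    return isTokenChar((int) c);
  }

  /** Stand-in for the code-point path that LUCENE_31 would take. */
  public final boolean acceptsCodePoint(int codePoint) {
    return isTokenChar(codePoint); // resolves to the int overload
  }
}

/** Old-style subclass: overrides only the char-based method. */
final class LegacyLetterTokenizer extends SketchCharTokenizer {
  @Override
  @Deprecated
  protected boolean isTokenChar(char c) {
    return Character.isLetter(c);
  }
}

/** New-style subclass: overrides the int-based method. */
final class ModernLetterTokenizer extends SketchCharTokenizer {
  @Override
  protected boolean isTokenChar(int c) {
    return Character.isLetter(c);
  }
}
```

Calling acceptsCodePoint() on ModernLetterTokenizer works for supplementary 
code points, while the same call on LegacyLetterTokenizer falls through to the 
base implementation and throws UnsupportedOperationException.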


> Supplementary Character Handling in CharTokenizer
> -------------------------------------------------
>
>                 Key: LUCENE-2183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2183
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2183.patch
>
>
> CharTokenizer is an abstract base class for all Tokenizers operating on a 
> character level. Yet, those tokenizers still use char primitives instead of 
> int codepoints. CharTokenizer should operate on codepoints and preserve 
> backward compatibility. 

