[jira] Commented: (LUCENE-1689) supplementary character handling

Robert Muir (JIRA) Mon, 16 Nov 2009 13:03:06 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778532#action_12778532 ]


Robert Muir commented on LUCENE-1689:
-------------------------------------

Steven, no, it's definitely the right place to point it out! I know this is 
true, even with 1.5 :)

One reason I wanted to split this issue up was so that we can make 
incremental 'improvements', even if we do not fix everything.
There are also other options for StandardTokenizer/JFlex.
For instance, we could avoid breaking between the two halves of a surrogate 
pair and classify every pair as CJK (indexing each character individually) 
for the time being.
While technically incorrect, it would handle the common cases, i.e. ideographs 
from Big5-HKSCS, etc.
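
To make that idea concrete, here is a rough sketch in plain Java. It is not 
JFlex grammar and not the StandardTokenizer code; the class and method names 
are made up for illustration. It never splits a surrogate pair and emits 
every supplementary code point as its own token, the way CJK characters are 
indexed individually:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
public class SurrogateAwareSplit {

    /**
     * Splits text into tokens. Runs of BMP letters/digits become one token;
     * each supplementary code point becomes a single-character token
     * (the "treat it as CJK" simplification).
     */
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // codePointAt reads both halves of a valid surrogate pair.
            int cp = text.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                flush(word, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(cp);
            } else {
                // Separators and unpaired surrogates end the current word.
                flush(word, tokens);
            }
            // Advance by 1 char for BMP, 2 chars for supplementary.
            i += Character.charCount(cp);
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }
}
```

The one-token-per-code-point choice is wrong for supplementary characters 
that are not ideographs, which is the "technically incorrect" part.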

But right now I guess the topic is Unicode and index backwards compatibility 
in general: trying to figure out a reasonable approach to handling this 
(i.e. supporting the old, broken behavior and the indexes created with it).

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, 
> LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on Unicode 4, which means variable-width encoding.
> Supplementary character support should be fixed for code that works with 
> char/char[].
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be 
> changed so they don't actually remove suppl characters, or modified to look 
> for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() 
> and normalize() use int.
> In all of these cases code should remain optimized for the BMP case, and 
> suppl characters should be the exception, but still work.
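
For reference, a minimal sketch of what lowercasing by code point instead of 
by char could look like on a char[] buffer. This is not the attached patch; 
the class name is made up, and it assumes the lowercase form occupies the 
same number of chars as the original, so the buffer can be rewritten in 
place:

```java
// Hypothetical illustration only; not the LUCENE-1689 patch.
public class CodePointLowercase {

    /**
     * Lowercases buf[0..len) in place, one code point at a time.
     * Assumption: Character.toLowerCase(int) returns a code point of the
     * same char width as its input, so no shifting is needed.
     */
    static void toLowerCase(char[] buf, int len) {
        int i = 0;
        while (i < len) {
            // Reads a full surrogate pair when one is present before len.
            int cp = Character.codePointAt(buf, i, len);
            // toChars writes the result at index i and returns 1 or 2.
            i += Character.toChars(Character.toLowerCase(cp), buf, i);
        }
    }
}
```

With this, a supplementary capital such as U+10400 (DESERET CAPITAL LETTER 
LONG I) is mapped to U+10428, whereas calling Character.toLowerCase(char) on 
each surrogate half leaves it unchanged.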


