supplementary character handling
--------------------------------
Key: LUCENE-1689
URL: https://issues.apache.org/jira/browse/LUCENE-1689
Project: Lucene - Java
Issue Type: Improvement
Reporter: Robert Muir
Priority: Minor
This issue is for Java 5. Java 5 is based on Unicode 4, so a char is a UTF-16
code unit and characters outside the BMP are encoded as two chars (a surrogate
pair). Supplementary character support should be fixed for code that works
with char/char[].
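
To make the variable-width point concrete, here is a minimal sketch using only
java.lang APIs available since Java 5. The class name SupplementaryDemo is
made up for illustration, and U+10400 (a Deseret capital letter) is just one
convenient supplementary character.

```java
final class SupplementaryDemo {
    // U+10400 DESERET CAPITAL LETTER LONG I lies outside the BMP,
    // so it is stored as a surrogate pair (two chars).
    static final String DESERET = new String(Character.toChars(0x10400));

    // Number of UTF-16 code units (what char[]-based code sees).
    static int charLength(String s) {
        return s.length();
    }

    // Number of Unicode code points (what the user sees as characters).
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(charLength(DESERET));      // 2
        System.out.println(codePointLength(DESERET)); // 1
        System.out.println(Character.isHighSurrogate(DESERET.charAt(0))); // true
    }
}
```

Any loop that treats each char as a whole character will see two meaningless
halves here instead of one letter.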
For example:
StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc. should at least be changed
so they don't actually remove supplementary characters, or be modified to look
for surrogates and behave correctly.
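
A rough illustration of how a supplementary letter can get dropped, assuming
the code tests each char with Character.isLetter(char). This is a standalone
sketch, not the analyzers' actual code; LetterFilterDemo and its methods are
hypothetical names.

```java
final class LetterFilterDemo {
    // Per-char test: each surrogate half is not a letter on its own,
    // so a supplementary letter is silently discarded.
    static String keepLettersByChar(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Per-code-point test: the surrogate pair is decoded first,
    // so the supplementary letter is recognized and kept.
    static String keepLettersByCodePoint(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "a\uD801\uDC00b1"; // 'a', U+10400, 'b', '1'
        System.out.println(keepLettersByChar(input).length());      // 2
        System.out.println(keepLettersByCodePoint(input).length()); // 4
    }
}
```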
LowerCaseFilter should be modified to lowercase supplementary characters
correctly.
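
A minimal sketch of code-point-aware lowercasing, assuming it is acceptable to
build a new String rather than edit the term buffer in place. The class name
CodePointLowerCase is hypothetical and this is not the filter's real
implementation.

```java
final class CodePointLowerCase {
    // Lowercases one code point at a time, so surrogate pairs are
    // decoded before Character.toLowerCase(int) is applied.
    static String lowerCase(String term) {
        StringBuilder out = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); ) {
            int cp = term.codePointAt(i);
            out.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+10400 (Deseret capital) lowercases to U+10428 (Deseret small).
        String upper = new String(Character.toChars(0x10400));
        System.out.println(lowerCase(upper).codePointAt(0) == 0x10428); // true
    }
}
```

Calling Character.toLowerCase(char) on each half of the pair leaves both
halves unchanged, which is why a per-char loop cannot lowercase such letters.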
CharTokenizer should either be deprecated or changed so that isTokenChar() and
normalize() use int.
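
One possible shape for an int-based variant, as a sketch only. The names
CodePointTokenizer and LowerCaseLetterTokenizer are hypothetical and the
String-in, List-out signature is a simplification, not Lucene's tokenizer API.

```java
import java.util.ArrayList;
import java.util.List;

abstract class CodePointTokenizer {
    // Both hooks take a full code point (int) instead of a char.
    protected abstract boolean isTokenChar(int codePoint);

    protected int normalize(int codePoint) {
        return codePoint;
    }

    final List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp); // advance by 1 or 2 chars
            if (isTokenChar(cp)) {
                current.appendCodePoint(normalize(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}

// Example subclass: tokens are runs of letters, lowercased.
final class LowerCaseLetterTokenizer extends CodePointTokenizer {
    @Override
    protected boolean isTokenChar(int codePoint) {
        return Character.isLetter(codePoint);
    }

    @Override
    protected int normalize(int codePoint) {
        return Character.toLowerCase(codePoint);
    }
}
```

With this shape, a subclass never sees half of a surrogate pair, so existing
predicates only need to switch from the char overloads to the int overloads.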
In all of these cases, code should remain optimized for the BMP case, and
supplementary characters should be the exception, but still work.
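
A sketch of what "optimized for the BMP case" could look like on a char[]
buffer: one cheap surrogate check per char, with pair decoding only on the
rare slow path. BmpFastPath and countLetters are hypothetical names chosen for
illustration, and skipping unpaired surrogates is an assumption about the
desired behavior.

```java
final class BmpFastPath {
    // Counts letters in buf[0..len), treating a valid surrogate pair
    // as a single character.
    static int countLetters(char[] buf, int len) {
        int count = 0;
        for (int i = 0; i < len; i++) {
            char c = buf[i];
            if (!Character.isHighSurrogate(c)) {
                // Fast path: ordinary BMP char. A stray low surrogate also
                // lands here and is simply not a letter.
                if (Character.isLetter(c)) {
                    count++;
                }
            } else if (i + 1 < len && Character.isLowSurrogate(buf[i + 1])) {
                // Slow path: decode the pair and test the full code point.
                if (Character.isLetter(Character.toCodePoint(c, buf[i + 1]))) {
                    count++;
                }
                i++; // skip the low surrogate just consumed
            }
            // Unpaired high surrogate: assumed to be ignored.
        }
        return count;
    }

    public static void main(String[] args) {
        char[] buf = "a1\uD801\uDC00b".toCharArray();
        System.out.println(countLetters(buf, buf.length)); // 3
    }
}
```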