Re: Unicode Normalization

Mike Klaas Wed, 11 Apr 2007 20:33:07 -0700

On 4/11/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 4/11/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> Unicode characters do not map
> precisely to code points:  a single character can often be represented
> via a single codepoint or a combination of two (surrogate pair).


I normally hear about surrogates in the context of UTF-16, introduced once
the code point space grew too large for a single 16-bit unit to represent.
AFAIK it's more of an encoding thing, not a code point thing... for example,
you would never see the surrogates if you encoded in UTF-8 (although the
surrogates are still code points, since they needed to be reserved).


You're right.  Bringing up surrogate pairs just muddles the discussion.

But there do seem to be groups of code points that map to a single character:
http://en.wikipedia.org/wiki/Combining_character

> I have no idea how Java's String class handles this--I doubt it does any
> intelligent normalization.

UTF-16 surrogates are handled as of Java 5.
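
For illustration, here is a minimal sketch of the surrogate point, assuming
a JDK recent enough to have StandardCharsets (Java 7+; the code point
methods themselves date from Java 5). The class name SurrogateDemo and its
helper methods are made up for this example. It takes one character outside
the BMP and counts it three ways: as UTF-16 units, as code points, and as
UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane,
    // so a Java String stores it as a surrogate pair (two char values).
    static final String CLEF = new String(Character.toChars(0x1D11E));

    // Number of UTF-16 code units (what String.length() reports).
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points (surrogate pairs count once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the UTF-8 encoding; no surrogates appear here.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("UTF-16 units: " + utf16Units(CLEF)); // 2
        System.out.println("code points:  " + codePoints(CLEF)); // 1
        System.out.println("UTF-8 bytes:  " + utf8Bytes(CLEF));  // 4
    }
}
```

The pair only shows up in the UTF-16 view; the UTF-8 form is a single
four-byte sequence for the one code point.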


And it seems that character composition and normalization are built into Java 6:
http://weblogs.java.net/blog/joconner/archive/2007/02/normalization_c.html
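
As a rough sketch of how that could be used (assuming Java 6 or later; the
class name NormalizeDemo and the nfc/nfd helpers are made up for this
example), java.text.Normalizer can compose "e" followed by a combining
acute accent into the single code point U+00E9, and decompose it again:

```java
import java.text.Normalizer;

public class NormalizeDemo {
    // Canonical composition: combine base + combining marks where possible.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Canonical decomposition: split precomposed characters apart.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' + COMBINING ACUTE ACCENT
        String composed = "\u00e9";    // LATIN SMALL LETTER E WITH ACUTE

        // The two strings display the same but are not equal as raw strings.
        System.out.println(decomposed.equals(composed));      // false
        // After normalizing to the same form they compare equal.
        System.out.println(nfc(decomposed).equals(composed)); // true
        System.out.println(nfd(composed).equals(decomposed)); // true
    }
}
```

Normalizing both the indexed text and the query to the same form (NFC, say)
before comparing would make the two spellings match.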

-Mike
