Re: 32-bit Unicode character literals

Stephen C. Gilardi Mon, 27 Apr 2009 07:50:28 -0700


On Apr 27, 2009, at 10:07 AM, samppi wrote:

> I see. Does this mean that, if I expect to handle 32-bit characters,
> then I need to consider changing my character-handling functions to
> accept sequences of vectors instead?

The blog post touches on this, and searching Google and Wikipedia should turn up more info. Many APIs, especially in the Character class, now have versions that accept and return "int" values (in addition to the versions that accept and return "char" values) to support code points beyond 0xFFFF. An int is wide enough to hold any individual Unicode code point.
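As a rough illustration of those int-based APIs, here is a minimal Java sketch. The class and helper names (IntCodePointApis, needsTwoChars, fromCodePoint) are made up for this example; the Character and String methods they call are standard since Java 1.5.

```java
public class IntCodePointApis {
    // True if the code point is beyond 0xFFFF and so needs two chars in a String.
    public static boolean needsTwoChars(int codePoint) {
        return Character.charCount(codePoint) == 2;
    }

    // Builds a String holding exactly one code point, whatever its width.
    public static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int cp = 0x10000; // first code point beyond 0xFFFF
        String s = fromCodePoint(cp);
        System.out.println(s.length());             // counts chars, not code points
        System.out.println(s.codePointAt(0) == cp); // reads the full code point back as an int
        System.out.println(Character.isLetter(cp)); // int overload of isLetter
    }
}
```

The point is only that a value above 0xFFFF fits in an int and round-trips through a String, even though it cannot fit in a single char.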

It may be convenient for you to work with Strings rather than individual characters when possible.

Java strings (as of 1.5) are UTF-16 encoded. This encoding allows (legal) Unicode code points in the range 0 to 0xFFFF to be encoded as a single Java char. To represent code points outside that range, UTF-16 uses a range of code points that are illegal as individual Unicode characters (0xD800–0xDFFF) to encode the code point as a pair of 16-bit values (chars) called a surrogate pair. Using this encoding, you can represent any string made up of legal Unicode code points as a Java string.
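To make the surrogate pair encoding concrete, here is a small Java sketch of the arithmetic. The class and method names (SurrogateMath, highSurrogate, lowSurrogate) are hypothetical; in real code, Character.toChars and Character.toCodePoint do this for you.

```java
public class SurrogateMath {
    // High (leading) surrogate for a code point in the range 0x10000..0x10FFFF.
    public static char highSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low (trailing) surrogate for a code point in the range 0x10000..0x10FFFF.
    public static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x10000;
        char hi = highSurrogate(cp);
        char lo = lowSurrogate(cp);
        // Prints d800 dc00, the pair written as "\ud800\udc00" in a string literal.
        System.out.println(Integer.toHexString(hi) + " " + Integer.toHexString(lo));
        // The library performs the reverse mapping.
        System.out.println(Character.toCodePoint(hi, lo) == cp);
    }
}
```

Subtracting 0x10000 leaves a 20-bit value; the top 10 bits go into the high surrogate and the bottom 10 bits into the low surrogate.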

> Also, how does (seq "\ud800\udc00") work? Does it split the character
> into two 16-bit characters? In the REPL, it seems to return (\? \?).

seq doesn't know about the UTF-16 encoding. It returns a sequence of every char in the string, even those that are half of a surrogate pair. It would be possible to write a seq implementation for strings that knows about UTF-16 and returns a sequence of Unicode code points represented as ints instead. Whether doing that is useful or not depends on what you're hoping to do with the characters.
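For what it's worth, here is one way such a UTF-16-aware traversal might look on the Java side. This is only a sketch, not anything that exists in Clojure; the class and method names (CodePointSeq, toCodePoints) are invented for the example, and it returns a List rather than a lazy seq.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointSeq {
    // Walks the string one code point at a time, stepping over both
    // halves of a surrogate pair together.
    public static List<Integer> toCodePoints(String s) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);    // combines a valid surrogate pair into one int
            result.add(cp);
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', the pair for U+10000, then 'b'
        System.out.println(toCodePoints("a\uD800\uDC00b")); // prints [97, 65536, 98]
    }
}
```

The string in main has four chars but yields three code points, which is the difference from a plain char-by-char sequence.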


--Steve
