Re: UTF-16 vs UTF-32

Allen Wirfs-Brock Mon, 16 May 2011 18:02:32 -0700

On May 16, 2011, at 5:42 PM, Shawn Steele wrote:

> It's clear why we want to support the full Unicode range, but it's less clear 
> to me why UTF-32 would be desirable internally.  (Sure, it'd be nice for 
> conversion types).
> 
> What UTF-32 has that UTF-16 doesn't is the ability to walk a string without 
> accidentally chopping up a surrogate pair.  However, in practice, stepping 
> over surrogates is pretty much the least of the problems with walking a 
> string.  Combining characters and the like cause numerous typographical 
> shapes/glyphs to be represented by more than one Unicode codepoint, even in 
> UTF-32.  We don't see that in Latin so much, especially in NFC, but in some 
> scripts most characters require multiple code points.  
> 
> In other words, if I'm trying to find "safe" places to break a string, append 
> text, or many other operations, then UTF-16 is no more complicated than 
> UTF-32, even when considering surrogates.
> 
> UTF-32 would cause a huge amount of ambiguity though about what happens to 
> all of those UTF-16 sequences that currently sort-of work even though they 
> shouldn't really because ES is nominally UCS-2.
> 
> -Shawn


One reason is that none of the built-in string methods understand surrogate 
pairs. If you want to do any string processing that recognizes such pairs, you 
have to either handle them as multi-character sequences or do your own 
character-by-character processing.
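
As a rough illustration of that point (a sketch only, not taken from any 
spec text): the built-ins below count 16-bit code units, so a supplementary 
character shows up as two "characters". The helper name codePoints and the 
sample string are assumptions made up for this example.

```javascript
// Sample string (an assumption for illustration):
// "a", U+1D49E encoded as one surrogate pair, "b".
var s = "a\uD835\uDC9Eb";

// Built-in methods see code units, so the pair counts as two elements.
var len = s.length;      // 4 code units for 3 characters
var half = s.charAt(1);  // a lone high surrogate, "\uD835"

// Hypothetical helper (not a built-in): the kind of hand-written
// walk that combines each well-formed pair into one code point.
function codePoints(str) {
  var out = [];
  for (var i = 0; i < str.length; i++) {
    var hi = str.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
      var lo = str.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        out.push((hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000);
        i++; // skip the low surrogate that was just consumed
        continue;
      }
    }
    out.push(hi); // BMP code unit or unpaired surrogate, passed through
  }
  return out;
}
```

Every operation that cares about pairs (indexing, slicing, reversing and so 
on) needs the same kind of wrapper, which is the cost being described here.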

Allen
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
