> > Summary
> >
> > "This document attempts to make the case that it is advantageous to
> > use UTF-16 (or 16-bit Unicode strings) for text processing..."
>
> IMHO this is one of the worst mistakes Unicode is trying to make.
> It convinces people that they should not worry about characters above
> U+FFFF just because they are very rare. UTF-16 combines the worst
> aspects of UTF-8 and UTF-32.
No, that's wrong. Here's a direct quote from the document:

"Important: Supplementary code points must be supported for full Unicode
support, regardless of the encoding form. Many characters are assigned
supplementary code points, and even whole scripts are entirely encoded
outside of the BMP. The opportunity for optimization of 16-bit Unicode
string processing is that the most commonly used characters are stored
with single 16-bit code units, so that it is useful to concentrate
performance work on code paths for them, while also maintaining support
and reasonable performance for supplementary code points."

> If size is important and variable width of the representation of a
> code point is acceptable, then UTF-8 is usually a better choice. If
> O(1) indexing by code points is important, then UTF-32 is better.
> Nobody wants to process texts in terms of UTF-16 code units. Nobody
> wants to have surrogate processing sprinkled around the code, and thus
> if one accepts an API which extracts variable-width characters, then
> the API could as well deal with UTF-8, which is better for
> interoperability. UTF-16 makes no sense.

No, that's wrong. I've provided links to many documents written by
experts with experience in the field. For example, Dr. Mark Davis is a
co-founder of Unicode, president of the Consortium, original architect
of ICU, and Chief Globalization Architect at IBM. Richard Gillam was a
member of IBM's Unicode Technology Group and an Engineer at the Unicode
Technology Center for Java Technology. He was also part of the team that
added Unicode to JavaScript. People like Markus Scherer have similar
backgrounds. Each of those documents says the same thing: UTF-16 is the
best overall trade-off of space & time & ease-of-use.

But I'll tell you what. Find a document, written by someone with
substantial Unicode experience, that recommends UTF-32 as the best
overall in-memory encoding. I haven't found such a document, not a
single one, but maybe you can.
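As an aside, the fast-path idea from the quoted passage is easy to see
in code. Here's a minimal, illustrative Java sketch (Java strings are
UTF-16 natively; the class and method names are mine, not from any of
the cited documents): the common BMP case takes the cheap single-unit
branch, and surrogate pairing happens only on the rare path.

```java
// Counts code points in a UTF-16 string by walking its code units.
// BMP characters take the single-unit branch; supplementary code
// points (encoded as a high/low surrogate pair) take the two-unit
// branch. This mirrors the optimization the document describes.
public class Utf16Walk {
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                i += 2;   // supplementary code point: two code units
            } else {
                i += 1;   // BMP code point (or unpaired surrogate)
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "A" is BMP; U+1D11E (musical G clef) needs a surrogate pair.
        String s = "A\uD834\uDD1E";
        System.out.println(countCodePoints(s)); // prints 2
    }
}
```

The point either way: the surrogate logic has to exist somewhere, and
the argument is only about whether confining it to one rarely-taken
branch is an acceptable cost.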
(I mean that; maybe I wasn't searching in the right places or with the
right words.)

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
