I think people who favor strings-as-codepoint-vectors must think that
the codepoint is a good level of abstraction for text. Really it's not.

    One or more Unicode characters may make up what the user thinks of
    as a character or basic unit of the language. To avoid ambiguity
    with the computer use of the term character, this is called a
    grapheme cluster. For example, "G" + acute-accent is a grapheme
    cluster: it is thought of as a single character by users, yet is
    actually represented by two Unicode code points.

      -- Unicode Standard Annex #29,
         http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
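To make that concrete, here's a minimal Java sketch (Java being one of
the UTF-16 designs discussed below). It assumes that BreakIterator's
"character" instance approximates UAX #29 grapheme cluster boundaries,
which holds for this simple case:

    import java.text.BreakIterator;

    public class GraphemeDemo {
        public static void main(String[] args) {
            // "G" + U+0301 COMBINING ACUTE ACCENT: one user-perceived
            // character, two codepoints, two UTF-16 code units.
            String s = "G\u0301";
            System.out.println(s.length());                      // 2 code units
            System.out.println(s.codePointCount(0, s.length())); // 2 codepoints

            // Count grapheme clusters by walking the boundaries.
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            int clusters = 0;
            while (it.next() != BreakIterator.DONE) clusters++;
            System.out.println(clusters);                        // 1 cluster
        }
    }

So a codepoint index doesn't line up with what users call a character
any better than a code-unit index does.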
In Java, C#, and in all likelihood Python 3.0, strings are immutable
sequences of 16-bit values (UTF-16 code units). Surrogate pairs get no
special treatment. This is a good design.

Treating a string as a sequence of Unicode codepoints has few
real-world use cases. For ordinary text-munging, we use higher-level
functions such as (string-append), (string-find), (string-replace),
(string-starts-with?), and so on. In other words, the objects we want
to use when working with strings are... substrings.

Note that all these useful functions can be implemented "naively" in
terms of UTF-16 code units and they'll work just fine, even on
surrogate pairs (the first sketch at the end of this message shows
this).

The only use cases I know of for codepoint sequences are to implement
Unicode algorithms, like laying out bidirectional text. Here UTF-16 is
no real burden compared to the sheer complexity of the task at hand.
(See http://unicode.org/reports/tr9/ for example.)

By contrast, passing a UTF-16 string to some external function is an
extremely common and important use case. It's especially important on
Windows and for anything that targets the JVM or CLR.

I think people who favor strings-as-codepoint-vectors must also think
that breaking a surrogate pair is really bad. But even with a
codepoint-centric view of text you can unwittingly break a grapheme
cluster, which amounts to the same sort of bug--it can lead to garbled
text--and which is probably much *more* common in practice (the second
sketch below illustrates it). I never hear anyone complain about that.

Making strings vectors of 16-bit values is simple, familiar,
speed-efficient, memory-efficient, easy to implement, and convenient
for programmers.
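The first sketch: a naive, code-unit-based search in Java. U+1D11E
MUSICAL SYMBOL G CLEF lies outside the BMP, so it is stored as the
surrogate pair \uD834\uDD1E, yet indexOf, startsWith, and replace all
handle it correctly, because the pair simply matches as two consecutive
code units:

    public class SurrogateSearch {
        public static void main(String[] args) {
            // U+1D11E as a surrogate pair.
            String clef = "\uD834\uDD1E";
            String text = "score: " + clef + " end";

            // Plain code-unit comparisons, no surrogate-aware logic.
            System.out.println(text.indexOf(clef));       // 7
            System.out.println(text.startsWith("score")); // true
            System.out.println(text.replace(clef, "*"));  // score: * end
        }
    }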
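The second sketch: code that slices scrupulously on codepoint
boundaries, using Java's codepoint-aware offsetByCodePoints, and still
garbles the text by splitting a grapheme cluster. (firstCodePoints is a
hypothetical helper written just to illustrate the failure.)

    public class GraphemeSplit {
        // Take the first n codepoints of s -- this can never break a
        // surrogate pair.
        static String firstCodePoints(String s, int n) {
            return s.substring(0, s.offsetByCodePoints(0, n));
        }

        public static void main(String[] args) {
            // "cafe" + U+0301 COMBINING ACUTE ACCENT: the user sees
            // "café" -- five codepoints, four grapheme clusters.
            String s = "cafe\u0301";

            // Truncating to four *codepoints* is surrogate-safe, yet it
            // strands the accent and prints plain "cafe".
            System.out.println(firstCodePoints(s, 4));    // cafe
        }
    }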
-j