> Is the following a valid summary of the issue?
>
> The existence of string-ref and string-set! operations seems to imply
> that a variable-length internal representation is not an option and
> a fixed-length representation wastes space and is therefore
> inefficient (mostly in an ascii-centered world).
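As an illustration of the trade-off the summary describes (a sketch, not from the post; the function names are made up for this example): with a variable-width encoding like UTF-8 there is no way to jump straight to the n-th character, so string-ref degenerates into a scan, while a fixed-width representation makes it a plain array index.

```python
def utf8_string_ref(buf: bytes, n: int) -> str:
    """Return the n-th code point of a (valid) UTF-8 buffer by scanning.

    UTF-8 is variable-width (1-4 bytes per code point), so reaching
    index n requires walking past the first n code points: O(n).
    """
    i = 0
    for _ in range(n):
        b = buf[i]
        if b < 0x80:      # 1-byte sequence (ASCII)
            i += 1
        elif b < 0xE0:    # 2-byte sequence
            i += 2
        elif b < 0xF0:    # 3-byte sequence
            i += 3
        else:             # 4-byte sequence
            i += 4
    b = buf[i]
    size = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
    return buf[i:i + size].decode("utf-8")


def ucs4_string_ref(buf: list[int], n: int) -> str:
    """With fixed-width UCS-4 (one slot per character), string-ref
    is a plain array index: O(1)."""
    return chr(buf[n])


s = "naïve déjà vu"
utf8 = s.encode("utf-8")        # 16 bytes for 13 characters
ucs4 = [ord(c) for c in s]      # 13 fixed-width slots

assert utf8_string_ref(utf8, 4) == "e"
assert ucs4_string_ref(ucs4, 4) == "e"
```

The space cost runs the other way, of course: the UCS-4 buffer spends four bytes on every character, which is what the "wastes space ... in an ascii-centered world" remark is about.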
Mostly. The one other consideration is the use of external libraries. Unicode is a very big standard, and parts of it (like collation) are very complicated. You really do not want to be writing your own implementation of the Unicode Collation Algorithm.

Windows and Mac are both UTF-16. Java and .NET are both UTF-16. IBM's ICU--an excellent open-source, cross-platform, cross-language [C, C++, Java] internationalization library--is UTF-16 (with increasing UTF-8 support). Linux (and, I believe, Solaris) are UCS-4. If you're serious about supporting Unicode, you probably want good UTF-16 support. UCS-4 support out in the wild just isn't very good on most platforms. (While it's supported on Linux, the implementation is bare-bones and produces some incorrect results.)

But if you're serious about supporting R5.92RS, you're faced with a string-ref that makes UCS-4 the easy path. By "easy" I don't just mean the implementation; I mean meeting the expectation that string-ref is O(1). If you don't meet that expectation, your performance on a lot of reasonable algorithms will be very poor. Furthermore, while it's true that you can convert UCS-4 to UTF-16 without loss, you probably don't want a system to do that silently each time it performs a comparison while sorting 100,000 strings. (I'm assuming a locale-aware comparison.) So in my opinion you want the encoding of whatever Scheme you use to match the encoding of any library you expect to use.

> Unicode text encoded in any one of the formats can be converted to
> another without loss of information (right?).

Yes. You left out one popular encoding, UCS-2. UCS-2 is a 16-bit encoding that doesn't support surrogate pairs. That limits it to Unicode's Basic Multilingual Plane. These days UCS-2 would probably be frowned on, but at least with UCS-2 the code unit size matches the scalar value size for the scalar values that UCS-2 supports. Gambit and Bigloo are two examples of Scheme systems that support UCS-2, not UTF-16.
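Both points above can be seen in a few lines of Python (an illustrative sketch, not from the post): a code point outside the Basic Multilingual Plane occupies two UTF-16 code units (a surrogate pair), which is exactly what UCS-2 cannot express, and the UTF encoding forms round-trip through one another without loss.

```python
# 'G' followed by U+1D11E (musical G clef), which lies outside the BMP.
s = "G\N{MUSICAL SYMBOL G CLEF}"

utf8  = s.encode("utf-8")
utf16 = s.encode("utf-16-be")
utf32 = s.encode("utf-32-be")

# U+1D11E takes 4 UTF-8 bytes and two UTF-16 code units -- the
# surrogate pair D834 DD1E. A 16-bit UCS-2 slot cannot hold it.
assert len(utf8) == 1 + 4
assert len(utf16) // 2 == 3           # 3 code units for 2 characters
assert utf16.hex() == "0047d834dd1e"
assert len(utf32) // 4 == 2           # UTF-32: one code unit per character

# Any encoding form converts to any other without loss of information.
assert utf16.decode("utf-16-be").encode("utf-32-be") == utf32
assert utf32.decode("utf-32-be").encode("utf-8") == utf8
```

Note that while the round trips are lossless, each re-encoding does real work, which is the point about not wanting silent conversions inside the comparison function of a 100,000-string sort.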
> Moreover, the internal representation of strings does not have to
> match the external representation. For example, you can read a
> UTF-32 encoded file into a variable-length buffer to save some space
> (sometimes); or alternatively, you can read a UTF-8 encoded file
> into a fixed-length buffer to save time on random access (sometimes).

Yes. On Linux, for example, UTF-8 is increasingly the default system encoding--but Linux's wide chars are UCS-4. Many of libc's string operations--e.g., strcoll--will work directly on UTF-8 strings; others first require conversion to UCS-4. (UCS-4 and UTF-32 both encode all Unicode characters. UTF-32 has additional semantic expectations.)

> From what I understand, UTF-8, UTF-16, and UTF-32 are interchange
> formats.

These days UTF-8 is the overwhelming favorite for transmitting and storing text, and is the assumed default of almost any new standard. I myself have never seen anyone transmit or store UTF-32.

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
