I am posting this as an individual member of the Scheme community. I am not speaking for the R6RS editors, and this message should not be confused with the editors' eventual formal response.

MichaelL wrote:

> Or when the abstraction leaks, as string-ref does for UTF-8 and UTF-16.

I don't understand what you mean by saying "the abstraction leaks" for string-ref and/or UTF-8 and UTF-16, particularly since the draft R6RS does not tell implementations to use UTF-8 or UTF-16 or not to use UTF-8 or UTF-16.

> Do
> you think that being able to write string-find portably & efficiently is
> important?

Yes. With the current draft R6RS, that can be done only if implementors have enough brains to provide O(1) amortized time for string-ref.

Implementors can accomplish that by any one of dozens of plausible strategies. The simplest strategy is to use UTF-32, and the more complex strategies use a mixture of representations, some of which may use caching.

I don't intend to teach a seminar here on implementation strategies for O(1) string-ref, but I'll describe just one simple strategy that achieves O(1) time for both string-ref and string-set! while using only a little more space than UTF-8.

The basic idea is to represent every string by an opaque, sealed record whose fields include a vector of bytevectors. All but the last of those bytevectors is the UTF-8 encoding of exactly 100 characters; the last one contains between 0 and 100 characters, inclusive, and contains 0 characters iff the length of the entire string is 0. Implementation of O(1) string-ref and string-set! for that representation is left as an exercise for readers who understand big-oh notation.

I don't expect any implementations to use a representation as bad as the one I described above. That was just to show that achieving O(1) time for string-ref and string-set! is child's play compared to some of the other stuff mandated by the current draft R6RS.

I do think most implementors have enough brains to provide efficient O(1) amortized time string-ref, but I could be wrong about that.
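
For readers who want to see the chunked representation described above in concrete form, here is a minimal sketch. It is written in Python rather than Scheme purely for illustration, and the names `ChunkedString`, `ref`, `set`, and `CHUNK` are hypothetical; nothing here comes from the draft R6RS or from any real implementation. The point it illustrates is that each access touches a single chunk of at most 100 characters (at most 400 bytes of UTF-8), so the work per operation is bounded by a constant regardless of the string's length.

```python
CHUNK = 100  # characters per chunk, as in the description above


class ChunkedString:
    """Hypothetical sketch: a string stored as a list of UTF-8 chunks."""

    def __init__(self, text=""):
        self._length = len(text)
        # Every chunk except the last encodes exactly CHUNK characters.
        # An empty string is stored as a single empty chunk.
        self._chunks = [
            bytearray(text[i:i + CHUNK].encode("utf-8"))
            for i in range(0, len(text), CHUNK)
        ] or [bytearray()]

    def __len__(self):
        return self._length

    def _check(self, k):
        if not 0 <= k < self._length:
            raise IndexError(k)

    def ref(self, k):
        """Return character k. Decodes one chunk only: bounded work."""
        self._check(k)
        chars = self._chunks[k // CHUNK].decode("utf-8")
        return chars[k % CHUNK]

    def set(self, k, ch):
        """Replace character k. Re-encodes one chunk only: bounded work."""
        self._check(k)
        if len(ch) != 1:
            raise ValueError("expected a single character")
        i, j = divmod(k, CHUNK)
        chars = self._chunks[i].decode("utf-8")
        # The chunk's byte length may change, but its character count does not.
        self._chunks[i] = bytearray(
            (chars[:j] + ch + chars[j + 1:]).encode("utf-8")
        )

    def to_str(self):
        """Reassemble the whole string (linear time; for inspection only)."""
        return "".join(c.decode("utf-8") for c in self._chunks)
```

Decoding a whole chunk on every access is wasteful in practice; a real implementation would scan the UTF-8 bytes in place or cache offsets. The sketch only shows why the asymptotic bound holds for this representation.
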

Programmers who are paranoid about the performance of string-ref can convert their strings to bytevectors in whatever byte-level representation they prefer, and hope that bytevector-ref is O(1).

To make it easier to write representation-specific algorithms in Scheme, someone could write a SRFI that provides conversions between R6RS strings and bytevectors that represent text using UTF-8, UTF-16, or UTF-32, and provides an appropriate set of operations for each of those bytevector representations. I don't think this SRFI needs to be part of the R6RS, since a portable reference implementation would solve the portability problem. Folding that SRFI into the R6RS wouldn't make it run any faster.

> The one other consideration is the use of external libraries. Unicode is a
> very big standard, and parts of it (like collation) are very complicated.
> You really do not want to be writing your own implementation of the
> Unicode Collation Algorithm.
>
> Windows and Mac are both UTF-16. Java and .NET are both UTF-16. IBM's
> ICU--an excellent open source, cross-platform, cross-language [C, C++,
> Java] internationalization library--is UTF-16 (with increasing UTF-8
> support). Linux (and, I believe, Solaris) are UCS-4.

Reading on:

> You left out one popular encoding, UCS-2.

And on:

> On Linux, for example, UTF-8 is increasingly the default system
> encoding--but Linux's wide-chars are UCS-4. Many of libc's string
> operations--eg, strcoll--will work directly on UTF-8 strings; others first
> require conversion to UCS-4.

And on:

> These days UTF-8 is the overwhelming favorite for transmitting and storing
> text, and is the assumed default of almost any new standard.

Summarizing: No single encoding is going to solve the problem of interfacing with external libraries (which, by the way, is a problem the draft R6RS does not even attempt to address).

Conclusion: The R6RS should not mandate any particular encoding or representation of strings. The current draft doesn't.
Will
