> > Bytevectors are definitely a very useful low-level addition to Scheme. But
> > single/multi-byte strings were, I think, an unnecessary loss, especially
> > for those who do lots of operating system- and library-level work.
>
> You seem to be lamenting the loss of something that never
> was.

Am I? Perhaps I'm focusing too narrowly on the Scheme implementations I've used. Bigloo has separate Unicode/non-Unicode (single/multi-byte) character & string types; Chicken & Chez both have non-Unicode (single/multi-byte) character & string types. All of them have good foreign interface support.

And single/multi-byte strings still dominate most foreign libraries. To get a quick sense, search for "char *" vs "wchar_t *" at http://www.gnu.org/software/libc/manual/html_mono/libc.html. Even good old fopen requires a single/multi-byte string!

> > In fact, my position
> > would be even more extreme: I lament the loss of single/multi byte strings
> > in general (which would include UTF-8). They're still useful for low-level
> > work. In fact, they'll still be needed--think of the various Scheme to C
> > compilers, for example, that will need a char equivalent--they just won't
> > be standardized anymore.
>
> I don't know exactly what you mean by single/multi byte
> strings, but you indicated that they include UTF-8.
>
> I am not aware of anything in R5RS that would correspond
> to any definition of single/multi byte strings that would
> include UTF-8.

By "single/multi-byte string" I mean the equivalent of C's "char *" type (as opposed to "wchar_t *"). On Linux, Mac, and Solaris, UTF-8 is a supported multi-byte encoding; indeed, it is becoming the preferred encoding. (See, for example, "What programming languages support Unicode?" and following at http://www.cl.cam.ac.uk/~mgk25/unicode.html.) Mac's wchar_t is UTF-16, and Linux and Solaris are UCS-4, but in fact many libc str/mb functions work correctly on UTF-8 strings without conversion to wchar_t. (I may be wrong about Solaris' wchar_t type; it's been a while since I looked.)

> So what do you mean by saying they "won't
> be standardized anymore"?

Well, as I said, I was probably thinking only about the Schemes I've used over the last couple of years, and in all of them "string" was the equivalent of "single/multi-byte string." Bad assumption. Gambit, as far as I remember, is UCS-2, and PLT went Unicode a while ago, though I don't remember which encoding they use.

But any Scheme with a good foreign interface will have to deal with char strings. I presume there will be two choices: either convert strings automatically, or provide two different character & string types. If you care about control & performance, the first option isn't good, so you'd want the second. And if you go the second route, each Scheme will come up with its own set of names and operations for single/multi-byte strings.

So: right now, within the Schemes I've used, there's agreement on single/multi-byte strings but none on Unicode, and I'm going to end up trading that for agreement on Unicode and none on single/multi-byte strings.

Btw, this is of practical interest to me. These days I make a living writing cross-platform web server software in Chez. After a long look at all of these issues, we've decided to add Unicode support to Chez via IBM's ICU library (http://www-306.ibm.com/software/globalization/icu/index.jsp). ICU is UTF-16, but we'll read and write UTF-8, and we'll keep our UTF-8 strings in Chez' current string type.
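
To make the earlier point about libc str functions and UTF-8 concrete, here is a minimal C sketch. The helper names utf8_codepoints and utf8_find are made up for illustration; they are not part of libc, Chez, or ICU. The sketch assumes valid, NUL-terminated UTF-8 input and does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string
   by counting every byte that is not a continuation byte (10xxxxxx).
   Assumes the input is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: byte offset of needle in haystack, or -1 if absent.
   Plain strstr is enough here: in UTF-8, lead bytes and continuation bytes
   use disjoint ranges, so a valid needle can only match at a character
   boundary. No conversion to wchar_t is involved. */
long utf8_find(const char *haystack, const char *needle)
{
    const char *p;
    assert(haystack != NULL && needle != NULL);
    p = strstr(haystack, needle);
    return p ? (long)(p - haystack) : -1L;
}
```

For example, with the UTF-8 string "caf\xc3\xa9 na\xc3\xafve" (the words "café naïve"), strlen reports 12 bytes, utf8_codepoints reports 10 characters, and utf8_find locates "na\xc3\xafve" at byte offset 6. The byte-oriented functions keep working; only the notion of "length in characters" needs extra care.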
