Re: [r6rs-discuss] perhaps i should be formal, but....

MichaelL Wed, 14 Mar 2007 19:50:41 -0800

> > Bytevectors are definitely a very useful low-level addition to Scheme.
> > But single/multi-byte strings were, I think, an unnecessary loss,
> > especially for those who do lots of operating system- and library-level
> > work.
> 
> You seem to be lamenting the loss of something that never
> was.


Am I? Perhaps I'm focusing too narrowly on the Scheme implementations I've 
used. Bigloo has separate Unicode/non-Unicode (single/multi-byte) 
character & string types; Chicken & Chez both have non-Unicode 
(single/multi-byte) character & string types. All of them have good 
foreign interface support. And single/multi-byte strings still dominate 
most foreign libraries. To get a quick sense, search for "char *" vs 
"wchar_t *" at 
http://www.gnu.org/software/libc/manual/html_mono/libc.html. Even good old 
fopen requires a single/multi-byte string!

> > In fact, my position
> > would be even more extreme: I lament the loss of single/multi byte
> > strings in general (which would include UTF-8). They're still useful for
> > low-level work. In fact, they'll still be needed--think of the various
> > Scheme to C compilers, for example, that will need a char
> > equivalent--they just won't be standardized anymore.
> 
> I don't know exactly what you mean by single/multi byte
> strings, but you indicated that they include UTF-8.
>
> I am not aware of anything in R5RS that would correspond
> to any definition of single/multi byte strings that would
> include UTF-8. 

By "single/multi-byte string" I mean the equivalent of C's "char *" type 
(as opposed to "wchar_t *"). On Linux, Mac, and Solaris, UTF-8 is a 
supported multi-byte encoding; indeed, it is becoming the preferred 
encoding. (See, for example, "What programming languages support Unicode?" 
and following at http://www.cl.cam.ac.uk/~mgk25/unicode.html.) Mac's 
wchar_t is UTF-16, and Linux and Solaris are UCS-4, but in fact many libc 
str/mb functions work correctly on UTF-8 strings without conversion to 
wchar_t. (I may be wrong about Solaris' wchar_t type; it's been a while 
since I looked.)
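
To illustrate why byte-oriented functions cope with UTF-8 held in a plain
"char *", here is a minimal C sketch. It assumes the input is well-formed
UTF-8; utf8_codepoints and utf8_find are hypothetical helpers written for
this example, not libc functions. Only strstr comes from the C library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libc): count code points in a
   NUL-terminated UTF-8 string by skipping continuation bytes, which
   always have the bit pattern 10xxxxxx. Assumes well-formed input. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Hypothetical helper: byte offset of the first occurrence of needle in
   haystack, or -1 if absent. Plain strstr is enough here because, in
   well-formed UTF-8, the bytes of one encoded character never match
   starting in the middle of another character. */
static long utf8_find(const char *haystack, const char *needle)
{
    const char *p = strstr(haystack, needle);
    return p ? (long)(p - haystack) : -1;
}
```

For the six-byte string "na\xc3\xaf" "ve" (the word "naive" with a
diaeresis on the i), strlen reports 6 bytes while utf8_codepoints reports
5 characters, and no wchar_t conversion is involved in either call.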

> So what do you mean by saying they "won't
> be standardized anymore"?

Well, as I said, I was probably thinking only about the Schemes I've used 
over the last couple of years, and in all of them "string" was the 
equivalent of "single/multi-byte string." Bad assumption. Gambit, as far 
as I remember, is UCS-2, and PLT went Unicode a while ago, though I don't 
remember which encoding they use.

But any Scheme with a good foreign interface will have to deal with char 
strings. I presume there will be two choices: either convert strings 
automatically, or provide two different character & string types. If you 
care about control & performance, the first option isn't good, so you'd 
want the second. And if you go the second route each Scheme will come up 
with its own set of names and operations for single/multi-byte strings. 
So: right now within the Schemes I've used there's agreement on 
single/multi-byte strings but none on Unicode, and I'm going to end up 
trading that for agreement on Unicode and none on single/multi-byte 
strings.
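
As a rough picture of what the first option (automatic conversion) costs,
here is a sketch in C of the copy a foreign interface would have to make
before every call that expects a "char *". It assumes the Scheme side
stores strings as UCS-4 code points and that the foreign side wants UTF-8;
the names utf8_encode and ucs4_to_utf8 are hypothetical and do not come
from any particular Scheme implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes assumed).
   Returns the number of bytes written, or 0 for surrogates and values
   above U+10FFFF. */
static size_t utf8_encode(uint32_t cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical FFI helper: copy a UCS-4 string of len code points into a
   freshly malloc'd, NUL-terminated UTF-8 buffer. The caller frees the
   result. Returns NULL on allocation failure or an invalid code point. */
static char *ucs4_to_utf8(const uint32_t *s, size_t len)
{
    char *buf = malloc(len * 4 + 1);        /* worst case: 4 bytes each */
    size_t n = 0;
    if (buf == NULL)
        return NULL;
    for (size_t i = 0; i < len; ++i) {
        size_t k = utf8_encode(s[i], buf + n);
        if (k == 0) {
            free(buf);
            return NULL;
        }
        n += k;
    }
    buf[n] = '\0';
    return buf;
}
```

Every foreign call taking a string would pay for one allocation and one
pass over the data like this, which is the control and performance cost
that makes the second option (a separate single/multi-byte string type)
attractive.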

Btw, this is of practical interest to me. These days I make a living 
writing cross-platform web server software in Chez. After a long look at 
all of these issues we've decided to add Unicode support to Chez via IBM's 
ICU library (http://www-306.ibm.com/software/globalization/icu/index.jsp). 
ICU is UTF-16, but we'll read and write UTF-8 and we'll keep our UTF-8 
strings in Chez' current string type.

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
