Re: [r6rs-discuss] Re: [Formal] formal comment (ports, characters, strings, Unicode)

William D Clinger Mon, 19 Mar 2007 20:08:47 -0800

I am posting this as an individual member of the Scheme
community.  I am not speaking for the R6RS editors, and
this message should not be confused with the editors'
eventual formal response.

MichaelL wrote:

> Or when the abstraction leaks, as string-ref does for UTF-8 and UTF-16.

I don't understand what you mean by saying "the abstraction
leaks" for string-ref and/or UTF-8 and UTF-16, particularly
since the draft R6RS neither requires nor forbids the use
of UTF-8 or UTF-16 as a string representation.

> Do you think that being able to write string-find portably & efficiently
> is important?

Yes.  With the current draft R6RS, that can be done only if
implementors have enough brains to provide O(1) amortized
time for string-ref.  Implementors can accomplish that by
any one of dozens of plausible strategies.  The simplest
strategy is to use UTF-32, and the more complex strategies
use a mixture of representations, some of which may use
caching.

I don't intend to teach a seminar here on implementation
strategies for O(1) string-ref, but I'll describe just one
simple strategy that achieves O(1) time for both string-ref
and string-set! while using only a little more space than
UTF-8.  The basic idea is to represent every string by an
opaque, sealed record whose fields include a vector of
bytevectors.  Each of those bytevectors except the last is
the UTF-8 encoding of exactly 100 characters; the last one
contains between 0 and 100 characters, inclusive, and
contains 0 characters iff the length of the entire string
is 0.

Implementation of O(1) string-ref and string-set! for that
representation is left as an exercise for readers who
understand big-oh notation.
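
One way to work the exercise, as a minimal sketch in Python
rather than Scheme.  The class name ChunkedString, its
methods ref and set, and the constant CHUNK are hypothetical
names chosen for illustration; they come from no R6RS draft
or implementation.  Each operation touches a single chunk of
at most 100 characters (at most 400 bytes of UTF-8), so its
cost is bounded by a constant that does not depend on the
length of the string.

```python
CHUNK = 100  # characters per chunk, as in the description above


class ChunkedString:
    """Hypothetical string type: a list of UTF-8 encoded chunks."""

    def __init__(self, text: str):
        self.length = len(text)
        # Every chunk except the last encodes exactly CHUNK characters.
        # An empty string is represented by one empty chunk.
        self.chunks = [
            bytearray(text[i:i + CHUNK].encode("utf-8"))
            for i in range(0, len(text), CHUNK)
        ] or [bytearray()]

    def _check(self, k: int) -> None:
        if not 0 <= k < self.length:
            raise IndexError(k)

    def ref(self, k: int) -> str:
        """Analogue of string-ref: decode one chunk, then index into it."""
        self._check(k)
        chunk = self.chunks[k // CHUNK]
        # Decoding at most CHUNK characters is bounded work.
        return chunk.decode("utf-8")[k % CHUNK]

    def set(self, k: int, ch: str) -> None:
        """Analogue of string-set!: re-encode only the affected chunk."""
        self._check(k)
        if len(ch) != 1:
            raise ValueError("expected a single character")
        i = k // CHUNK
        chars = list(self.chunks[i].decode("utf-8"))
        chars[k % CHUNK] = ch
        # The chunk may grow or shrink in bytes; other chunks are untouched.
        self.chunks[i] = bytearray("".join(chars).encode("utf-8"))
```

Decoding the whole chunk is wasteful but keeps the sketch
short; scanning the chunk for the k-th lead byte would give
the same constant bound with less copying.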

I don't expect any implementations to use a representation
as bad as the one I described above.  That was just to show
that achieving O(1) time for string-ref and string-set! is
child's play compared to some of the other stuff mandated
by the current draft R6RS.

I do think most implementors have enough brains to provide
efficient O(1) amortized time string-ref, but I could be
wrong about that.  Programmers who are paranoid about the
performance of string-ref can convert their strings to
bytevectors in whatever byte-level representation they
prefer, and hope that bytevector-u8-ref is O(1).

To make it easier to write representation-specific
algorithms in Scheme, someone could write a SRFI that
provides conversions between R6RS strings and bytevectors
that represent text using UTF-8, UTF-16, or UTF-32,
and provides an appropriate set of operations for each
of those bytevector representations.  I don't think this
SRFI needs to be part of the R6RS, since a portable
reference implementation would solve the portability
problem.  Folding that SRFI into the R6RS wouldn't
make it run any faster.
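
To make the idea concrete, here is a rough sketch, in Python
rather than Scheme, of the kind of operations such a library
might offer.  The function names string_to_bytes,
bytes_to_string, and utf8_find are hypothetical and belong to
no existing SRFI.  The search works directly on the UTF-8
bytes, which is safe because a valid UTF-8 needle can match
only at a character boundary of a valid UTF-8 haystack; the
byte offset is then converted to a character index by
counting lead bytes.

```python
def string_to_bytes(text: str, encoding: str = "utf-8") -> bytes:
    """Convert a string to a byte-level representation."""
    return text.encode(encoding)


def bytes_to_string(data: bytes, encoding: str = "utf-8") -> str:
    """Convert a byte-level representation back to a string."""
    return data.decode(encoding)


def utf8_find(haystack: bytes, needle: bytes) -> int:
    """Return the character index of the first match, or -1.

    Both arguments are assumed to be valid UTF-8.
    """
    offset = haystack.find(needle)
    if offset < 0:
        return -1
    # Continuation bytes have the form 10xxxxxx; every other byte
    # starts a character, so counting those gives the character index.
    return sum(1 for b in haystack[:offset] if (b & 0xC0) != 0x80)
```

For example, utf8_find(string_to_bytes("naïve café"),
string_to_bytes("café")) yields 6, the same index a
character-level search would report.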

> The one other consideration is the use of external libraries. Unicode is a 
> very big standard, and parts of it (like collation) are very complicated. 
> You really do not want to be writing your own implementation of the 
> Unicode Collation Algorithm.
> 
> Windows and Mac are both UTF-16. Java and .NET are both UTF-16. IBM's 
> ICU--an excellent open source, cross-platform, cross-language [C, C++, 
> Java] internationalization library--is UTF-16 (with increasing UTF-8 
> support). Linux (and, I believe, Solaris) are UCS-4.

Reading on:

> You left out one popular encoding, UCS-2.

And on:

> On Linux, for example, UTF-8 is increasingly the default system 
> encoding--but Linux's wide-chars are UCS-4. Many of libc's string 
> operations--eg, strcoll--will work directly on UTF-8 strings; others first 
> require conversion to UCS-4.

And on:

> These days UTF-8 is the overwhelming favorite for transmitting and storing 
> text, and is the assumed default of almost any new standard.

Summarizing:  No single encoding is going to solve the
problem of interfacing with external libraries (which,
by the way, is a problem the draft R6RS does not even
attempt to address).

Conclusion:  The R6RS should not mandate any particular
encoding or representation of strings.

The current draft doesn't.

Will
