> Is the following a valid summary of the issue?
> 
>    The existence of string-ref and string-set! operations seems to imply
>    that a variable-length internal representation is not an option and
>    a fixed-length representation wastes space and is therefore
>    inefficient (mostly in an ASCII-centered world).

Mostly.

The one other consideration is the use of external libraries. Unicode is a 
very big standard, and parts of it (like collation) are very complicated. 
You really do not want to be writing your own implementation of the 
Unicode Collation Algorithm.

Windows and Mac are both UTF-16. Java and .NET are both UTF-16. IBM's 
ICU--an excellent open source, cross-platform, cross-language [C, C++, 
Java] internationalization library--is UTF-16 (with increasing UTF-8 
support). Linux (and, I believe, Solaris) are UCS-4.

If you're serious about supporting Unicode you probably want good UTF-16 
support. UCS-4 support out in the wild just isn't very good on most 
platforms. (While it's supported on Linux, the implementation is bare 
bones and produces some incorrect results.) But if you're serious about 
supporting R5.92RS you're faced with a string-ref that makes UCS-4 the 
easy path. By "easy" I don't just mean the implementation; I mean meeting 
the expectation that string-ref is O(1). If you don't meet that 
expectation your performance on a lot of reasonable algorithms will be 
very poor.
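
To make that cost concrete, here is a minimal Python sketch (illustrative only, not any Scheme system's actual implementation) of string-ref over the two buffer layouts: with UCS-4 the lookup is pure index arithmetic, while with UTF-16 you have to walk the code units from the start to skip surrogate pairs.

```python
import struct

# Sketch: why string-ref is O(1) over a UCS-4/UTF-32 buffer but O(n) over
# a UTF-16 buffer. Little-endian buffers assumed.
s = "a\U0001F600b"                      # three scalar values; the middle
                                        # one is outside the BMP
utf32 = s.encode("utf-32-le")           # fixed 4 bytes per scalar value
utf16 = s.encode("utf-16-le")           # the non-BMP character needs a
                                        # surrogate pair (two 16-bit units)

def utf32_ref(buf, i):
    # Constant time: the i-th scalar value sits at byte offset 4*i.
    return chr(struct.unpack_from("<I", buf, 4 * i)[0])

def utf16_ref(buf, i):
    # Linear time: scan from the start, skipping surrogate pairs.
    pos = 0
    for _ in range(i):
        unit = struct.unpack_from("<H", buf, pos)[0]
        pos += 4 if 0xD800 <= unit <= 0xDBFF else 2
    hi = struct.unpack_from("<H", buf, pos)[0]
    if 0xD800 <= hi <= 0xDBFF:
        lo = struct.unpack_from("<H", buf, pos + 2)[0]
        return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
    return chr(hi)

assert utf32_ref(utf32, 1) == utf16_ref(utf16, 1) == "\U0001F600"
```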

Furthermore, while it's true that you can convert UCS-4 to UTF-16 without 
loss, you probably don't want a system to do that silently each time it 
performs a comparison while sorting 100,000 strings. (I'm assuming a 
locale-aware comparison.) So in my opinion you want the encoding of 
whatever Scheme you use to match the encoding of any library you expect to 
use.

> Unicode text encoded in any one of the formats can be converted to 
> another without loss of information (right?).

Yes. 
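
The round trip is easy to check mechanically; a quick Python sketch:

```python
# Sketch: Unicode text survives conversion between UTF-8, UTF-16, and
# UTF-32 without loss -- they are just different encodings of the same
# sequence of scalar values.
s = "Gr\u00fc\u00dfe, \u4e16\u754c, \U0001F30D"  # Latin, CJK, and a
                                                 # non-BMP character

for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    assert s.encode(enc).decode(enc) == s

# Chaining conversions through every format also gets you back where
# you started.
utf8  = s.encode("utf-8")
utf16 = utf8.decode("utf-8").encode("utf-16-le")
utf32 = utf16.decode("utf-16-le").encode("utf-32-le")
assert utf32.decode("utf-32-le").encode("utf-8") == utf8
```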

You left out one popular encoding, UCS-2. UCS-2 is a 16-bit encoding that 
doesn't support surrogate pairs. That limits it to Unicode's Basic 
Multilingual Plane. These days UCS-2 would probably be frowned on, but at 
least with UCS-2 the code unit size matches the scalar value size for the 
scalar values that UCS-2 supports. Gambit and Bigloo are two examples of 
Scheme systems that support UCS-2, not UTF-16.
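
The BMP restriction is simple to state as a predicate; a Python sketch (the name fits_ucs2 is hypothetical, just for illustration):

```python
# Sketch: UCS-2 can represent exactly the scalar values a single 16-bit
# code unit can hold, i.e. the Basic Multilingual Plane (U+0000..U+FFFF).
def fits_ucs2(s):
    return all(ord(c) <= 0xFFFF for c in s)

assert fits_ucs2("Sch\u00e8me")          # all BMP characters
assert not fits_ucs2("\U0001D11E")       # U+1D11E MUSICAL SYMBOL G CLEF
                                         # lies outside the BMP
```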

> Moreover, the internal representation of strings does not have to 
> match the external representation.  For example,
> you can read a UTF-32 encoded file into a variable-length buffer to save
> some space (sometimes); or alternatively, you can read a UTF-8 
> encoded file into a fixed-length buffer to save time on 
> random-access (sometimes).

Yes.

On Linux, for example, UTF-8 is increasingly the default system 
encoding--but Linux's wide-chars are UCS-4. Many of libc's string 
operations--e.g., strcoll--will work directly on UTF-8 strings; others first
require conversion to UCS-4. (UCS-4 and UTF-32 both encode all Unicode 
characters. UTF-32 has additional semantic expectations.)
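
As a rough picture of that split (a Python sketch, not glibc itself; the 4-byte wchar_t is the Linux/glibc assumption):

```python
# Sketch: the same text as UTF-8 bytes (Linux's usual system encoding)
# and as the UCS-4 code-point array its wide-char functions would see.
utf8 = "d\u00e9j\u00e0 vu".encode("utf-8")       # 9 bytes: the two accented
                                                 # letters take 2 bytes each
ucs4 = [ord(c) for c in utf8.decode("utf-8")]    # 7 values, one per wchar_t
assert len(utf8) == 9 and len(ucs4) == 7
assert ucs4[1] == 0x00E9                         # U+00E9, LATIN SMALL
                                                 # LETTER E WITH ACUTE
```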

>  From what I understand, UTF-8, UTF-16, and UTF-32 are interchange 
> formats.

These days UTF-8 is the overwhelming favorite for transmitting and storing 
text, and is the assumed default of almost any new standard. I myself have 
never seen anyone transmit or store UTF-32.


_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
