[r6rs-discuss] Why Unicode matters

John Cowan Wed, 18 Feb 2009 15:49:49 -0800

As the R6RS process's chief Unicode hound, I'd like to say a word or
two about why I think Unicode matters.  There are at least three kinds
of reasons.


1) If a process must deal with text, it should be designed from the
ground up to deal with text in a universal encoding, converting to local
encodings only when required to interface with surrounding systems.  It's
been estimated that building in Unicode adds perhaps 20% to development
cost, whereas retrofitting it adds about 100%.  That's an "industrial"
motive to support Unicode, and although the (rnrs unicode (6)) library
doesn't come close to providing all that's needed for practical work,
it does provide a useful core.

2) Scheme requires that there exist in the application domain strings
which are constructed as sequences of characters.  (I think that's a
mistake: I'd rather have strings as primitives and understand characters
to be a finite subset of short strings.)  Having the significance and
interpretation of characters differ from one implementation to the next
is a needless kind of variation: in practice it means that portable
programs must be confined to ASCII data.  Breaking the historical link
between characters and octets is something that should be done in the
core whether or not anything else about Unicode is supported.

3) But the deepest reason, I believe, is that Scheme programmers are
themselves dealing with text when they write their programs, and if the
repertoire of characters allowed in a program is non-universal, the result
is an unfair disadvantage for people who use another repertoire natively.

I've been told that one of the main reasons that Java caught on so quickly
in Japan is that it was the first mainstream language that required
implementations to support meaningful Japanese identifiers written in
the native script.  Java's reserved identifiers of course had to remain
in Latin script, but there are only a few of them compared to the vast
number of identifiers in a Java program.  More serious was the fact that
the names of existing standard and non-standard Java libraries were and
are typically Latin.

In Scheme, of course, there are no reserved words, and macros permit
arbitrary renamings so that a whole program could be valid Scheme
even though not a single Latin-script identifier appeared outside the
mapping macros.  Allowing programmers to write Scheme using meaningful
identifiers from their native language, written in the usual way, is to
me a matter of elemental fairness, and ought to be allowed in all cases.

-- 
John Cowan    [email protected]    http://ccil.org/~cowan
Half the lies they tell about me are true.
        --Tallulah Bankhead, American actress

