Re: [r6rs-discuss] Why Unicode matters

Thomas Lord Wed, 18 Feb 2009 18:54:02 -0800

On Wed, 2009-02-18 at 18:48 -0500, John Cowan wrote:
> As the R6RS process's chief Unicode hound, I'd like to say a word or
> two about why I think Unicode matters.  There are at least three kinds
> of reasons.





I'm one of those who has been critical of the particular
impact that Unicode had on R6RS so maybe it would be helpful
to explain my complaints and to make clear what I'm not
complaining about.


We agree: Scheme character and string primitives and 
library functions "should be designed from the ground up
to [be able to] deal with text in a universal encoding."

I appreciate that: almost all popular criticisms of 
Unicode as a universal character system are wrong.
Unicode is designed extremely, almost painfully, well.
The correct architecture it uses is then well appointed
with the actual list of abstract characters, arrived at
by an impossibly difficult political process.  Like
Scheme, Unicode is a gem.

We agree: "Scheme programmers are themselves dealing 
with text when they write their programs, and if the
repertoire of characters allowed in a program is 
non-universal, the result is an unfair disadvantaging
of people who use another repertoire natively."


I additionally appreciate: With a large hat tip to
you, not to embarrass you, R6 appears to get its 
support for Unicode "simply Right".   Not in some
vague, namby-pamby "oh, I know what 'the right thing'
is when I see it" but in a very objective way: it's 
a faithful, basis-set complete reification of the
logical model of Unicode, which is itself "simply Right"
for reasons not worth rehearsing here.   The Scheme
community should look at this with pride when comparing
themselves with, for example, the history of Java, Python,
et al.

On the issue of whether string access is "O(1)":
fogeddaboutit.   RAM (main memory) itself is significantly
far from O(1).  As I noted somewhere back there, and
Clinger has looked at more seriously, and others have
noticed elsewhere:  "ropes" and wide-as-needed choices
of code units for the leaf nodes of ropes are about as
good as it gets for any universal encoding;  UTF-8 is 
just fine for many situations that don't need that level
of performance but that prefer quick and easy implementation
given available resources.
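
To make the rope idea concrete, here is a minimal sketch, in
Python only for brevity.  Every name in it (pack, Leaf, Node,
leaf, concat) is made up for illustration and comes from no
Scheme implementation.  It assumes each leaf stores its text in
the narrowest fixed-width code unit (1, 2 or 4 bytes) that fits,
and it leaves out rebalancing and bounds checks, which a real
rope would need.

```python
from dataclasses import dataclass


def pack(text):
    """Pick the narrowest fixed-width code unit that holds every character."""
    top = max(map(ord, text), default=0)
    width = 1 if top < 0x100 else 2 if top < 0x10000 else 4
    codec = {1: "latin-1", 2: "utf-16-le", 4: "utf-32-le"}[width]
    return width, text.encode(codec)


@dataclass(frozen=True)
class Leaf:
    width: int   # bytes per code unit: 1, 2 or 4
    data: bytes  # characters stored back to back, little-endian

    def __len__(self):
        return len(self.data) // self.width

    def ref(self, i):
        # Fixed width means indexing inside a leaf is plain arithmetic.
        chunk = self.data[i * self.width:(i + 1) * self.width]
        return chr(int.from_bytes(chunk, "little"))


@dataclass(frozen=True)
class Node:
    left: object
    right: object
    length: int  # cached total number of characters below this node

    def __len__(self):
        return self.length

    def ref(self, i):
        # Descend left or right by comparing against the left length.
        n = len(self.left)
        return self.left.ref(i) if i < n else self.right.ref(i - n)


def leaf(text):
    return Leaf(*pack(text))


def concat(a, b):
    # Concatenation shares both subtrees instead of copying them.
    return Node(a, b, len(a) + len(b))


# Three leaves, each with a different code-unit width.
rope = concat(leaf("scheme "), concat(leaf("\u03bb\u2192 "), leaf("\U0001d11e")))
```

Indexing costs one step per tree level plus constant work in
the leaf, so with a balanced tree it is logarithmic rather than
O(1), while mostly-ASCII text still pays only one byte per
character.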

It's all really quite good.  Except:

1) Should the standard allow a conforming implementation
to support only a smaller character set in source texts?
(I say "yes" for pragmatic reasons - I don't need all
of Unicode or all the Scheme libraries in the world for
a tiny embedded system, for example.)

2) Should the standard allow a conforming implementation
to support a larger character set than Unicode? (I say "yes"
for experience-based reasons.  For example, you'll pry my
bucky-bits from my cold dead hands.)

3) Should the standard allow a conforming implementation
to treat as characters a class of objects such as Unicode
compound characters (an infinite character set)?  (I am persuaded
by what Dillinger has sketched - "yes" again.)
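
A small illustration of why the set in (3) is infinite, again
in Python only for brevity.  The function name compound_chars
is hypothetical, and the grouping rule (attach every code point
with a nonzero canonical combining class to the preceding base)
is a deliberate simplification, not the full Unicode
segmentation algorithm.

```python
import unicodedata


def compound_chars(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        # combining() returns the canonical combining class; 0 means "not a mark"
        # under this simplified rule.
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


# One base letter plus any number of marks is still a single compound character.
stacked = "o" + "\u0302\u0301\u0323" * 4
groups = compound_chars("e\u0301a" + stacked)
```

Since nothing bounds how many marks may follow a base, no
fixed-width code can enumerate all such objects, which is why
treating them as characters needs room in the standard's
wording.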

The character/string language in R6 is flawed mainly in that
it fails to be artfully ambiguous enough to allow those
possibilities even while specing out proper Unicode support.

I have a sense that you want to use a standard to "force"
implementers not to blow off Unicode support.  I think
that's philosophically and politically wrong.

There's no need to rehash these arguments right here, right
now - you have some sense of how the arguments on my side of
these debates go and we've all the time in the world to 
maybe get around to finding agreement or not.   I'm just saying
that these issues with the current Unicode support are *my*
issues with it and I'm suggesting, not trying to prove, that
they sum up everyone else's complaints, too.

Summing up: The R6 Unicode stuff is really, really good work.
It's good for Scheme.  It's good education for the Scheme
community.  It's achingly textbook perfect beautiful for how,
holding the Unicode standard in one hand and an assignment to
design a programming language in the other, you should do it.
It's fantastic.  Except that it's not ambiguous enough and 
displays some mild "fascist" tendencies.   The problems are easily
fixed, though.

Regards,
-t






> 
> 1) If a process must deal with text, it should be designed from the
> ground up to deal with text in a universal encoding, converting to local
> encodings only when required to interface with surrounding systems.  It's
> been estimated that building in Unicode adds perhaps 20% to development
> cost, whereas retrofitting it adds about 100%.  That's an "industrial"
> motive to support Unicode, and although the (rnrs unicode (6)) library
> doesn't come close to providing all that's needed for practical work,
> it does provide a useful core.
> 
> 2) Scheme requires that there exist in the application domain strings
> which are constructed as sequences of characters.  (I think that's a
> mistake: I'd rather have strings as primitives and understand characters
> to be a finite subset of short strings.)  Having the significance and
> interpretation of characters differ from one implementation to the next
> is a needless kind of variation: in practice it means that portable
> programs must be confined to ASCII data.  Breaking the historical link
> between characters and octets is something that should be done in the
> core whether or not anything else about Unicode is supported.
> 
> 3) But most deeply, I believe, is the fact that Scheme programmers are
> themselves dealing with text when they write their programs, and if the
> repertoire of characters allowed in a program is non-universal, the result
> is an unfair disadvantaging of people who use another repertoire natively.
> 
> I've been told that one of the main reasons that Java caught on so quickly
> in Japan is that it was the first mainstream language that required
> implementations to support meaningful Japanese identifiers written in
> the native script.  Java's reserved identifiers of course had to remain
> in Latin script, but there are only a few of them compared to the vast
> number of identifiers in a Java program.  More serious was the fact that
> the names of existing standard and non-standard Java libraries were and
> are typically Latin.
> 
> In Scheme, of course, there are no reserved words, and macros permit
> arbitrary renamings so that a whole program could be valid Scheme
> even though not a single Latin-script identifier appeared outside the
> mapping macros.  Allowing programmers to write Scheme using meaningful
> identifiers from their native language, written in the usual way, is to
> me a matter of elemental fairness, and ought to be allowed in all cases.
> 


_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
