Re: [r6rs-discuss] Why Unicode matters

Ray Dillinger Tue, 03 Mar 2009 01:56:04 -0800

On Wed, 2009-02-18 at 18:48 -0500, John Cowan wrote:
> As the R6RS process's chief Unicode hound, I'd like to say a word or
> two about why I think Unicode matters.  There are at least three kinds
> of reasons.
> 
> 1) If a process must deal with text, it should be designed from the
> ground up to deal with text in a universal encoding, converting to local
> encodings only when required to interface with surrounding systems.  It's
> been estimated that building in Unicode adds perhaps 20% to development
> cost, whereas retrofitting it adds about 100%.


I think that 20% is too high a cost to bear when one is working on
code that one already knows will *not* be deployed widely.  A typical
use for a Scheme program is a one-time format conversion of a large
database.  Once it's done, the program gets erased.  Why should I
spend an extra 20% effort when I know the target database does not
and will not use Unicode?

> That's an "industrial"
> motive to support Unicode, and although the (rnrs unicode (6)) library
> doesn't come close to providing all that's needed for practical work,
> it does provide a useful core.

The useful core should be *available,* true.  But it should also be
possible to write Scheme programs without using it.  It would be good,
IMO, to define a set of functions that a character library must
provide, so that it's possible for an implementation to define multiple
character libraries and for users to choose one to include.

> 2) Scheme requires that there exist in the application domain strings
> which are constructed as sequences of characters.  (I think that's a
> mistake: I'd rather have strings as primitives and understand characters
> to be a finite subset of short strings.)  Having the significance and
> interpretation of characters differ from one implementation to the next
> is a needless kind of variation: in practice it means that portable
> programs must be confined to ASCII data.  

One does not address needless variation by imposing needless uniformity.
It is *important* to the language that it should be possible to make 
a useful conforming implementation using the local environment's native 
character set.  In some cases that's unicode, so it's worthwhile for the
standard to speak about what a unicode implementation ought to look
like.  But in some cases it's not.  There are still companies publishing
phone books that use IBM mainframes with an EBCDIC encoding; requiring
unicode semantics for all the functions in the scheme language makes a 
conforming implementation utterly useless in their environments. So it's
worthwhile for the standard to allow other character sets too. 

> Breaking the historical link
> between characters and octets is something that should be done in the
> core whether or not anything else about Unicode is supported.

Supporting octets (as opposed to characters) is something that has
needed doing for a long time.  Proper support for octets should have
allowed higher-level concepts like characters and strings to be moved
out of the core and into libraries, so that people could define
libraries for various character sets and programmers could choose
which library - Unicode or other - to load.  Why didn't it?

> 3) But most deeply, I believe, is the fact that Scheme programmers are
> themselves dealing with text when they write their programs, and if the
> repertoire of characters allowed in a program is non-universal, the result
> is an unfair disadvantaging of people who use another repertoire natively.

This argument is political, not technical. 

In practice I agree with it; but I don't agree that unicode is a
truly universal encoding.  Unicode is a good choice, but I don't 
see this as a situation where there has to be a single choice.  

Scheme syntax _requires_ the parentheses, a very few other
punctuation characters, Arabic digits for its numeric syntax, and
Latin letters to spell the names initially bound to its defined
procedures.  Although the standard may and probably ought to
recommend Unicode semantics and define what a conforming Unicode
implementation looks like, I do not believe that the standard
should require more of the character set than that it have the
characters Scheme syntax requires.

It should be possible to implement a conforming Scheme system
using any encoding that has the characters Scheme's syntax
requires.  The standard should certainly *allow* conforming
Unicode implementations (R5RS didn't, and R6 was supposed to fix
that) but it should not require Unicode semantics in environments
where Unicode isn't the local machine's native encoding.

More than that, it should be possible to load one's *choice* of 
character encoding libraries, written in scheme, now that we 
have binary octets in the core of the language. 
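
As a rough illustration of the "choice of character library over
octets" idea, here is a minimal sketch.  It is written in Python rather
than Scheme only because Python ships an EBCDIC codec ("cp037"); the
names CharLib and make_charlib are hypothetical and do not come from
any Scheme standard or library.  Each "character library" is just a
pair of procedures between octets and text, and the program picks one.

```python
from typing import Callable, NamedTuple


class CharLib(NamedTuple):
    """Hypothetical minimal interface a character library must provide."""
    name: str
    decode: Callable[[bytes], str]   # octets -> text
    encode: Callable[[str], bytes]   # text -> octets


def make_charlib(codec: str) -> CharLib:
    """Build a character library on top of one of Python's built-in codecs."""
    return CharLib(
        name=codec,
        decode=lambda octets: octets.decode(codec),
        encode=lambda text: text.encode(codec),
    )


# Two interchangeable libraries: EBCDIC (code page 037) and UTF-8.
EBCDIC = make_charlib("cp037")
UTF8 = make_charlib("utf-8")

# Program text that uses only the characters Scheme syntax itself needs.
source = "(define (square x) (* x x))"

ebcdic_octets = EBCDIC.encode(source)
utf8_octets = UTF8.encode(source)

# Same characters, different octets; each library round-trips its own.
print(ebcdic_octets != utf8_octets)
print(EBCDIC.decode(ebcdic_octets) == source)
print(UTF8.decode(utf8_octets) == source)
```

The point of the sketch is only that the core deals in octets, while
the meaning of those octets as characters is supplied by whichever
library the programmer chooses to load.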

                                Bear



_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
