On Wed, 2009-02-18 at 18:48 -0500, John Cowan wrote:
> As the R6RS process's chief Unicode hound, I'd like to say a word or
> two about why I think Unicode matters. There are at least three kinds
> of reasons.

I'm one of those who have been critical of the particular impact that Unicode had on R6RS, so maybe it would be helpful to explain my complaints and to make clear what I'm not complaining about.

We agree: Scheme character and string primitives and library functions "should be designed from the ground up to [be able to] deal with text in a universal encoding."

I appreciate that: almost all popular criticisms of Unicode as a universal character system are wrong. Unicode is designed extremely, almost painfully, well. The correct architecture it uses is then well appointed with the actual list of abstract characters, arrived at by an impossibly difficult political process. Like Scheme, Unicode is a gem.

We agree: "Scheme programmers are themselves dealing with text when they write their programs, and if the repertoire of characters allowed in a program is non-universal, the result is an unfair disadvantaging of people who use another repertoire natively."

I additionally appreciate: With a large hat tip to you, not to embarrass you, R6 appears to get its support for Unicode "simply Right". Not in some vague, namby-pamby "oh, I know what 'the right thing' is when I see it" but in a very objective way: it's a faithful, basis-set complete reification of the logical model of Unicode, which is itself "simply Right" for reasons not worth rehearsing here. The Scheme community should look at this with pride when comparing themselves with, for example, the history of Java, Python, et al.

On the issue of whether string access is "O(1)": fogeddaboutit. RAM (main memory) itself is significantly far from O(1).
As I noted somewhere back there, and Clinger has looked at more seriously, and others have noticed elsewhere: "ropes", with wide-as-needed choices of code units for the leaf nodes, are about as good as it gets for any universal encoding; UTF-8 is just fine for many situations that don't need that level of performance but that prefer quick and easy implementation given available resources.

It's all really quite good. Except:

1) Should the standard allow a conforming implementation to support only a smaller character set in source texts? (I say "yes" for pragmatic reasons - I don't need all of Unicode or all the Scheme libraries in the world for a tiny embedded system, for example.)

2) Should the standard allow a conforming implementation to support a larger character set than Unicode? (I say "yes" for experience-based reasons. For example, you'll pry my bucky-bits from my cold dead hands.)

3) Should the standard allow a conforming implementation to treat a class of objects such as Unicode compound characters (an infinite character set)? (I am persuaded by what Dillenger has sketched - "yes" again.)

The character/string language in R6 is flawed mainly in that it fails to be artfully ambiguous enough to allow those possibilities even while speccing out proper Unicode support.

I have a sense that you want to use a standard to "force" implementers not to blow off Unicode support. I think that's philosophically and politically wrong.

There's no need to rehash these arguments right here, right now - you have some sense of how the arguments on my side of these debates go and we've all the time in the world to maybe get around to finding agreement or not. I'm just saying that these issues with the current Unicode support are *my* issues with it and I'm suggesting, not trying to prove, that they sum up everyone else's complaints, too.

Summing up: The R6 Unicode stuff is really, really good work. It's good for Scheme. It's good education for the Scheme community.
It's achingly textbook perfect beautiful for how, holding the Unicode standard in one hand and an assignment to design a programming language in the other, you should do it. It's fantastic. Except that it's not ambiguous enough and displays some mild "fascist" tendencies. The problems are easily fixed, though.

Regards,
-t

>
> 1) If a process must deal with text, it should be designed from the
> ground up to deal with text in a universal encoding, converting to local
> encodings only when required to interface with surrounding systems. It's
> been estimated that building in Unicode adds perhaps 20% to development
> cost, whereas retrofitting it adds about 100%. That's an "industrial"
> motive to support Unicode, and although the (rnrs unicode (6)) library
> doesn't come close to providing all that's needed for practical work,
> it does provide a useful core.
>
> 2) Scheme requires that there exist in the application domain strings
> which are constructed as sequences of characters. (I think that's a
> mistake: I'd rather have strings as primitives and understand characters
> to be a finite subset of short strings.) Having the significance and
> interpretation of characters differ from one implementation to the next
> is a needless kind of variation: in practice it means that portable
> programs must be confined to ASCII data. Breaking the historical link
> between characters and octets is something that should be done in the
> core whether or not anything else about Unicode is supported.
>
> 3) But most deeply, I believe, is the fact that Scheme programmers are
> themselves dealing with text when they write their programs, and if the
> repertoire of characters allowed in a program is non-universal, the result
> is an unfair disadvantaging of people who use another repertoire natively.
>
> I've been told that one of the main reasons that Java caught on so quickly
> in Japan is that it was the first mainstream language that required
> implementations to support meaningful Japanese identifiers written in
> the native script. Java's reserved identifiers of course had to remain
> in Latin script, but there are only a few of them compared to the vast
> number of identifiers in a Java program. More serious was the fact that
> the names of existing standard and non-standard Java libraries were and
> are typically Latin.
>
> In Scheme, of course, there are no reserved words, and macros permit
> arbitrary renamings so that a whole program could be valid Scheme
> even though not a single Latin-script identifier appeared outside the
> mapping macros. Allowing programmers to write Scheme using meaningful
> identifiers from their native language, written in the usual way, is to
> me a matter of elemental fairness, and ought to be allowed in all cases.
