On Aug 24, 2009, at 5:38 PM, Ray Dillinger wrote:

> On Mon, 2009-08-24 at 16:39 -0400, John Cowan wrote:
>
>> As you know, I'd like to see characters flushed from Scheme and all
>> other languages. That's not practical, though, given the high barriers
>> to removing IEEE Scheme features from small Scheme.
>
> I agree in principle; characters in Unicode do not behave in the
> well-ordered ways that made the distinction between characters and
> strings seem useful in IEEE Scheme. There was an unspoken
> assumption that we were talking exclusively about environments
> with ASCII-like encodings, which has turned out recently to be
> false.
>
> It would be better to abandon the idea of characters as separate
> from strings. What is a character, after all? It's a string of
> length one. And what consistent semantics are provided by our
> character-specific functions that aren't visibly redundant with
> the semantics of string functions? Approximately none. So yeah,
> there's a point here to be made about characters being a fundamentally
> flawed notion in the presence of unicode environments.
>
> In practice, I don't know if we can do this. It would break
> so much existing scheme code.

After thinking about this for a while, I'm convinced that there is value in having a tagged type to represent individual code points. I believe that the language (or is that "the language, working group 2"?) should provide a range of facilities for working with strings or text, suitable for uses ranging from writing new encoders and decoders to interactive editing and display functions that work with text at the grapheme cluster level.

At the highest level, the notion of a code point as something which stands alone seems a bit silly, but at the lowest level I believe it makes sense. It is the smallest unit of text that survives a round trip through encoding and decoding unchanged, which means that it is for all practical purposes indivisible. (I don't think half of a surrogate pair counts as a proper division of a code point, and it's actually a rather dangerous thing to have lying around.) It is logically distinguishable from an integer: while every code point can be uniquely mapped to an integer, not every integer can be mapped to a code point, and the operations defined on integers don't make sense on code points.

I'm also not convinced by the argument that a string of length one removes the need for a separate tagged representation for the units of which the string is composed. The most primitive facility provided by any decoder or encoder is a mapping between code points and sequences of bytes; when working at that level, I'd prefer to have a type with a disjoint predicate representing the well-defined input type I am receiving.

--
Brian Mastenbrook
http://brian.mastenbrook.net/

_______________________________________________
r6rs-discuss mailing list
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
