On Aug 24, 2009, at 5:38 PM, Ray Dillinger wrote:

> On Mon, 2009-08-24 at 16:39 -0400, John Cowan wrote:
>
>> As you know, I'd like to see characters flushed from Scheme and all
>> other languages.  That's not practical, though, given the high
>> barriers to removing IEEE Scheme features from small Scheme.
>
> I agree in principle; characters in Unicode do not behave in the
> well-ordered ways that made the distinction between characters and
> strings seem useful in IEEE Scheme. There was an unspoken
> assumption that we were talking exclusively about environments
> with ASCII-like encodings, which has turned out recently to be
> false.
>
> It would be better to abandon the idea of characters as separate
> from strings.  What is a character, after all?  It's a string of
> length one.  And what consistent semantics are provided by our
> character-specific functions that aren't visibly redundant with
> the semantics of string functions? Approximately none.  So yeah,
> there's a point here to be made about characters being a fundamentally
> flawed notion in the presence of Unicode environments.
>
> In practice, I don't know if we can do this.  It would break
> so much existing Scheme code.

After thinking about this for a while, I'm convinced that there is  
value to having a tagged type to represent individual code points. I
believe that the language (or is that "the language, working group
2"?) should provide a range of facilities for working with strings or
text, suitable for uses ranging from writing new encoders and decoders
to interactive editing and display functions that work with text at
the grapheme cluster level. At the highest
level, the notion of a code point as something which stands alone  
seems a bit silly, but at the lowest level I believe it makes sense.  
It is the smallest unit of text which survives a round trip through
encoding and decoding unchanged, which means that it is for all
practical purposes
indivisible. (I don't think half of a surrogate pair counts as a  
proper division of a code point, and it's actually a rather dangerous  
thing to have lying around.) It is logically distinguishable from an  
integer; while every code point can be uniquely mapped to an integer,  
not every integer can be mapped to a code point, and the operations  
defined on integers don't make sense on code points.
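
To make the distinction from integers concrete, here is a minimal sketch. It is written in Python rather than Scheme purely for illustration, and the names `CodePoint`, `to_integer`, and `is_code_point` are hypothetical, not part of any proposal. It assumes that surrogate halves are rejected, so the type holds what Unicode calls scalar values.

```python
class CodePoint:
    """Hypothetical tagged wrapper around one Unicode scalar value."""

    __slots__ = ("_value",)

    def __init__(self, value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            raise TypeError("a code point is built from an integer")
        if not 0 <= value <= 0x10FFFF:
            raise ValueError("outside the Unicode code space")
        if 0xD800 <= value <= 0xDFFF:
            # Assumption: half of a surrogate pair is not a valid value.
            raise ValueError("surrogate halves are not allowed")
        self._value = value

    def to_integer(self):
        # Every code point maps to exactly one integer.
        return self._value

    def __eq__(self, other):
        return isinstance(other, CodePoint) and self._value == other._value

    def __hash__(self):
        return hash(("CodePoint", self._value))

    def __repr__(self):
        return f"CodePoint(U+{self._value:04X})"


def is_code_point(obj):
    # Disjoint predicate: true only for CodePoint, never for ints or strings.
    return isinstance(obj, CodePoint)
```

Because the class defines no arithmetic, an expression such as `CodePoint(0x41) + 1` raises `TypeError`, while `CodePoint(0xD800)` and `CodePoint(0x110000)` are refused at construction. That is the sense in which the mapping to integers is one-way.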

I'm also not convinced by the argument that a string of length one  
removes the need for a separate tagged representation for the units of  
which the string is composed. The most primitive facility provided by  
any decoder or encoder is a mapping between code points and sequences  
of bytes; when working at that level, I'd prefer to have a type with a  
disjoint predicate representing the well-defined input type I am  
receiving.
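
As a rough sketch of that lowest level, the following Python (again chosen only for illustration) maps a single code point to its UTF-8 bytes and back. The names `CodePoint`, `encode_utf8`, and `decode_utf8` are hypothetical, and the small `CodePoint` class is defined here in minimal form so the example stands alone. Decoding leans on Python's built-in strict UTF-8 decoder rather than a hand-written one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodePoint:
    """Hypothetical tagged type holding one Unicode scalar value."""
    value: int

    def __post_init__(self):
        in_range = 0 <= self.value <= 0x10FFFF
        surrogate = 0xD800 <= self.value <= 0xDFFF
        if not in_range or surrogate:
            raise ValueError(f"not a Unicode scalar value: {self.value:#x}")


def encode_utf8(cp):
    """Map one CodePoint to its UTF-8 byte sequence."""
    # The disjoint predicate: bare integers and strings are refused.
    if not isinstance(cp, CodePoint):
        raise TypeError("encode_utf8 expects a CodePoint")
    v = cp.value
    if v < 0x80:
        return bytes([v])
    if v < 0x800:
        return bytes([0xC0 | (v >> 6), 0x80 | (v & 0x3F)])
    if v < 0x10000:
        return bytes([0xE0 | (v >> 12),
                      0x80 | ((v >> 6) & 0x3F),
                      0x80 | (v & 0x3F)])
    return bytes([0xF0 | (v >> 18),
                  0x80 | ((v >> 12) & 0x3F),
                  0x80 | ((v >> 6) & 0x3F),
                  0x80 | (v & 0x3F)])


def decode_utf8(data):
    """Map well-formed UTF-8 bytes to a list of CodePoint values."""
    # bytes.decode raises UnicodeDecodeError on malformed input.
    return [CodePoint(ord(ch)) for ch in data.decode("utf-8")]
```

With these definitions, `decode_utf8(encode_utf8(cp)) == [cp]` holds for any valid `cp`, and passing a bare integer to `encode_utf8` fails immediately instead of silently producing bytes.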

--
Brian Mastenbrook
[email protected]
http://brian.mastenbrook.net/

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
