On Sun, 2009-09-20 at 08:56 -0700, Ray Dillinger wrote:
> On Sat, 2009-09-19 at 16:48 -0700, Thomas Lord wrote:
> 
> > [....]
> 
> If you are appending and taking substrings, the codepoint level 
> is one of several wrong choices to make about where to allow
> string divisions, for exactly this reason.
> 
> What human beings think of as characters are represented in Unicode
> by a base codepoint plus a nondefective sequence of combining 
> modifiers and variant selectors, each of which is also a codepoint.

Certainly.  I've long been attracted to your
notion of a string type that works the way
you've been describing for quite a while.

In particular, I think it is very important that
the definition of a "character" in Scheme not 
preclude the possibility of a string type of the 
sort you describe.

However, as a person who likes "systems programming,"
writing regexp matchers, and even implementing basic
Unicode algorithms, I really want choices.  I want to be
able to have code-unit, codepoint, and full-character
strings (not necessarily all in the same string-like object).
And I want to be able to define some generics that work on
all of these (where that makes sense) as well as procedures
that require just a particular kind of string.
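
To make the three granularities concrete, here is a rough sketch in
Python rather than Scheme.  The names code_units, codepoints,
characters, and generic_length are hypothetical, chosen only for
illustration.  The "characters" view is an approximation: it groups a
base codepoint with any following codepoints that have a nonzero
canonical combining class, which is not full Unicode grapheme
segmentation (variation selectors, for instance, are not handled).

```python
import unicodedata


def code_units(text: str) -> list[int]:
    """UTF-8 code units (bytes) of the text."""
    return list(text.encode("utf-8"))


def codepoints(text: str) -> list[str]:
    """One item per Unicode codepoint."""
    return list(text)


def characters(text: str) -> list[str]:
    """Approximate full characters: a base codepoint plus any following
    codepoints with a nonzero canonical combining class.
    Assumption: this is a simplification, not complete segmentation."""
    clusters: list[str] = []
    for cp in text:
        if clusters and unicodedata.combining(cp) != 0:
            clusters[-1] += cp  # attach the combining mark to its base
        else:
            clusters.append(cp)  # start a new character
    return clusters


def generic_length(text: str, view) -> int:
    """A 'generic' that works on any of the three views."""
    return len(view(text))


sample = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'
print(generic_length(sample, code_units))  # 4
print(generic_length(sample, codepoints))  # 3
print(generic_length(sample, characters))  # 2
```

The same text has three different lengths depending on which kind of
string you ask for, which is why a single string-length cannot serve
every use.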




> The sequence is usually length zero, but since you're talking about
> renormalizing after divisions, you're already talking about cases 
> where the sequence is nonempty. 
> 
> If you allow division of strings on codepoint boundaries which 
> are not also character boundaries, you can "renormalize" but in 
> this case the renormalization operation makes no semantic sense. 
> You have created characters that were not there, you have 
> vanished characters that were there, you have changed characters 
> into different characters, and so on.  These are not sensible 
> operations; these are bugs.
> 
> If you restrict string division to character boundaries, then 
> you have no need to "renormalize" because by not dividing strings 
> in mid-character or joining strings that start or end with partial
> characters, you never create a denormalized string. Further, 
> the characters on each side of the division are the same 
> characters that were there in the undivided string, so the 
> user does not experience this class of inconsistencies and 
> bugs.
> 
> This is why I believe that the best semantics for string-length, 
> indexes in strings, etc, is that they should count characters 
> rather than codepoints.  And this is one of the things that I 
> believed then and still believe now that R6RS got wrong.

That's a reasonable view when a string is being regarded
primarily as human text to be manipulated in linguistically
significant ways.   Strings as a data structure are more 
general than that, though.
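
For reference, the mid-character division described in the quoted
text can be sketched as follows, in Python rather than Scheme.  The
sample strings and the helper name nfc are assumptions made for
illustration; only the standard unicodedata module is used.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# 'n', 'e', COMBINING ACUTE ACCENT: two characters, three codepoints.
original = "ne\u0301"

# Divide on a codepoint boundary that falls inside the second character.
left, right = original[:2], original[2:]  # "ne" and a bare "\u0301"

# The left piece now ends in plain 'e': the accented character has been
# changed into a different character.
print(left == "ne")  # True

# Appending the orphaned accent to other text and renormalizing creates
# a character that was never in the original string.
joined = nfc("a" + right)
print(joined == "\u00e1")  # True: 'a' with acute accent

# Dividing on the character boundary instead keeps both characters
# intact, and normalizing the pieces separately agrees with normalizing
# the whole string.
first, second = original[:1], original[1:]  # "n" and "e\u0301"
print(nfc(first) + nfc(second) == nfc(original))  # True
```

A codepoint-level string type still needs the first kind of division
to be possible, for example when implementing normalization itself.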

-t



_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
