On Sun, 2009-09-20 at 08:56 -0700, Ray Dillinger wrote:
> On Sat, 2009-09-19 at 16:48 -0700, Thomas Lord wrote:
>
> > [....]
>
> If you are appending and taking substrings, the codepoint level
> is one of several wrong choices to make about where to allow
> string divisions, for exactly this reason.
>
> What human beings think of as characters, are represented in unicode
> by a base codepoint plus nondefective sequence of combining
> modifiers and variant selectors, each of which is also a codepoint.

Certainly.  I've long been attracted to your notion to have a
string-type that works in the way you've been describing for
quite a while.  In particular, I think it is very important
that the definition of a "character" in Scheme not preclude
the possibility of a string type of the sort you describe.

However, as a person who likes "systems programming" and
writing regexp matchers and even implementing basic Unicode
algorithms: I really want choices.  I want to be able to have
encoding unit, codepoint, and full character strings (not
necessarily all in the same string-like object).  And I want
to be able to define some generics that work on all of these
(where that makes sense) as well as procedures that require
just a particular kind of string.

> The sequence is usually length zero, but since you're talking about
> renormalizing after divisions, you're already talking about cases
> where the sequence is nonempty.
>
> If you allow division of strings on codepoint boundaries which
> are not also character boundaries, you can "renormalize" but in
> this case the renormalization operation makes no semantic sense.
> You have created characters that were not there, you have
> vanished characters that were there, you have changed characters
> into different characters, and so on.  These are not sensible
> operations; these are bugs.
>
> If you restrict string division to character boundaries, then
> you have no need to "renormalize" because by not dividing strings
> in mid-character or joining strings that start or end with partial
> characters, you never create a denormalized string.  Further,
> the characters on each side of the division are the same
> characters that were there in the undivided string, so the
> user does not experience this class of inconsistencies and
> bugs.
>
> This is why I believe that the best semantics for string-length,
> indexes in strings, etc, is that they should count characters
> rather than codepoints.  And this is one of the things that I
> believed then and still believe now that R6RS got wrong.

That's a reasonable view when a string is being regarded
primarily as human text to be manipulated in linguistically
significant ways.  Strings as a data structure are more
general than that, though.

-t

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
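
Editor's illustration (not part of the original message): a minimal Python sketch of the three views of a string discussed in this thread (encoding units, codepoints, characters), and of what happens when a string is divided at a codepoint boundary that is not a character boundary. The helper names `split_characters` and `split_at_character` are hypothetical, and the grouping rule (a base codepoint plus any following codepoints whose general category starts with "M") is a simplifying assumption, not a full implementation of Unicode grapheme-cluster segmentation.

```python
import unicodedata


def split_characters(s):
    """Group each base codepoint with the combining marks that follow it.

    Simplified assumption: any codepoint whose general category starts
    with "M" (Mn, Mc, Me) attaches to the preceding cluster.
    """
    clusters = []
    for cp in s:
        if clusters and unicodedata.category(cp).startswith("M"):
            clusters[-1] += cp
        else:
            clusters.append(cp)
    return clusters


def split_at_character(s, n):
    """Divide s after its first n characters (clusters), never mid-character."""
    clusters = split_characters(s)
    return "".join(clusters[:n]), "".join(clusters[n:])


# "café" spelled with a decomposed accent: 'e' followed by U+0301.
word = "cafe\u0301"

encoding_units = len(word.encode("utf-8"))  # 6 UTF-8 bytes
codepoints = len(word)                      # 5 codepoints
characters = len(split_characters(word))    # 4 characters

# Division at a codepoint boundary that falls inside the last character.
head, tail = word[:4], word[4:]
# head is now plain "cafe": the accented character has vanished.
# Appending the orphaned mark to another string and renormalizing
# creates a character that was in neither original string.
created = unicodedata.normalize("NFC", "a" + tail)  # U+00E1

# Division at a character boundary keeps the accented character whole.
safe_head, safe_tail = split_at_character(word, 3)
```

Here `head` ends in an unaccented "e" and `created` is a precomposed "á", which matches the "vanished" and "created" characters described in the quoted text, while `safe_tail` still holds the original "e" plus its accent. Keeping all three counts available side by side is one way to read the request for choices among encoding-unit, codepoint, and character strings.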
