On Sat, 2009-09-19 at 16:48 -0700, Thomas Lord wrote:
> Yes, but when you are building a string-like mutable
> type, appending and taking substrings, suddenly you
> are renormalizing on every operation.
If you are appending and taking substrings, the codepoint level
is one of several wrong choices to make about where to allow
string divisions, for exactly this reason.
What human beings think of as characters are represented in Unicode
by a base codepoint plus a nondefective sequence of combining
modifiers and variation selectors, each of which is also a codepoint.
The sequence usually has length zero, but since you're talking about
renormalizing after divisions, you're already talking about cases
where the sequence is nonempty.
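
To make the base-plus-sequence structure concrete, here is a rough sketch in Python rather than Scheme. The helper name `characters` is made up for illustration, and it is a simplification: it only looks at the canonical combining class, so it does not handle variation selectors or the other cases that full grapheme segmentation has to deal with.

```python
import unicodedata

def characters(s):
    """Group codepoints into base-plus-combining-sequence chunks.

    Hypothetical helper, simplified: a codepoint with a nonzero
    canonical combining class is attached to the preceding chunk.
    """
    chunks = []
    for cp in s:
        if chunks and unicodedata.combining(cp) != 0:
            chunks[-1] += cp      # modifier: extend the current character
        else:
            chunks.append(cp)     # base codepoint: start a new character
    return chunks

# "resume" with two acute accents, written with combining U+0301.
decomposed = "re\u0301sume\u0301"
for ch in characters(decomposed):
    print([hex(ord(cp)) for cp in ch])
```

Each inner list printed is one character as a person would count it; the accented ones contain two codepoints.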
If you allow division of strings on codepoint boundaries which
are not also character boundaries, you can "renormalize" but in
this case the renormalization operation makes no semantic sense.
You have created characters that were not there, you have
vanished characters that were there, you have changed characters
into different characters, and so on. These are not sensible
operations; these are bugs.
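
A small sketch of the failure, again in Python for convenience, using only the standard `unicodedata` module. The strings are invented for the example.

```python
import unicodedata

# "e" + combining acute accent + "x": three codepoints, two characters.
word = "e\u0301x"

# Divide on a codepoint boundary that is not a character boundary.
head, tail = word[:1], word[1:]

# head is now a bare "e": the accented character has changed
# into a different character.
print(repr(head))

# tail starts with an orphaned accent. Append it to an unrelated
# string and renormalize, and the accent attaches to the "a".
joined = unicodedata.normalize("NFC", "a" + tail)
print(repr(joined))
```

The renormalization succeeds, but the result contains an accented "a" that appeared in neither original string.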
If you restrict string division to character boundaries, then
you have no need to "renormalize" because by not dividing strings
in mid-character or joining strings that start or end with partial
characters, you never create a denormalized string. Further,
the characters on each side of the division are the same
characters that were there in the undivided string, so the
user does not experience this class of inconsistencies and
bugs.
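
As a sketch of what division restricted to character boundaries might look like (Python again; `characters` and `split_at` are hypothetical names, and the segmentation is the simplified combining-class rule, not full grapheme segmentation):

```python
import unicodedata

def characters(s):
    """Simplified segmentation: attach combining marks to the prior base."""
    chunks = []
    for cp in s:
        if chunks and unicodedata.combining(cp) != 0:
            chunks[-1] += cp
        else:
            chunks.append(cp)
    return chunks

def split_at(s, n):
    """Divide s after its first n characters, never inside a character."""
    chars = characters(s)
    return "".join(chars[:n]), "".join(chars[n:])

word = unicodedata.normalize("NFD", "r\u00e9sum\u00e9")
left, right = split_at(word, 2)

# Both halves are still in the normalization form we started with,
# and the characters on each side are the ones that were there before.
print(unicodedata.normalize("NFD", left) == left)
print(unicodedata.normalize("NFD", right) == right)
print(characters(left) + characters(right) == characters(word))
```

For this input no renormalization step is needed after the split, and rejoining the halves gives back the original string.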
This is why I believe that the best semantics for string-length,
string indexes, etc., is that they should count characters
rather than codepoints. And this is one of the things that I
believed then, and still believe now, R6RS got wrong.
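
To illustrate the difference, here is a sketch of a character-counting length in Python. The name `char_length` is hypothetical, and the counting rule (count codepoints whose canonical combining class is zero) is a simplification that assumes the string does not begin with a stray combining mark.

```python
import unicodedata

def char_length(s):
    """Count characters by counting base codepoints (simplified rule)."""
    return sum(1 for cp in s if unicodedata.combining(cp) == 0)

composed = "r\u00e9sum\u00e9"                          # precomposed accents
decomposed = unicodedata.normalize("NFD", composed)    # combining accents

# Codepoint length depends on the normalization form...
print(len(composed), len(decomposed))

# ...while the character count is the same for both spellings.
print(char_length(composed), char_length(decomposed))
```

With codepoint semantics the same text has two different lengths depending on how it happens to be encoded; with character semantics it has one.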
Bear