Re: [r6rs-discuss] Strings as codepoint-vectors: bad

Thomas Lord Thu, 15 Mar 2007 22:26:12 -0800

(Mostly just unpacking Cowan's points a bit.)

Jason Orendorff wrote:

John Cowan wrote:

Jason Orendorff scripsit:
> I think people who favor strings-as-codepoint-vectors must also think
> that breaking a surrogate pair is really bad.  But even with a
> codepoint-centric view of text you can unwittingly break a grapheme
> cluster, which amounts to the same sort of bug--it can lead to garbled
> text--and which is probably much *more* common in practice.  I never
> hear anyone complain about that.


I absolutely disagree that these two problems are analogous at all:


I guess we just have to disagree.  Both cases involve a character
being botched because software broke the data at an inappropriate
boundary.  To me, they're not just analogous; they're practically
identical.  I'm trying to imagine how I would explain the distinction
to my wife.  Drawing a blank here.


You'd have to explain the tower of representations.  You could
compare a single wrod with a typo in it  to a
sentence or phrase out of with some words order -- then explain
the analogy to encoding and character composition.

As usual in this area, it is hard to sort (for example) your and
Cowan's discussion out because of unqualified uses of the word
"character" and not enough precision in distinguishing layers of the
representation tower for a technical audience.

But if we did sort that out, your main point is along the lines of
saying that similar errors in low-level string manipulation (off-by-one
errors and the like) cause both bugs and, either way, you get garbage.
Cowan's point is that the two bugs, even if the same coding errors
result in them, have different impacts on basic Unicode algorithms.
For example, you can translate a garbled grapheme cluster to UTF-8
just fine but, strictly speaking, not so an isolated surrogate -- so
presumably systems will tend to degrade more gracefully if they only
have one of those two kinds of bugs.
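
To make that concrete, here is a rough sketch in Python (chosen only for
illustration; nothing in this thread depends on it).  The helper name
encodes_to_utf8 is made up for the example, and the only library call is
the standard str.encode.  It compares an orphaned combining mark (a
broken grapheme cluster) with a lone high surrogate:

```python
def encodes_to_utf8(text: str) -> bool:
    """Return True if text survives a strict UTF-8 encode (hypothetical helper)."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# "e" + COMBINING ACUTE ACCENT: one grapheme cluster, two code points.
cluster = "e\u0301"
# Break the cluster: the combining mark is left without its base letter.
orphaned_mark = cluster[1:]
# High half of the UTF-16 surrogate pair for U+1D000, standing alone.
lone_surrogate = "\ud834"

print("orphaned combining mark encodes:", encodes_to_utf8(orphaned_mark))
print("lone surrogate encodes:", encodes_to_utf8(lone_surrogate))
```

The orphaned mark is still a sequence of valid scalar values, so a strict
encoder accepts it; the lone surrogate is not, so the strict encoder
refuses it.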

Separating surrogate pairs is (a) UTF-16 specific and (b) leaves the
result uninterpretable.  Gumming up a grapheme cluster is more like
an off-by-one error in inserting a character: the output is garbled
but not garbage.


What he said.

-t

Most systems recover from the former error by losing the one broken
character (some systems replace it with '?'; some render a blank box)
and interpreting everything else just fine.  I don't know what you
mean by "uninterpretable".

Most systems recover from the latter error by silently discarding the
orphaned combining marks.
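
A small sketch of the first kind of recovery, assuming Python's built-in
UTF-16 codec with errors="replace" as a stand-in for "most systems"
(other decoders may behave differently).  The sample text and variable
names are invented for the example:

```python
# "a", U+1D000 (a Byzantine musical symbol, outside the BMP), "b".
text = "a\U0001D000b"

# In UTF-16 this is four 16-bit units: a, high surrogate, low surrogate, b.
units = text.encode("utf-16-le")

# Simulate the bug: drop the low surrogate (bytes 4 and 5).
broken = units[:4] + units[6:]

# A lenient decoder substitutes U+FFFD for the damaged character
# and carries on with the rest of the data.
recovered = broken.decode("utf-16-le", errors="replace")

print(ascii(recovered))
```

A strict decode of the same bytes (the default errors="strict") raises
UnicodeDecodeError instead, which is roughly the gap between "loses one
character" and "uninterpretable".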

(shrug)   I don't see how the first one is more annoying than slow
software, while the second one is negligible--especially given that
surrogate pairs are extremely rare in practice (few people's names
contain Byzantine musical symbols or Kharoṣṭhī letters) compared
to, you know, accents.

-j
------------------------------------------------------------------------

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
