John Cowan wrote:
> Jason Orendorff scripsit:
> > I think people who favor strings-as-codepoint-vectors must also think
> > that breaking a surrogate pair is really bad.  But even with a
> > codepoint-centric view of text you can unwittingly break a grapheme
> > cluster, which amounts to the same sort of bug--it can lead to garbled
> > text--and which is probably much *more* common in practice.  I never
> > hear anyone complain about that.
>
> I absolutely disagree that these two problems are analogous at all:

I guess we just have to disagree.  Both cases involve a character
being botched because software broke the data at an inappropriate
boundary.  To me, they're not just analogous; they're practically
identical.  I'm trying to imagine how I would explain the distinction
to my wife.  Drawing a blank here.

> Separating surrogate pairs is (a) UTF-16 specific and (b) leaves the
> result uninterpretable.  Gumming up a grapheme cluster is more like
> an off-by-one error in inserting a character: the output is garbled
> but not garbage.

Most systems recover from the former error by losing the one broken
character (some systems replace it with '?'; some render a blank box)
and interpreting everything else just fine.  I don't know what you
mean by "uninterpretable".
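
To make the first case concrete, here is a minimal Python sketch (an
illustration, not code from this thread).  It encodes a string as UTF-16,
cuts it between the two code units of a surrogate pair, and decodes each
half with the lenient "replace" error handler.  The helper name
split_utf16 and the sample string are assumptions chosen for the example;
the replacement behaviour shown is that of Python's own codec, and other
systems may substitute '?' or a blank box instead.

```python
def split_utf16(text, unit_index):
    """Cut text after unit_index UTF-16 code units; decode both halves leniently.

    Hypothetical helper for illustration only.
    """
    data = text.encode("utf-16-le")   # 2 bytes per code unit, no BOM
    cut = unit_index * 2              # convert code units to a byte offset
    left = data[:cut].decode("utf-16-le", errors="replace")
    right = data[cut:].decode("utf-16-le", errors="replace")
    return left, right


# 'a', then U+1D000 (a Byzantine musical symbol, which needs a surrogate
# pair in UTF-16), then 'b'.
sample = "a\U0001D000b"

# Code units: [a] [high surrogate] [low surrogate] [b]; cut after unit 2,
# i.e. between the two surrogates.
left, right = split_utf16(sample, 2)

print(repr(left), repr(right))
```

In this sketch only the one split character is lost; the 'a' and 'b' on
either side still decode normally.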

Most systems recover from the latter error by silently discarding the
orphaned combining marks.
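
The second case can be sketched the same way.  The Python below (again an
illustration, with a hypothetical helper name, drop_orphaned_marks) cuts a
string at a boundary that is fine for code points but falls inside a
grapheme cluster, then shows one way a receiver might discard the combining
mark left without a base character.  Whether real systems do exactly this
is not something the sketch establishes.

```python
import unicodedata


def drop_orphaned_marks(fragment):
    """Strip leading combining marks that have no base character before them.

    Hypothetical helper for illustration only.
    """
    i = 0
    # unicodedata.combining() returns a nonzero combining class for
    # combining marks such as U+0301.
    while i < len(fragment) and unicodedata.combining(fragment[i]) != 0:
        i += 1
    return fragment[i:]


# "cafes" with an accented e, spelled as 'e' + U+0301 COMBINING ACUTE ACCENT.
word = "cafe\u0301s"

# Cut after four code points: valid for a code point vector, but it
# separates the 'e' from its accent.
left, right = word[:4], word[4:]

print(repr(left), repr(right), repr(drop_orphaned_marks(right)))
```

Here the left half silently becomes an unaccented "cafe", and the right
half begins with a mark that has nothing to attach to.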

(shrug)   I don't see how the first one is more annoying than slow
software, while the second one is negligible--especially given that
surrogate pairs are extremely rare in practice (few people's names
contain Byzantine musical symbols or Kharoṣṭhī letters) compared
to, you know, accents.

-j
_______________________________________________
r6rs-discuss mailing list
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
