On Sun, 16 Feb 2003, Bruce Lilly wrote:
> >>152,000 have at least one 8bit character, of which
> >>26 match the utf-8 rule, of which
> >>17 appear to be false matches
> > So, 26/152,000 = .017% and 17/152,000 = .011%
> No, no, no; the rate of false positives (again assuming
> that one knows the real charset) is the ratio of the
> false matches to the total matching the utf-8 rule, or
> 17 / 26  which is greater than 65%.

That's correct, and it pretty much shoots down the use of such a test to
determine whether something is UTF-8.  The test can prove that text is not
UTF-8 (assuming the UTF-8 wasn't somehow damaged in transit), but it
cannot reliably prove that text is UTF-8.
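To make the false-positive side concrete, here is a minimal sketch (my own
illustration, in Python; not Bruce's actual test) of the kind of check
under discussion: accept any byte string that happens to parse as
well-formed UTF-8.  Perfectly ordinary Latin-1 text can slip through it.

    def looks_like_utf8(data):
        """The heuristic under discussion: does the data parse as UTF-8?"""
        try:
            data.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False

    # True negative: a bare Latin-1 e-acute (0xE9) is not well-formed UTF-8.
    assert not looks_like_utf8(b'caf\xe9')

    # False positive: the Latin-1 pair "Ã©" (bytes 0xC3 0xA9) is
    # byte-for-byte identical to the UTF-8 encoding of e-acute, so Latin-1
    # text containing it passes the test even though it is not UTF-8.
    assert looks_like_utf8(b'caf\xc3\xa9')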

Furthermore, the potential for damage in transit is very real.  Consider
the impact of line breaking, or passage through gateways that apply
well-meaning but incorrect transforms.
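A sketch of the resulting false negative, under the assumption of an agent
that folds or truncates a header on a raw byte boundary: the split can land
inside a multibyte sequence, so text that genuinely was UTF-8 no longer
parses as UTF-8.  (The split point below is hypothetical, chosen to land
mid-sequence.)

    # Genuine UTF-8 header text; "Subject: r" is 10 bytes, so keeping the
    # first 11 bytes splits the two-byte e-acute (0xC3 0xA9) in half.
    header = 'Subject: r\u00e9sum\u00e9'.encode('utf-8')
    fragment = header[:11]

    try:
        fragment.decode('utf-8')
        print('fragment still parses as UTF-8')
    except UnicodeDecodeError:
        print('fragment no longer parses as UTF-8')  # this branch runs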

Therefore, there are not just false positives (at an unacceptable 65%
rate) but also false negatives.

Raw UTF-8 in headers seems to be a complete non-starter.  I recommend that
this idea be abandoned in favor of an interoperable means, such as that
proposed by Kohn.  It doesn't have to be Kohn's document (I neither know
nor care what religious or personality issues may be involved), but at
least he's on the right track.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
