On Sun, 16 Feb 2003, Bruce Lilly wrote:
> >> 152,000 have at least one 8bit character, of which
> >> 26 match the utf-8 rule, of which
> >> 17 appear to be false matches
> > So, 26/152,000 = .017%  17/152,000 = .011%
> No, no, no; the rate of false positives (again assuming
> that one knows the real charset) is the ratio of the
> false matches to the total matching the utf-8 rule, or
> 17 / 26, which is greater than 65%.
That's correct, and that pretty much shoots down the use of a test to
determine if something is UTF-8. The test can prove that text is not
UTF-8 (assuming that the UTF-8 wasn't somehow damaged in transit), but
it does not reliably prove that text is UTF-8.

Furthermore, the potential of damage in transit is very real. Consider
the impact of line breaking, or of passage through gateways which apply
well-meaning but incorrect transforms. Therefore, there are not just
false positives (at an unacceptable 65% rate) but also false negatives.

Raw UTF-8 in headers seems to be a complete non-starter. I recommend
that this idea be abandoned in favor of an interoperable means, such as
that proposed by Kohn. It doesn't have to be Kohn's document (I neither
know nor care what religious or personality issues may be involved),
but at least he's on the right track.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
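[Editor's illustration, not part of the original thread.] The false-positive mechanism under discussion can be sketched in a few lines of Python. The "utf-8 rule" here is modeled as the naive check "do the bytes decode as UTF-8?"; the byte values are assumptions chosen to show that some Latin-1 text happens to form valid UTF-8 sequences and so passes the test even though it is not UTF-8:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Naive test: do the bytes decode as valid UTF-8?"""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# "Ä©" in Latin-1 is the byte pair C4 A9 -- which is also the valid
# UTF-8 encoding of U+0129 (LATIN SMALL LETTER I WITH TILDE).
latin1_text = "Ä©".encode("latin-1")      # b'\xc4\xa9'
assert looks_like_utf8(latin1_text)       # false positive: not UTF-8, yet it passes

# The test can still prove "not UTF-8": most Latin-1 sequences fail,
# e.g. a lone E9 byte is an incomplete UTF-8 sequence.
assert not looks_like_utf8("café".encode("latin-1"))   # b'caf\xe9'
```

This is exactly why the false-positive rate must be computed over the strings that *match* the rule (17/26), not over all 8-bit strings: the rule is only ever consulted after a match.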