J.B. Moreno wrote:

> 152,000 have at least one 8bit character, of which
> 26 match the utf-8 rule, of which
> 17 appear to be false matches
>
> So, 26/152,000 = .017%; 17/152,000 = .011%
No, no, no; the rate of false positives (again assuming
that one knows the real charset) is the ratio of the
false matches to the total matching the utf-8 rule, or
17 / 26  which is greater than 65%.

What you have given is the percentage of presumed utf-8
use (*including* false positives) out of all untagged,
unencoded use, and the ratio of those presumed false
positives to a total number of header lines, which is
a meaningless ratio.  One thing that is clear is that
of articles with 8-bit header content, *at* *least*
99.9% are using something other than utf-8.  Under such
conditions any attempt to ram utf-8 down the throats of
those users as a "blessed" charset is going to have
predictable results (i.e. the attempt will fail). On the
other hand something like 84% are compliant with RFC 1036
etc., and the 15% of non-compliant fields could conceivably
be corrected by fixing a few broken UAs.  It's reasonable
to believe that a small but significant minority of bad
articles can be corrected; it's inconceivable that 99.9%
will magically switch to what less than 0.02% are using.
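The denominator argument above can be checked with trivial arithmetic. A sketch in Python (Perl was used for the regexp in this thread, but the arithmetic is the same), using the 152,000 / 26 / 17 figures quoted above:

```python
# Figures quoted above: of 152,000 headers with 8-bit content,
# 26 matched the utf-8 rule, and 17 of those were false matches.
total_8bit = 152_000
matched_utf8 = 26
false_matches = 17

# Wrong denominator: false matches over *all* 8-bit headers.
# This measures how rare presumed utf-8 is, not how reliable the test is.
wrong_ratio = false_matches / total_8bit

# Right denominator: false matches over everything the test flagged.
# This is the false-positive rate of the utf-8 test itself.
false_positive_rate = false_matches / matched_utf8

print(f"wrong: {wrong_ratio:.3%}   right: {false_positive_rate:.1%}")
```

The first ratio comes out around .011%, the second above 65% — same raw counts, very different conclusions about the test's reliability.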

> Out of 91610 subject headers containing 8-bit (just under a day's worth),
> only 49 matched this (Perl) regexp:
> -snip-
>
> 31 of them were a binary series with an English subject
> line in which the word "Can't" had been spelled with U+00B4 (acute
> accent)
>
> 49-31=18 (I'll assume that you'll consider the above as sufficient
> identification of those 31 at least) and 18/91610=.019%.
Again you're using the wrong denominator: 18 / 49 = 36.7%.
Whether or not the use in the 31 cases cited is sufficient
identification depends on the specific header field content.
If the single character with hex value 0xb4 was presented
as a two-octet sequence 0xc2 0xb4, that's one thing; if it
was a single octet with value 0xb4, it could be in any
number of charsets, including a number of iso-8859 variants
and a number of MS windows- variants.  Whether or not that
group of articles is significant (or a bunch of related
follow-ups) is another matter entirely.
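The distinction between the two presentations of U+00B4 is easy to demonstrate. A sketch in Python (rather than the Perl used for the regexp above):

```python
acute = "\u00b4"  # U+00B4 ACUTE ACCENT

# As utf-8, U+00B4 is the two-octet sequence 0xc2 0xb4.
utf8_form = acute.encode("utf-8")
assert utf8_form == b"\xc2\xb4"

# As a single octet 0xb4 it is ambiguous: iso-8859-1 and
# windows-1252 (among others) both put the acute accent there.
single_octet = b"\xb4"
assert single_octet.decode("iso-8859-1") == acute
assert single_octet.decode("cp1252") == acute

# A lone 0xb4 is not valid utf-8 at all -- it is a continuation
# byte with no lead byte -- so it cannot match a utf-8 rule.
try:
    single_octet.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
assert not valid_utf8
```

So only the two-octet form can count as a utf-8 match at all; the single-octet form could have come from any of several charsets.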

> So we have a range of between .011% all the way up to .019% for false
> positives.
No, that's a range of meaningless ratios, not a
percentage of false positives, which for the two
sets of figures given is substantial: around 50%,
or about as good as flipping a fair coin.
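One way to read that "around 50%" is as the pooled rate across the two samples quoted in this thread; a quick sketch:

```python
# Two samples quoted above, as (false matches, total matching the rule):
#   sample 1: 17 false out of 26 matches  -> ~65%
#   sample 2: 18 false out of 49 matches  -> ~37%
samples = [(17, 26), (18, 49)]

per_sample = [false / matched for false, matched in samples]
pooled = sum(f for f, _ in samples) / sum(m for _, m in samples)

print([f"{r:.1%}" for r in per_sample], f"pooled: {pooled:.1%}")
```

The pooled figure is 35/75, i.e. roughly fair-coin territory, which is what makes the test no better than guessing.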

> It's quite clear on two points: raw utf8 usage is extremely low, false
> positives on checks for utf8 is likewise extremely low.
Yes, it's clear that raw utf-8 usage is extremely low.  But
the assumption that any octet sequence which happens to match
a valid utf-8 pattern really is utf-8 is as likely to be wrong
as right.  Something that provides no better assurance than
random guessing can hardly be claimed to be reliable.  And the
data are for Subject header fields only, in newsgroups not
excluded from the analysis (it's not at all clear why the
*Subject* fields of articles that happened to be posted to a
particular set of newsgroups were excluded).
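How a short non-utf-8 header can nonetheless pass a utf-8 check is easy to show. A sketch (the sample subject is hypothetical, and the check here is Python's built-in decoder standing in for the snipped Perl regexp):

```python
def looks_like_utf8(octets: bytes) -> bool:
    """Heuristic under discussion: accept anything that decodes as utf-8."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical latin-1 subject line containing the octets 0xc3 0xa9,
# which are two characters in latin-1 but also the valid utf-8
# encoding of a single e-acute.
subject = "Re: r\xc3\xa9sum\xc3\xa9".encode("latin-1")

# The check passes even though the sender meant latin-1: a false positive.
assert looks_like_utf8(subject)

# Short runs of 8-bit text offer few octets to disambiguate, which is
# why a high false-positive rate on header fields is to be expected.
```

The fewer 8-bit octets a field contains, the more likely an accidental match becomes, which is exactly the "small bits of text" problem described below.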

And it's clear that those quoting < 0.1 % "false positive"
ratios are quoting the wrong numbers.  Which is not surprising
given a) the religious fervor involved, and b) the reality of
small bits of text, where one would expect a high false positive
rate (anything under 20% would be suspect).
