On 7 May 2022, Henrik K. spake thusly:

> There's lots of common headers that are basically huge base64 strings,
> creating stupid amounts of random Bayes tokens.

Honestly I'm wondering if a simpler way to deal with these might simply
be to detect lengthy base64ed regions in headers (not actually that
difficult), try to unbase64 them and use *that* for Bayes, probably as a
new pseudo-header with a name derived from the old one, and with the
content dropped from the tokenization of the old one. Combine that with
an extra check: "words" containing control characters (in the range
0x0--0x1f) are not tokenized. (Maybe another check imposing a maximum
length on things Bayes considers words might be a good idea, but I'm not
sure if we do that already.)
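
To make that a bit more concrete, something like the rough Python
sketch below is what I have in mind. The header names, thresholds and
function names here are just made up for illustration, not anything
that exists in the codebase:

    import base64
    import re

    # A "lengthy" base64-looking run: 40+ chars from the base64 alphabet.
    # Threshold is arbitrary for this sketch.
    BASE64_RUN = re.compile(r'[A-Za-z0-9+/=]{40,}')
    CONTROL_CHARS = re.compile(r'[\x00-\x1f]')
    MAX_TOKEN_LEN = 40  # arbitrary cap on what Bayes treats as a "word"

    def split_base64_regions(header_name, header_value):
        """Return (stripped_value, pseudo_headers): the header with long
        base64 runs removed, plus decoded text under a derived name,
        e.g. 'X-Decoded-Authentication-Results' (name scheme is made up)."""
        pseudo = {}
        def _decode(match):
            blob = match.group(0)
            try:
                decoded = base64.b64decode(blob, validate=True) \
                                .decode('utf-8', 'replace')
            except (ValueError, UnicodeDecodeError):
                return ''   # undecodable: just drop it from tokenization
            pseudo.setdefault('X-Decoded-' + header_name, []).append(decoded)
            return ''       # drop the raw blob from the original header
        stripped = BASE64_RUN.sub(_decode, header_value)
        return stripped, pseudo

    def tokenizable_words(text):
        """Yield words Bayes should consider: no control characters,
        and nothing longer than MAX_TOKEN_LEN."""
        for word in text.split():
            if CONTROL_CHARS.search(word):
                continue
            if len(word) > MAX_TOKEN_LEN:
                continue
            yield word

The point of the per-header pseudo-header is that the decoded text stays
attributable to the header it came from, while the raw blob disappears
from tokenization of the original header entirely.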

It is clearly wrong to tokenize long base64 strings in any case, and
we already decode these in body text: maybe we should start doing
something similar for regions of headers, since this is such a common
thing for non-spam to do these days.
