On 7 May 2022, Henrik K. spake thusly:
> There's lots of common headers that are basically huge base64 strings,
> creating stupid amounts of random Bayes tokens.
Honestly I'm wondering if a simpler way to deal with these might be to detect lengthy base64ed regions in headers (not actually that difficult), try to unbase64 them and use *that* for Bayes, probably as a new pseudo-header with a name derived from the old one, with the base64 content dropped from the tokenization of the old one.

Combine that with an extra check: "words" containing control characters (in the range 0x00-0x1f) are not tokenized. (Another check imposing a maximum length on things Bayes considers words might also be a good idea, but I'm not sure whether we do that already.)

It is clearly wrong to tokenize long base64 strings in any case, and we already decode these in body text: maybe we should start doing something similar for regions of headers, since this is such a common thing for non-spam to do these days.
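To make the idea concrete, here's a rough Python sketch of what I mean. It's just an illustration, not SpamAssassin code (which is Perl anyway): the 40-character run threshold, the token-length cap, the "X-Decoded-" pseudo-header name and the function names are all made up for the example.

    import base64
    import re

    # Illustrative thresholds; real values would need tuning.
    BASE64_RUN = re.compile(r'[A-Za-z0-9+/]{40,}={0,2}')  # "lengthy" base64 run
    CTRL_CHARS = re.compile(r'[\x00-\x1f]')
    MAX_TOKEN_LEN = 40                                     # assumed cap on token length

    def split_header_for_bayes(name, value):
        """Yield (header_name, text) pairs for Bayes tokenization:
        the original header with long base64 runs dropped, plus a derived
        pseudo-header holding whatever those runs decoded to."""
        decoded_parts = []

        def decode_run(match):
            run = match.group(0)
            # Trim to a multiple of 4 so b64decode doesn't choke on padding.
            trimmed = run[: len(run) - (len(run) % 4)]
            try:
                decoded_parts.append(
                    base64.b64decode(trimmed).decode('utf-8', 'ignore'))
            except Exception:
                pass          # undecodable junk: just drop it
            return ' '        # remove the run from the original header's text

        stripped = BASE64_RUN.sub(decode_run, value)
        yield name, stripped
        if decoded_parts:
            yield 'X-Decoded-' + name, ' '.join(decoded_parts)

    def bayes_words(text):
        """Candidate tokens: skip "words" with control chars or silly lengths."""
        for word in text.split():
            if CTRL_CHARS.search(word) or len(word) > MAX_TOKEN_LEN:
                continue
            yield word

    # Usage: for each header, tokenize both the stripped original and the
    # decoded pseudo-header instead of the raw base64 blob.
    #   for hdr, text in split_header_for_bayes('X-Foo', raw_value):
    #       tokens = list(bayes_words(text))

The point is simply that the huge base64 blob never reaches the tokenizer as-is: the decoded text (which may actually carry useful signal) gets tokenized under a derived header name, and anything that still looks like binary noise gets filtered by the control-character and length checks.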
