On Nov 30, 2018, at 7:00 AM, Bill Cole <sausers-20150...@billmail.scconsult.com> wrote:
>
>> Since HTML is already getting rendered to text, then perhaps the conversion
>> code should strip (literally, just delete) any zero-width characters during
>> this conversion? That should make normal body rules, and Bayes, function
>> properly, no?
>
> Not if they are *looking for* those characters.
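For illustration, a minimal sketch of the kind of stripping being proposed (in Python rather than SpamAssassin's Perl; the function name and the list of code points are mine, and the list is illustrative, not exhaustive):

```python
import re

# A few zero-width / invisible code points commonly seen in obfuscated mail:
# U+200B ZERO WIDTH SPACE, U+200C ZWNJ, U+200D ZWJ,
# U+2060 WORD JOINER, U+FEFF ZERO WIDTH NO-BREAK SPACE (BOM).
ZERO_WIDTH_RE = re.compile('[\u200b\u200c\u200d\u2060\ufeff]')

def strip_zero_width(text: str) -> str:
    """Delete zero-width characters so body rules and Bayes see the
    word as it is actually displayed to the recipient."""
    return ZERO_WIDTH_RE.sub('', text)
```

So e.g. 'vi<ZWSP>agra' normalizes back to 'viagra', which an ordinary body rule can match.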
But AFAIK we're only looking for those characters with rawbody rules, because it's really hard to search for them in regular body rules... no? I'm not trying to advocate for removal of rawbody rules, but rather for making it easier for normal body rules to work. But RW's suggestion is probably a good one: offer both paths:

On Nov 30, 2018, at 7:46 AM, RW <rwmailli...@googlemail.com> wrote:
>
> It make it harder to write rules detecting these tricks, but it may
> happen eventually. As far as Bayes is concerned, it would be a shame to
> lose the information.

I'm not sure I see how Bayes can take decent advantage of these zero-width chars. If they are interspersed randomly within words, then Bayes has to tokenize each and every permutation (or at least very many permutations) of each word in order to be decently effective. But if the zero-width chars are stripped out, then Bayes only has to tokenize the regular, displayable word. Am I missing something?

But offering both the converted and non-converted options is likely the best approach, and having Bayes work on the normalized version then resolves the above.

--- Amir
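P.S. A quick illustration of the tokenization point (Python, hypothetical six-letter word; the helper is mine): even a single zero-width space inserted at a random position multiplies the distinct raw tokens Bayes would have to learn, while stripping collapses them all back to one token.

```python
WORD = 'viagra'
ZWSP = '\u200b'  # U+200B ZERO WIDTH SPACE

def single_insertions(word: str, zw: str) -> set:
    """Every way to insert one zero-width char between two letters."""
    return {word[:i] + zw + word[i:] for i in range(1, len(word))}

variants = single_insertions(WORD, ZWSP)
# 5 distinct raw tokens for a 6-letter word -- and that's with only
# ONE inserted character; more insertions grow this combinatorially.
stripped = {v.replace(ZWSP, '') for v in variants}
# After stripping, all variants collapse to the single token 'viagra'.
```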