On Thu, 2019-03-21 at 09:23 -0700, John Hardin wrote:
> On Thu, 21 Mar 2019, Savvas Karagiannidis wrote:
>
> > What should be considered is the message's language. All messages
> > that were false positives had the following mime encoding (messages
> > were actually in greek):
> >
> > Content-Type: text/[plain|html]; charset="windows-1253" or
> > Content-Type: text/[plain|html]; charset="iso-8859-7"
> >
> > while all messages that were actual spam and were properly detected
> > had:
> >
> > Content-Type: text/[plain|html]; charset="utf-8"
>
> It should be fairly easy to add an exclusion based on that
> information.
> However, that information may well be leveraged by spammers who are
> using that obfuscation...
>
FWIW roughly 10% of my spam corpus uses <font> tags to set white text. The ratio of "white" to "#ffffff" is 1/3 to 2/3. I should say that some of these messages are quite old. I keep them as test data when I'm writing new rules: they are NOT used for Bayes training.
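As a rough illustration only (this is an assumed sketch, not the tool behind any of the counts in this message), a check for the two spellings could look something like the Python below. The names WHITE_FONT_RE, has_white_font and count_white_font are hypothetical, and the pattern only looks at the color attribute of <font> tags, so white text set via CSS or other tags is not covered.

```python
import re

# Hypothetical pattern: a <font ...> tag whose color attribute is
# "white" or "#ffffff" (quotes optional, case-insensitive).
# The \b before "color" keeps attributes such as bgcolor from matching.
WHITE_FONT_RE = re.compile(
    r"""<font\b[^>]*\bcolor\s*=\s*["']?(white|#ffffff)\b""",
    re.IGNORECASE,
)

def has_white_font(html: str) -> bool:
    """Return True if the HTML sets white text via a <font> tag."""
    return WHITE_FONT_RE.search(html) is not None

def count_white_font(messages) -> int:
    """Count the message bodies (an iterable of strings) that match."""
    return sum(1 for body in messages if has_white_font(body))
```

Run over the decoded bodies of a corpus, count_white_font gives a figure that can be divided by the corpus size to get a hit rate.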
My mail archive contains 192540 messages. In theory it contains no spam apart, that is, from a small amount of spam that has eeled its way in. 145 messages in it contain 'color="white"' and 2293 contain 'color="#ffffff"'. The combination makes up 1.27% of the archived messages.

So it would appear that it may deserve a small score, but it is probably best used as a subrule.

Martin