From: <[EMAIL PROTECTED]> Kristopher Austin wrote:
RANK  RULE NAME               COUNT  %OFRULES  %OFMAIL  %OFSPAM  %OFHAM
------------------------------------------------------------
   1  HTML_MESSAGE            45870      5.13    27.72    70.37   55.36
Wait... so 27% of all mail is HTML, 70% of spam is HTML, and 55% of ham
is HTML?

<<jdow>> So what's the problem? (He's not running Bayes or it's badly
broken, though.)

TOP SPAM RULES FIRED
------------------------------------------------------------
RANK  RULE NAME               COUNT  %OFRULES  %OFMAIL  %OFSPAM  %OFHAM
------------------------------------------------------------
   1  BAYES_99                  962      4.81    32.97    93.04    0.11
   2  RCVD_IN_XBL               574      2.87    19.67    55.51    0.05
   3  HTML_MESSAGE              571      2.86    19.57    55.22    7.91
   4  URIBL_BLACKB              563      2.82    19.29    54.45    0.05
   5  URIBL_JP_SURBL            484      2.42    16.59    46.81    0.00
   6  URIBL_SC_SURBL            479      2.40    16.42    46.32    0.00
   7  RCVD_IN_BL_SPAMCOP_NET    440      2.20    15.08    42.55    3.34
   8  URIBL_OB_SURBL            409      2.05    14.02    39.56    0.00
   9  URIBL_WS_SURBL            403      2.02    13.81    38.97    0.00
  10  URIBL_SBL                 397      1.99    13.61    38.39    0.05
  11  URIBL_AB_SURBL            368      1.84    12.61    35.59    0.00
  12  JD_TO_EARTHLINK           357      1.79    12.23    34.53    2.71
  13  RCVD_IN_SORBS_DUL         270      1.35     9.25    26.11    0.53
  14  RCVD_IN_DSBL              253      1.27     8.67    24.47    0.00
  15  URIBL_XS_SURBL            241      1.21     8.26    23.31    0.00
  16  LW_MULT_RECIP3            237      1.19     8.12    22.92    2.60
  17  JD_MY_NAME                231      1.16     7.92    22.34    2.71
  18  DNS_FROM_RFC_POST         194      0.97     6.65    18.76    0.11
  19  MIME_HTML_ONLY            192      0.96     6.58    18.57    0.80
  20  JD_TO_EARTHLINKCOM        189      0.95     6.48    18.28    0.00
------------------------------------------------------------

TOP HAM RULES FIRED
------------------------------------------------------------
RANK  RULE NAME               COUNT  %OFRULES  %OFMAIL  %OFSPAM  %OFHAM
------------------------------------------------------------
   1  BAYES_00                 1654     20.19    56.68     0.39   87.79
   2  JD_LKML_RELAY             787      9.61    26.97     0.77   41.77
   3  JD_PATCH_SUBJ             316      3.86    10.83     0.00   16.77
   4  RATWR10a_MESSID           287      3.50     9.84     2.71   15.23
   5  JD_CHICKENPOX             247      3.02     8.46    11.90   13.11
   6  NOT_TO_ME                 231      2.82     7.92    16.54   12.26
   7  RCVD_BY_IP                183      2.23     6.27    11.70    9.71
   8  HTML_MESSAGE              149      1.82     5.11    55.22    7.91
   9  UHS_BCW                   135      1.65     4.63     0.10    7.17
  10  SARE_MSGID_LONG40         120      1.46     4.11     0.19    6.37
  11  JD_MANGY_MORTGAGES        118      1.44     4.04    11.61    6.26
  12  USER_IN_WHITELIST         111      1.35     3.80     0.00    5.89
  13  JD_GENERIC                 90      1.10     3.08     0.87    4.78
  14  BAYES_50                   78      0.95     2.67     0.97    4.14
  15  HELO_EQ_LT4_SA             76      0.93     2.60     4.16    4.03
  16  BAYES_20                   63      0.77     2.16     0.10    3.34
  17  RCVD_IN_BL_SPAMCOP_NET     63      0.77     2.16    42.55    3.34
  18  FM_MULTI_ODD2              61      0.74     2.09     6.09    3.24
  19  WHITELIST_NTDEV            61      0.74     2.09     0.00    3.24
  20  JD_MANGY_MORTGAGE          60      0.73     2.06     0.48    3.18
------------------------------------------------------------
==========8<-------------

Note that there are a lot of rules in there that I intentionally score
at the 0.01 point level so I see them explicitly. I use them in meta
rules to create some interesting special cases that are rather
effective. (And as for the HTML - consider that I am on a lot of mailing
lists that are basically text only and have high volume.)

Note that the results are a little "skewed" from reality. I live with a
fairly high false positive rate for the LKML nonsense. So that will
affect the false positive rate on some of the BLs, for example.

{^_-}
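
For anyone puzzling over how the percentage columns relate to each other, here is a rough Python cross-check using the HTML_MESSAGE rows from the two tables above. The message totals are not stated anywhere in the post; they are back-computed from the BAYES_99 and BAYES_00 rows, so treat them (and the helper name `pct`) as assumptions rather than anything the stats script itself reports.

```python
# Totals are NOT given in the post; they are inferred by dividing a
# rule's hit count by its percentage, then rounding to whole messages.
SPAM_TOTAL = round(962 / 0.9304)   # BAYES_99: 962 hits = 93.04% of spam
HAM_TOTAL = round(1654 / 0.8779)   # BAYES_00: 1654 hits = 87.79% of ham
MAIL_TOTAL = SPAM_TOTAL + HAM_TOTAL


def pct(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


# HTML_MESSAGE hit counts as listed in the spam and ham tables.
html_spam_hits = 571
html_ham_hits = 149

print("inferred spam msgs:", SPAM_TOTAL)
print("inferred ham msgs: ", HAM_TOTAL)
print("%OFSPAM:", pct(html_spam_hits, SPAM_TOTAL))
print("%OFHAM: ", pct(html_ham_hits, HAM_TOTAL))
print("%OFMAIL from spam hits:", pct(html_spam_hits, MAIL_TOTAL))
print("%OFMAIL from ham hits: ", pct(html_ham_hits, MAIL_TOTAL))
```

If the inferred totals are right, the %OFMAIL figure in each table appears to be that table's hit count divided by all mail, which would explain why HTML_MESSAGE shows 19.57 in the spam table and 5.11 in the ham table while %OFSPAM and %OFHAM are identical in both.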