-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello Jack,

Saturday, July 5, 2003, 6:50:05 PM, you wrote:

>> Train bayes. Everyone has a different bayes db, and they can't
>> work around that centrally.

JG> The problem I'm seeing is that I'm getting messages with a Bayes of
JG> 90% but it still slips through with 4.5-5.

Bayes is conservative. Of 3,629 emails I've logged with Bayes_90, only one has been not-spam (from Enterprise Rent-a-car, a 3-line email asking for confirmation of an email address).

I run my system conservatively also, with a required-hits of 9 instead of the default 5. I've raised the score for Bayes_90 to 7 (the default is either 4 or 3, depending on other testing methods), and the score for Bayes_99 to 9 (IOW, Bayes_99 IS spam). I'm considering raising the score for Bayes_90 to somewhere in the 7.5 to 8.0 range.

Any spam which scores under 10, and isn't already Bayes_99, gets fed back into Bayes to be learned as spam. (Likewise, any non-spam with a score over +1 is fed into Bayes to be learned as ham.) Bayes works wonders with these.

I've also modified various other scores from the defaults, about 140 of them. Of those 140, I've reduced the scores for 3, and raised the scores for the rest. Some of these increases are probably required only because of my required-hits = 9, but others are because the distributed scores are themselves conservative, concentrating on avoiding false positives for EVERYONE, and I can be a bit more aggressive because I know things about my users (eg: *none* of them are interested in porn, none are in/near bankruptcy, etc).

Most notable among those:

CONFIRMED_FORGED      -- 7.00
FAKED_UNDISC_RECIPS   -- 6.00
FORGED_AOL_RCVD       -- 5.50
FORGED_MUA_OIMO       -- 5.10
FORGED_RCVD_TRAIL     -- 7.00
FROM_OFFERS           -- 5.00
HGH                   -- 9.44
LOW_INTEREST          -- 6.23
NIGERIAN_BODY         -- 9.10
PENIS_ENLARGE         -- 5.00
PENIS_ENLARGE2        -- 5.00
RATWARE_EGROUPS       -- 9.43
RATWARE_OE_MALFORMED  -- 4.80
RCVD_FAKE_HELO_DOTCOM -- 5.50
REVERSE_AGING         -- 6.21
SUSPICIOUS_RECIPS     -- 4.00
TO_MALFORMED          -- 4.10
TO_NO_USER            -- 4.30
VIAGRA                -- 5.00
WITH_LC_SMTP          -- 6.25

(If anyone's interested, the ones I've reduced are: FROM_ENDS_IN_NUMS, FROM_NO_LOWER, and NO_REAL_NAME.)

The way I determine these is by watching for false negatives, spam that slips through. I see which tests were matched, check what their scores are, scan my own corpus to determine for myself whether these tests are also matched by non-spam, and from that decide how much to raise the scores. Some tests which are only suggestive of spam I don't raise at all, or just minimally (0.1 or 0.2 max). Some tests which match only spam on my system are increased until they're about 50% of the required-hits (being careful), while others are increased to or over the required-hits value (being confident).

My corpus now contains a good 8k spam, 90% of which has gone through SA and so has test names in the headers (slightly over two months' worth), and 10k non-spam (my personal email for this year, plus the past month's email for other accounts I'm using SA against). My email client allows reg-ex searches, so I'm able to simulate many SA tests without running SA. It helps.

Also, check out William Stearns' collected blacklist at http://www.stearns.org/sa-blacklist/sa-blacklist.current -- it's a marvelous resource which traps a whole lot of spam that would otherwise sneak through.

JG> But, keep it in proportion. I'm still trapping over 98%.

Just tweaking the distributed tests, using Stearns' blacklists, and adding a few blacklisted entries of my own (recently submitted to Stearns) brought me to the 99% mark.

Two weeks ago I began creating my own rules to catch the rest. This week I have received 2002 emails, 13 of which were false negatives. That's 99.35%, and each and every one of those false negatives is now caught by some combination of the above tweaking.

I actually had one day this week when I had zero false negatives -- the first day in years that I didn't receive spam at any email account. I'm looking forward to maybe having a whole week later this summer when no spam sneaks through.

Bob Menschel

-----BEGIN PGP SIGNATURE-----
Version: PGP 8.0

iQA/AwUBPwerIJebK8E4qh1HEQJDVACghpnYWCr3Ay2NkjYkdOmJYlfHmsAAn0Sm
Urg+IcRfpDlorCgVHB8wRIPq
=5GIv
-----END PGP SIGNATURE-----

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk