On Mon, 2004-05-17 at 12:40, Theo Van Dinter wrote:
> On Sun, May 16, 2004 at 12:47:38PM -0500, Chris Thielen wrote:
> > My first results should be early tomorrow morning. If anything looks
> > fishy, let me know.
>
> I usually look at the resulting statistics file from my nightly run and
> checks for FPs on the top rules. My runs have the top rules mostly with
> 0 FPs, so it's easy to spot issues of misfiled spam.
>
> For instance, you have FPs for some of the DRUGS* rules, MPART_ALT_DIFF,
> etc.

OK, looking over my results...

First, I realized my corpus isn't as clean as I had thought. My ham corpus has been tainted by some non-expunged spam that was exported as ham. I'm re-exporting tonight and will be more careful to expunge before exporting.

Second, where is the appropriate place to discuss FPs? Bugzilla, sa-dev, or elsewhere?

Third, I'm wondering what the thinking is on the age of ham corpora. I'm getting several FPs on (for instance) MPART_ALT_DIFF, some of which are from older ham (a few spammy-looking legitimate mailings from nextcard.com in 2000). Do I purge these messages from my corpus, assuming they're from a broken ancient mailer, or should they be tallied as usual? Do I simply narrow my ham corpus to messages 6 months old or younger, like my spam corpus? A quick chat with DQ on freenode indicated he uses both his full corpus and a smaller/newer subset, depending on the occasion.

-- 
Chris Thielen

Easily generate SpamAssassin rules to catch obfuscated spam phrases
(0BFU$C/\TED SPA/\/\ P|-|RA$ES):
http://www.sandgnat.com/cmos/

Keep up to date with the latest third party SpamAssassin Rulesets:
http://www.exit0.us/index.php/RulesDuJour
