Mike, I suspect you are using the wrong criterion in removing some of the
rules. Unfortunately, none of the log readers seem to store the most
interesting bit of information: how many times did the SARE rules make
the critical difference in marking a spam message as spam? I find they
make enough of a difference to be worth keeping around here, particularly
the stocks rules.

I'll take you up on that. I've attached a Perl script I used to look through my last week's worth of mail logs. (It's a bit sloppy in parts, but it was meant to get results, not for real distribution.)

The script gets the name of each rule, the file it's in, and its score (the sole score if that's all there is, or the fourth score if there's more than one, since I use both Bayes and network tests). It then looks in the logs for spam results and for the required hits a message needs to be considered spam. (I throw out user bb, since I use a Big Brother test to check whether spamd is functioning properly, and it only checks two messages: one ham I received a while back and a GTUBE message.) Finally, it deducts each rule's score from the message's total hits and notes whether that rule's score pushed the message over the required-hits threshold.

Sound good? So, out of 163 spam messages, here are the files whose rules pushed spams over the edge (files with no rules that pushed a message over the threshold are omitted):

20_advance_fee.cf 1
20_body_tests.cf 5
20_drugs.cf 2
20_fake_helo_tests.cf 2
20_head_tests.cf 23
20_html_tests.cf 16
20_meta_tests.cf 4
20_net_tests.cf 1
20_phrases.cf 5
20_ratware.cf 1
20_uri_tests.cf 6
23_bayes.cf 5
25_domainkeys.cf 3
25_pyzor.cf 1
25_razor2.cf 4
25_replace.cf 1
25_spf.cf 3
25_uribl.cf 6
70_sare_adult.cf 1
70_sare_oem.cf 1
70_sare_specific.cf 3
70_sare_spoof.cf 3
70_sare_stocks.cf 5
72_sare_redirect_post3.0.0.cf 1
99_sare_fraud_post25x.cf 1

(There were two spams that had an unidentified rule file. I'll assume those were the SARE header rules that I already removed.)
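The deduction logic described above can be sketched roughly as follows. This is a hypothetical Python reimplementation, not the attached Perl script; the function and rule names are made up for illustration, and it assumes the rule scores and per-message hit lists have already been parsed out of the config files and logs.

```python
# Hypothetical sketch of the "did this rule push the message over the
# edge" check. A rule is critical for a spam message if subtracting
# that one rule's score would drop the total below the required hits.

def critical_rules(hit_rules, total_score, required, scores):
    """Return the rules whose score made the difference between
    spam and non-spam for one message."""
    if total_score < required:
        return []  # not flagged as spam; no rule was critical
    critical = []
    for rule in hit_rules:
        if total_score - scores.get(rule, 0.0) < required:
            critical.append(rule)
    return critical

# Example: a message scoring 6.2 against a required-hits threshold of 5.0.
# Rule names and scores here are invented, not real SpamAssassin values.
scores = {"SARE_STOCKS_X": 2.5, "HTML_IMAGE_ONLY": 1.0, "BAYES_99": 2.7}
hits = ["SARE_STOCKS_X", "HTML_IMAGE_ONLY", "BAYES_99"]
print(critical_rules(hits, 6.2, 5.0, scores))
# -> ['SARE_STOCKS_X', 'BAYES_99']
```

Note that more than one rule can be "critical" for the same message, so the per-file counts above can overlap rather than summing neatly to the spam total.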

From that I would infer that the SARE stocks ruleset is the most effective of my SARE add-ons: it was responsible for 5 of the 163 spams being identified. That leaves the other files I use - 70_sare_bayes_poison_nxm.cf, 70_sare_html0.cf, 70_sare_obfu0.cf, 70_sare_random.cf, 70_sare_whitelist_rcvd.cf, 70_sare_whitelist_spf.cf, and 70_sc_top200.cf - which, by the parameters of this test, were not responsible for identifying a single spam. That's not the goal of the whitelist ones, but the others? I still wonder whether they're really effective enough *for* *my* *user* *base* to justify the resources to run them, and I'd be curious whether anyone else has produced similar stats and come to the same conclusion.

Attachment: blah.pl
Description: Binary data