Matt Kettler wrote:
> If you try to build it off a live feed and use SA's marking as the spam
> criteria, your statistics are useless. Any rule with a high enough score
> would get "perfect" results.. all the mail it matched would be spam, and
> no nonspam. You have, essentially, created a "self fulfilling prophecy".
> The higher-scoring a rule is, the more likely messages that match it
> will be tagged as spam, even if they're not really spam.
>

Self correction. Such stats aren't "useless", it depends on what you want out of them.

If you want to know how accurate a particular rule is by comparing its spam vs nonspam hit rates, those stats are useless because of the bias: SA's own verdict is the label, so a high-scoring rule pushes the messages it hits over the threshold and then gets credited with hitting only spam. You need a manually sorted corpus to get this kind of information.

If you want to see which rules are getting used a lot vs those that are rarely getting used, these stats are quite useful.

If you want a "top x rules" list, sa-stats can do that for you:

http://www.rulesemporium.com/programs/sa-stats.txt

It will parse a spamd logfile and report the most frequently used spam and nonspam rules (and you can configure how many it will list for each).
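
To make the bias point concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the rule names BIG_RULE and SMALL_RULE, their scores, the six toy messages, and the 5.0 threshold are assumptions, not real SA rules or data. It counts each rule's spam/nonspam hits twice, once using the filter's own verdict as the label and once using a hand-sorted label.

```python
# Toy illustration only: rule names, scores and messages are hypothetical.
THRESHOLD = 5.0  # assumed spam threshold for this sketch
SCORES = {"BIG_RULE": 6.0, "SMALL_RULE": 1.0}

# Each entry: (set of rules the message hit, is it really spam per manual sorting?)
MESSAGES = [
    ({"BIG_RULE"}, True),
    ({"BIG_RULE", "SMALL_RULE"}, True),
    ({"BIG_RULE"}, False),    # a real false positive
    ({"SMALL_RULE"}, True),   # spam the filter misses
    ({"SMALL_RULE"}, False),
    (set(), False),
]


def filter_says_spam(rules):
    """Verdict of the toy filter: sum of rule scores reaches the threshold."""
    return sum(SCORES[r] for r in rules) >= THRESHOLD


def hit_counts(rule, use_manual_labels):
    """Return (spam_hits, nonspam_hits) for one rule under the chosen labelling."""
    spam = nonspam = 0
    for rules, really_spam in MESSAGES:
        if rule not in rules:
            continue
        is_spam = really_spam if use_manual_labels else filter_says_spam(rules)
        if is_spam:
            spam += 1
        else:
            nonspam += 1
    return spam, nonspam


for rule in SCORES:
    print(rule,
          "filter-labelled:", hit_counts(rule, use_manual_labels=False),
          "hand-sorted:", hit_counts(rule, use_manual_labels=True))
```

With the filter's verdict as the label, BIG_RULE can never show a nonspam hit, since its score alone crosses the threshold. The hand-sorted labels expose its false positive. The raw hit totals per rule are the same either way, which is why "how often is this rule used" stats remain meaningful on a live feed.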