Matt Kettler wrote:
> If you try to build it off a live feed and use SA's marking as the spam
> criteria, your statistics are useless. Any rule with a high enough score
> would get "perfect" results.. all the mail it matched would be spam, and
> no nonspam. You have, essentially, created a "self fulfilling prophecy".
> The higher-scoring a rule is, the more likely messages that match it
> will be tagged as spam, even if they're not really spam.
>   
Self-correction: such stats aren't "useless"; it depends on what you
want out of them.

If you want to know how accurate a particular rule is by comparing its
spam vs nonspam hit rates, those stats are useless, because SA's own
tagging biases them. You need a manually sorted corpus to get this kind
of information.
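
To make the bias concrete, here is a minimal Python sketch. Everything in
it is made up for illustration: the rule names, the scores, the 5.0
threshold and the five messages are all hypothetical, and this is not how
SA itself computes anything. It just counts hits for one high-scoring rule
twice, once using the scanner's own tagging as the spam criterion and once
using hand-assigned labels.

```python
# Hypothetical threshold, rule names and scores, chosen only for illustration.
THRESHOLD = 5.0
SCORES = {"HYPOTHETICAL_BIG_RULE": 6.0, "HYPOTHETICAL_SMALL_RULE": 1.0}

# Each message: (truly_spam according to a manual sort, rules that hit it).
MESSAGES = [
    (True, ["HYPOTHETICAL_BIG_RULE"]),
    (True, ["HYPOTHETICAL_BIG_RULE", "HYPOTHETICAL_SMALL_RULE"]),
    (False, ["HYPOTHETICAL_BIG_RULE"]),  # a real false positive
    (False, ["HYPOTHETICAL_SMALL_RULE"]),
    (False, []),
]


def tagged_as_spam(rules):
    """Tag a message as spam when its summed rule scores reach the threshold."""
    return sum(SCORES[r] for r in rules) >= THRESHOLD


def hit_counts(rule, use_manual_labels):
    """Return (spam_hits, nonspam_hits) for one rule under the chosen labels."""
    spam_hits = 0
    nonspam_hits = 0
    for truly_spam, rules in MESSAGES:
        if rule not in rules:
            continue
        is_spam = truly_spam if use_manual_labels else tagged_as_spam(rules)
        if is_spam:
            spam_hits += 1
        else:
            nonspam_hits += 1
    return spam_hits, nonspam_hits


# Labels taken from the scanner's own tagging: the rule looks "perfect".
print(hit_counts("HYPOTHETICAL_BIG_RULE", use_manual_labels=False))
# Labels taken from a manual sort: the false positive shows up.
print(hit_counts("HYPOTHETICAL_BIG_RULE", use_manual_labels=True))
```

Because the rule's score alone exceeds the threshold, every message it
hits is tagged as spam, so the first count can never show a nonspam hit.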

If you want to see which rules are getting used a lot, vs those that are
rarely getting used, these stats are quite useful.

If you want a "top x rules" list, sa-stats can do that for you:

http://www.rulesemporium.com/programs/sa-stats.txt

It will parse a spamd logfile and report the most frequently used spam
and nonspam rules (you can configure how many it lists for each).
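
If you just want a feel for what that kind of report involves, the idea
can be sketched in a few lines of Python. This is not sa-stats and does
not reproduce its output. It assumes spamd result lines look roughly like
"result: Y 12 - RULE_A,RULE_B scantime=..." (Y for spam, "." for nonspam);
check that against your own logs before relying on it. The sample lines
and the function name top_rules are invented for the example.

```python
import re
from collections import Counter

# Assumed shape of a spamd result line; verify against your own logfile.
RESULT_RE = re.compile(
    r"result: (?P<verdict>[Y.])\s+(?P<score>-?\d+)\s+-\s+(?P<rules>\S*)"
)


def top_rules(lines, top_n=3):
    """Count rule hits separately for spam and nonspam result lines."""
    spam = Counter()
    nonspam = Counter()
    for line in lines:
        match = RESULT_RE.search(line)
        if match is None:
            continue  # not a result line
        # Drop empty tokens and key=value fields (e.g. when no rule hit).
        rules = [r for r in match.group("rules").split(",") if r and "=" not in r]
        if match.group("verdict") == "Y":
            spam.update(rules)
        else:
            nonspam.update(rules)
    return spam.most_common(top_n), nonspam.most_common(top_n)


# Invented sample lines in the assumed format.
SAMPLE = [
    "spamd: result: Y 12 - BAYES_99,HTML_MESSAGE scantime=0.5",
    "spamd: result: Y 8 - BAYES_99 scantime=0.4",
    "spamd: result: . 0 - HTML_MESSAGE scantime=0.2",
    "spamd: connection from localhost",
]

top_spam, top_nonspam = top_rules(SAMPLE)
print("spam:", top_spam)
print("nonspam:", top_nonspam)
```

Keep in mind these counts use SA's own verdict as the spam criterion, so
they tell you which rules are busy, not which rules are accurate.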
