Re: Score Hit Frequency in SA Corpus?

Matt Kettler Sat, 20 Sep 2008 18:27:03 -0700

Bob Proulx wrote:
> Are the hit frequencies from the SpamAssassin corpus available on the
> web somewhere?  I looked through the docs and wiki but didn't see it
> if they were.
>
> What is the hit frequency in the corpus of SUBJ_ALL_CAPS scoring 2.1?
> I wanted to know so that I could educate a sender that using all caps
> in a long subject makes it look significantly like spam but couldn't
> deduce the statistical numbers.
>   
They're included in the distribution tarball. In the rules subdirectory,
check out STATISTICS-setX.txt, where X is the scoreset whose stats you're
interested in.


You can also grab them from the web image of SVN:

http://svn.apache.org/repos/asf/spamassassin/branches/3.2/rules/

And for what it's worth, the S/O is 0.855 in set 3.
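
For anyone reading the STATISTICS files for the first time, here is a rough sketch of how I understand the S/O column to be derived: the spam hit rate divided by the sum of the spam and ham hit rates. The function name and the sample percentages are made up for illustration; they are not taken from the corpus.

```python
def s_over_o(spam_hit_pct, ham_hit_pct):
    """Spam-over-overall ratio for one rule.

    Assumption: S/O = spam% / (spam% + ham%), where each percentage is the
    share of that corpus (spam or ham) the rule hit.
    Returns None when the rule hit nothing at all.
    """
    total = spam_hit_pct + ham_hit_pct
    if total == 0:
        return None
    return spam_hit_pct / total


# Illustrative numbers only: a rule hitting 3% of spam and 1% of ham.
example = s_over_o(3.0, 1.0)
```

An S/O near 1.0 means the rule hits almost exclusively spam; a value near 0.5 means it hits spam and ham at about the same rate.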

However, bear in mind, scores are not assigned based on the S/O of the
rule alone. The whole ruleset is scored collectively as one giant group,
and tuned to get the best results.

A rule with a high-ish score and a not-so-great S/O suggests that this
rule's false positives commonly coincide with strong negative-scoring
rules. Based on that, the score assignment system will give it an
"unfairly high" score, because doing so results in fewer FPs than
assigning a higher score to some other rule that has a better S/O but
whose nonspam hits are not compensated by negative-scoring rules and
would therefore result in more FPs.
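
A toy illustration of that effect, with entirely hypothetical rule names, scores, and messages (none of this is real corpus data): RULE_A hits more ham than RULE_B, but its ham hits always come with a negative-scoring rule, so giving RULE_A the high score costs fewer false positives.

```python
THRESHOLD = 5.0  # assumed spam threshold for this sketch

# Hypothetical ham messages, each represented by the set of rules it hit.
ham_messages = [
    {"RULE_A", "NEG_RULE"},  # RULE_A ham hit, offset by a negative rule
    {"RULE_A", "NEG_RULE"},  # same again
    {"RULE_B"},              # RULE_B ham hit with nothing to offset it
]


def false_positives(scores, messages, threshold=THRESHOLD):
    """Count ham messages whose summed rule scores reach the threshold."""
    return sum(
        1
        for hits in messages
        if sum(scores.get(rule, 0.0) for rule in hits) >= threshold
    )


# Two candidate score assignments: put the high score on A, or on B.
high_a = {"RULE_A": 5.0, "RULE_B": 1.0, "NEG_RULE": -3.0}
high_b = {"RULE_A": 1.0, "RULE_B": 5.0, "NEG_RULE": -3.0}

fp_high_a = false_positives(high_a, ham_messages)
fp_high_b = false_positives(high_b, ham_messages)
```

With these made-up inputs, the assignment that favors RULE_A flags no ham, while the one that favors RULE_B flags the uncompensated message.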

The whole thing gets quite complicated, but once you realize that every
rule's score is a function not only of its own hit rate but also of
which other rules it gets grouped with, you start to get a feel for
what's going on. Of course, strictly evaluating all combinations of all
rules would be very computationally expensive, which is why we use a
perceptron, which generates an estimate. (I believe it's a successive
approximation type deal, but I'm not deeply familiar with its internal
workings.)
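
To give a flavor of the successive-approximation idea, here is a generic textbook perceptron-style update applied to rule scores. This is not SpamAssassin's actual scoring code; the function name, learning rate, threshold, and training messages are all assumptions chosen for illustration.

```python
def train_scores(messages, rules, threshold=5.0, step=0.5, epochs=100):
    """Nudge rule scores until messages land on the right side of threshold.

    messages: list of (set_of_rule_names, is_spam) pairs.
    Each misclassified message moves the scores of the rules it hit
    up (missed spam) or down (flagged ham) by a fixed step.
    """
    scores = {rule: 0.0 for rule in rules}
    for _ in range(epochs):
        for hits, is_spam in messages:
            total = sum(scores[rule] for rule in hits)
            predicted_spam = total >= threshold
            if predicted_spam != is_spam:
                delta = step if is_spam else -step
                for rule in hits:
                    scores[rule] += delta
    return scores


# Hypothetical training set: RULE_A also hits ham, but only with NEG_RULE.
training = [
    ({"RULE_A"}, True),
    ({"RULE_A", "NEG_RULE"}, False),
    ({"RULE_B"}, True),
]
learned = train_scores(training, ["RULE_A", "RULE_B", "NEG_RULE"])
```

Note that the scores are adjusted together, message by message, so NEG_RULE ends up negative only because it co-occurs with RULE_A on ham. No rule is scored in isolation.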
