Bob Proulx wrote:
> Are the hit frequencies from the SpamAssassin corpus available on the
> web somewhere? I looked through the docs and wiki but didn't see it
> if they were.
>
> What is the hit frequency in the corpus of SUBJ_ALL_CAPS scoring 2.1?
> I wanted to know so that I could educate a sender that using all caps
> in a long subject makes it look significantly like spam but couldn't
> deduce the statistical numbers.

It's included in the distribution tarball. In the rules subdirectory, check out STATISTICS-setX.txt, where X is the scoreset you're interested in the stats for.
You can also grab them from the web image of SVN:

http://svn.apache.org/repos/asf/spamassassin/branches/3.2/rules/

And for what it's worth, the S/O is 0.855 in set 3.

However, bear in mind that scores are not assigned based on the S/O of the rule alone. The whole ruleset is scored collectively as one giant group, and tuned to get the best results.

A rule with a high-ish score and a not-so-great S/O suggests that the rule's false positives commonly coincide with strong negative-scoring rules. Based on that, the score assignment system will give it an "unfairly high" score, because doing so results in fewer FPs than assigning a higher score to some other rule that has a better S/O but whose nonspam hits are not compensated by negative-scoring rules, and which would therefore cause more FPs.

The whole thing gets quite complicated, but once you realize that every rule's score is a function not only of its own hit rate but also of which other rules it gets grouped with, you start to get a feel for what's going on.

Of course, strictly evaluating all combinations of all rules would be very computationally expensive, which is why we use a perceptron, which generates an estimate. (I believe it's a successive-approximation type deal, but I'm not deeply familiar with its internal workings.)
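
To make the point about compensated false positives concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the rule names (GENERIC, CAPS, OTHER, TRUSTED), the scores, the 5.0 threshold, and the 20-message corpus are assumptions, not real SpamAssassin data, and this is not how the actual score generator works. S/O is taken here as the fraction of a rule's hits that land on spam; the toy corpus has equal numbers of spam and ham, so raw counts are enough.

```python
# Toy illustration only: rule names, scores, and corpus are hypothetical.
THRESHOLD = 5.0

# Each entry: (is_spam, rules hit, number of messages with that pattern).
CORPUS = [
    (True,  ("GENERIC", "CAPS"), 6),
    (True,  ("GENERIC", "OTHER"), 4),
    (False, ("GENERIC", "CAPS", "TRUSTED"), 2),  # CAPS ham hits also hit a negative rule
    (False, ("GENERIC", "OTHER"), 1),            # OTHER ham hit has nothing to offset it
    (False, (), 7),
]


def s_o(rule):
    """Fraction of the rule's hits that are spam (spam hits / all hits)."""
    spam = sum(n for is_spam, rules, n in CORPUS if is_spam and rule in rules)
    ham = sum(n for is_spam, rules, n in CORPUS if not is_spam and rule in rules)
    return spam / (spam + ham)


def errors(scores):
    """Return (false positives, false negatives) for a score assignment."""
    fp = fn = 0
    for is_spam, rules, n in CORPUS:
        flagged = sum(scores[r] for r in rules) >= THRESHOLD
        if flagged and not is_spam:
            fp += n
        elif is_spam and not flagged:
            fn += n
    return fp, fn


FIXED = {"GENERIC": 3.0, "TRUSTED": -3.0}
FAVOR_CAPS = {**FIXED, "CAPS": 2.1, "OTHER": 1.0}   # higher score on the worse S/O
FAVOR_OTHER = {**FIXED, "CAPS": 1.0, "OTHER": 2.1}  # higher score on the better S/O

if __name__ == "__main__":
    print("S/O CAPS :", s_o("CAPS"))    # 0.75
    print("S/O OTHER:", s_o("OTHER"))   # 0.8
    print("favor CAPS  (FP, FN):", errors(FAVOR_CAPS))    # (0, 4)
    print("favor OTHER (FP, FN):", errors(FAVOR_OTHER))   # (1, 6)
```

In this toy set, OTHER has the better S/O (0.8 versus 0.75), yet giving the higher score to CAPS produces no false positives, because every nonspam message that hits CAPS also hits the negative-scoring TRUSTED rule. Giving the higher score to OTHER pushes its one uncompensated nonspam hit over the threshold.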