At 10:29 AM 11/7/2003, Maarten J H van den Berg wrote:

> Sorry if this has been discussed in the past...

It's been discussed many times. It's very common for people to deeply misunderstand how SA scoring works. Most people fall into the trap of oversimplifying the problem and assuming that some rule or another "must" be a good spam rule when in fact it's not.


> Of course this is open to debate, but then again that's all I want;
> possibly a debate about how accurate the scoring is right now...

That's fine, but in the next round you're going to have to do a LOT more homework. You're oversimplifying things by merely looking at the name of the rule. You're not looking at its performance levels, its impact on nonspam, or its interactions with other rules.


Questioning the accuracy of the scoring system isn't unreasonable, but the scoring system is VASTLY more complicated than you can understand in a few hours of study. You need a good understanding of how it really works, and of just how delicate the balance of the scoring system is, before you can make reasonable judgements about accuracy.

You need to realize the SA scoring system is somewhat analogous to curve fitting an equation with 873 variables (there are 873 rules in SA 2.60's 50_scores.cf). This is done as an approximation using a genetic algorithm to evolve a solution, since a direct solution would take too long to compute. Trying to get your mind completely around an equation with that many variables is not possible for most humans, including me, but I've learned to understand and respect how complex the problem is.


> List 1:
> score ALL_CAP_PORN 0.650 0.669 0 0
> score PENIS_ENLARGE2 0.500 0.590 0 0.501
> score UPPERCASE_50_75 0.794 1.137 0 0
> score V+AG+A_ONLINE 1.100 1.101 3.151 4.056
>
> If it were up to me, I'd say that giving only half a point to a mail that
> scores PENIS_ENLARGE2 is...  well, ludicrous.  Let's not kid ourselves.
> IF there are people who participate on a genuine mailing list that
> discusses penis enlargement, let the burden fall on them to put those
> addresses in their whitelist, not the reverse.

OK, given that it's not up to you, let's look at the real-world performance of these rules from STATISTICS.txt:


OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
   1.010   1.5010   0.0893  0.944   0.80   0.65  ALL_CAP_PORN
   2.962   4.5216   0.0418  0.991   0.93   0.50  PENIS_ENLARGE2
   0.580   0.8552   0.0645  0.930   0.77   0.79  UPPERCASE_50_75
   1.040   1.5930   0.0032  0.998   0.95   1.10  V+AG+A_ONLINE

*yawn*. None of these rules has a particularly impressive hit rate, so they aren't very significant in the grand scheme of SA. Hitting a meager 4.5% of spam isn't impressive, although it's not useless.

Some of them, such as ALL_CAP_PORN and UPPERCASE_50_75, have really bad quantities of nonspam hits. Anything with an S/O under 0.90 pretty much doesn't deserve a high score, because more than 10% of the email that the rule matches is nonspam. In the case of these two, both have at least 20% of their hits being nonspam mail. Ouch.

Quite frankly, UPPERCASE_50_75 performs so badly it doesn't even meet the criteria to avoid being dropped from the ruleset, but it is probably retained for completeness with the other rules. (In general, spam rules need to have an S/O of 0.80 or higher to be deemed "worthwhile". Anything less isn't a very good indicator of spam and is just a waste of time.)
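For anyone who wants to check the S/O column themselves: the figures in the table above appear to be consistent with SPAM% / (SPAM% + HAM%). That formula is my assumption from the numbers, not something quoted from STATISTICS.txt, and the function name below is made up for illustration. A minimal Python sketch:

```python
# (SPAM%, HAM%, S/O) copied from the STATISTICS.txt excerpt above.
rows = {
    "ALL_CAP_PORN":    (1.5010, 0.0893, 0.944),
    "PENIS_ENLARGE2":  (4.5216, 0.0418, 0.991),
    "UPPERCASE_50_75": (0.8552, 0.0645, 0.930),
    "V+AG+A_ONLINE":   (1.5930, 0.0032, 0.998),
}

def s_over_o(spam_pct, ham_pct):
    """Assumed formula: spam hit rate over the sum of spam and ham hit rates."""
    return spam_pct / (spam_pct + ham_pct)

# Recompute S/O for each rule, rounded to the table's three decimals.
computed = {name: round(s_over_o(spam, ham), 3)
            for name, (spam, ham, _listed) in rows.items()}

for name, value in computed.items():
    print(f"{name:16s} listed={rows[name][2]:.3f} computed={value:.3f}")
```

Because the inputs are per-corpus percentages, the result is normalized for the relative sizes of the spam and nonspam corpora rather than being a raw message count.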

In the case of the other two, you need to start looking at the larger ecosystem of the entire ruleset. SA rules are not scored based on the merits of the rule alone. The entire ruleset is scored together, and the scores of all the rules are tuned to try to get the most spam and nonspam placed in the proper piles.

Often the score of a rule is the result of its interaction with other rules. Take our PENIS_ENLARGE2 rule. This rule can quite possibly match some nonspam crude joke emails. Other spam rules will likely match these as well, resulting in a high score.

Now, the GA is designed to treat false positives as 100 times worse than false negatives, so this is a very drastic situation for the GA. Faced with this problem, the proper thing for the GA to do is to try to reduce the score of the rule that affects the smallest share of the spam pile. Well, given that PENIS_ENLARGE2 only matches 4.5% of spam, it's a good candidate for reduction.
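To make that trade-off concrete, here is a toy Python sketch of the kind of cost the GA is minimizing. This is not the real GA code: the corpus, the OTHER_RULE_* names and all of the scores are invented for illustration, and I'm assuming a message is flagged when the sum of its rule scores reaches the default threshold of 5.0.

```python
THRESHOLD = 5.0   # assumed: SA's default required score
FP_WEIGHT = 100   # a false positive counts as 100 false negatives (per the text)

def total_score(hits, scores):
    """A message's score is the sum of the scores of the rules it hits."""
    return sum(scores[rule] for rule in hits)

def cost(scores, corpus):
    """Weighted error count over a hand-classified corpus."""
    false_pos = false_neg = 0
    for hits, is_spam in corpus:
        flagged = total_score(hits, scores) >= THRESHOLD
        if flagged and not is_spam:
            false_pos += 1
        elif is_spam and not flagged:
            false_neg += 1
    return FP_WEIGHT * false_pos + false_neg

# Hypothetical corpus: (rules hit, is it really spam?)
corpus = [
    ({"PENIS_ENLARGE2", "OTHER_RULE_A", "OTHER_RULE_B"}, False),  # crude joke mail
    ({"PENIS_ENLARGE2", "OTHER_RULE_A", "OTHER_RULE_B", "OTHER_RULE_C"}, True),
    ({"OTHER_RULE_A", "OTHER_RULE_B", "OTHER_RULE_C"}, True),
    ({"PENIS_ENLARGE2", "OTHER_RULE_C"}, True),
]

# Two candidate score sets differing only in PENIS_ENLARGE2.
high = {"PENIS_ENLARGE2": 3.0, "OTHER_RULE_A": 1.5,
        "OTHER_RULE_B": 1.5, "OTHER_RULE_C": 2.5}
low = {**high, "PENIS_ENLARGE2": 0.5}

print("cost with high score:", cost(high, corpus))
print("cost with low score: ", cost(low, corpus))
```

With the high score the joke mail is flagged, and that single false positive dominates the cost. With the low score one real spam slips through instead, which the weighting treats as far cheaper, so a fitter like the GA would prefer the low score even though the rule "obviously" looks spammy.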

_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
