Re: Default SpamAssassin scores don't make sense
Matt Kettler wrote:
> [snip -- Matt's full analysis of the SPF, DATE_IN_FUTURE, and
> HTML_OBFUSCATE statistics appears in full later in this digest]
>
> Here the S/Os show a clear upward trend. However, the hit-rates at
> the upper end are very low. That's probably what's suppressing the
> scores of 60_70 and higher. They just don't hit enough mail to be
> relevant.

Yep. It may also be that they hit only spam that is *already* scoring over 10 points -- at that stage, there's no point in adding to the score, so whatever value the perceptron assigns to it would have no real effect. Therefore the perceptron is free to assign low scores.

--j.
Default SpamAssassin scores don't make sense
(Re-sending this email; the last one, sent 10/30 15:19 EST, never made it to the list, even though another message sent only half an hour later was posted successfully.)

Why do default scores not increase with severity? For example, SpamAssassin 3.1.7 has inconsistent progressions of default scores for HTML obfuscation, dates set in the future, and SPF marking:

score HTML_OBFUSCATE_05_10  1.421 1.169 1.522 1.449
score HTML_OBFUSCATE_10_20  1.936 1.397 2.371 1.770
score HTML_OBFUSCATE_20_30  2.720 2.720 3.145 3.400
score HTML_OBFUSCATE_30_40  2.480 2.480 2.867 2.859
score HTML_OBFUSCATE_40_50  2.160 2.160 2.498 2.640
score HTML_OBFUSCATE_50_60  2.049 2.061 2.342 2.031
score HTML_OBFUSCATE_60_70  1.637 1.592 1.892 1.652
score HTML_OBFUSCATE_70_80  1.440 1.507 1.680 1.472
score HTML_OBFUSCATE_80_90  1.244 1.191 1.397 0.982
score HTML_OBFUSCATE_90_100 0
# n=0 n=1 n=2 n=3
score DATE_IN_FUTURE_03_06  2.061 2.007 2.275 1.961
score DATE_IN_FUTURE_06_12  1.680 1.498 1.883 1.668
score DATE_IN_FUTURE_12_24  2.320 2.316 2.775 2.767
score DATE_IN_FUTURE_24_48  2.080 2.080 2.498 2.688
score DATE_IN_FUTURE_48_96  1.680 1.680 1.942 2.100
score DATE_IN_FUTURE_96_XX  1.920 1.888 2.276 2.403
score SPF_NEUTRAL  0 1.379 0 1.069
score SPF_SOFTFAIL 0 1.470 0 1.384
score SPF_FAIL     0 1.333 0 1.142

To keep this message on-topic, I am not commenting on whether the scores fairly reflect how spammy a message is. I am asking about their fairness relative to each other: HTML_OBFUSCATE_80_90 should be higher than HTML_OBFUSCATE_20_30, DATE_IN_FUTURE_96_XX should be higher than DATE_IN_FUTURE_12_24, and SPF_FAIL should be higher than SPF_SOFTFAIL.

There are a large number of sets of scores that seem quite arbitrary in their assignment. While I'm happy to see this no longer includes Bayesian scores, it is still a huge surprise. Is there an explanation guide online about how scores are chosen? Is this automated in some manner that gets incremental tests weighted based more on frequency than on severity?

I try to keep my rule tweaks minor, but my local.cf is getting bigger and bigger... how large is the typical local.cf for servers with 25-100 users?

Thank you,
Adam Katz
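P.S. For concreteness, overrides of this sort are what keep piling up in my local.cf -- something like the following (the values here are purely illustrative, not a recommendation):

score HTML_OBFUSCATE_80_90  3.5
score HTML_OBFUSCATE_90_100 4.0
score SPF_FAIL              2.0

(A single value sets the score for all four score sets; four values set them per-set, as in the listing above.)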
Re: Default SpamAssassin scores don't make sense
On Mon, 6 Nov 2006, Adam Katz wrote:

> Why do default scores not increase with severity? For example,
> SpamAssassin 3.1.7 has inconsistent progressions of default scores
> for HTML obfuscation, dates set in the future, and SPF marking:

The default scores are generated by analyzing their performance against hand-categorized corpora of actual emails. If a rule hits spam often and ham rarely, it will be given a higher score than one that hits spam often and ham occasionally.

Rule performance against real-world traffic can be counterintuitive, and the rules' relation to each other isn't necessarily a part of the analysis.

I'm sure somebody else will chime in with a relevant wiki URL...

--
John Hardin KA7OHZ  ICQ#15735746  http://www.impsec.org/~jhardin/
[EMAIL PROTECTED]  FALaholic #11174  pgpk -a [EMAIL PROTECTED]
key: 0xB8732E79 - 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
The difference between ignorance and stupidity is that the stupid
desire to remain ignorant. -- Jim Bacon
---
Tomorrow: the campaign ads stop
Re: Default SpamAssassin scores don't make sense
On Mon, Nov 06, 2006 at 04:58:37PM -0500, Adam Katz wrote:

> Why do default scores not increase with severity? For example,
> SpamAssassin 3.1.7 has inconsistent progressions of default scores
> for HTML obfuscation, dates set in the future, and SPF marking:

http://wiki.apache.org/spamassassin/HowScoresAreAssigned

The short version is that as far as SA and the perceptron (that which generates the scores) are concerned, rules are independent. There is no increase in severity: either a rule hits or it doesn't.

> weighted based more on frequency than on severity? I try to keep my
> rule tweaks minor, but my local.cf is getting bigger and bigger...
> how large is the typical local.cf for servers with 25-100 users?

Most people, I think, leave most of the scores alone, which is good and bad. FWIW, the suggested way to get the best SA performance for your mail server is to generate your own score sets from your own mails. I don't actually know of anyone who does this though.

--
Randomly Selected Tagline:
These periods are always 15 minutes shorter than I'd like them, and
probably 15 minutes longer than you'd like them. - Prof. Van Bluemel
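To make Theo's "rules are independent" point concrete, here is a toy sketch of perceptron-style training -- not SpamAssassin's actual trainer or data, just a Python illustration of how each rule's weight moves on its own, with nothing tying HTML_OBFUSCATE_80_90's weight to HTML_OBFUSCATE_20_30's:

# Toy illustration only -- NOT SpamAssassin's real trainer.
# Each message is (set of rule names that hit, is_spam flag).
RULES = ["HTML_OBFUSCATE_20_30", "HTML_OBFUSCATE_80_90", "SPF_FAIL"]

def train(messages, epochs=50, lr=0.1, threshold=5.0):
    """Perceptron-style updates: each rule's weight moves independently."""
    weights = {r: 0.0 for r in RULES}
    for _ in range(epochs):
        for hits, is_spam in messages:
            total = sum(weights[r] for r in hits)
            if (total >= threshold) != is_spam:
                # Nudge only the rules that actually fired on this message.
                delta = lr if is_spam else -lr
                for r in hits:
                    weights[r] += delta
    return weights

corpus = [({"HTML_OBFUSCATE_20_30"}, True),
          ({"SPF_FAIL"}, False),
          ({"HTML_OBFUSCATE_80_90", "HTML_OBFUSCATE_20_30"}, True)]
print(train(corpus))

A rule that rarely fires, or that fires only on mail the other rules already push past the threshold, receives few corrective updates, so its learned score can stay low no matter how "severe" its name sounds.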
Re: Default SpamAssassin scores don't make sense
On Mon, 6 Nov 2006, John D. Hardin wrote:

> The default scores are generated by analyzing their performance
> against hand-categorized corpora of actual emails. If a rule hits
> spam often and ham rarely, it will be given a higher score than one
> that hits spam often and ham occasionally.

That sounds very Bayesian... with Bayesian rules already doing that sort of logic, I would hope there is more human thinking put into score setting. The Bayes rules are very shiny and effective, but they are supposed to assist the hand-drawn filters rather than have the filters assist the Bayes rules. ... if that's the current SA thinking, I'll have to reconsider CRM114 and other better-than-Bayes systems.

> Rule performance against real-world traffic can be counterintuitive,
> and the rules' relation to each other isn't necessarily a part of
> the analysis.

That's where the human tweaking is supposed to happen; if gobs of spam flag the 80% meter of some test while no ham does, and the 90% meter is almost never hit by anything, the 90% meter should have a higher value than the 80% meter does. If the 90% meter has more ham than spam despite the 80% meter having more spam than ham, the tests need to be looked at closely rather than weighted inappropriately.

just my two cents, anyway
-Adam Katz
Re: Default SpamAssassin scores don't make sense
Theo Van Dinter wrote:

> http://wiki.apache.org/spamassassin/HowScoresAreAssigned

Thanks, that's what I was looking for.

> The short version is that as far as SA and the perceptron (that
> which generates the scores) are concerned, rules are independent.
> There is no increase in severity: either a rule hits or it doesn't.

Bayes is a perfect example of this, and is mentioned as such on the very page you referenced. Several filters, including those that I listed at the top of this thread, are indeed incremental, increasing in severity. I am shocked to hear that there is nobody moderating the automated scores (an Alan Greenspan of the anti-spam world, so to speak).

> > weighted based more on frequency than on severity? I try to keep
> > my rule tweaks minor, but my local.cf is getting bigger and
> > bigger... how large is the typical local.cf for servers with
> > 25-100 users?
>
> Most people, I think, leave most of the scores alone, which is good
> and bad. FWIW, the suggested way to get the best SA performance for
> your mail server is to generate your own score sets from your own
> mails. I don't actually know of anyone who does this though.

The wiki documentation seems to discourage modifying rule scores more than encourage it. We have a dozen or so custom rules and several dozen score modifications, plus a good number of the CustomRulesets from the wiki and the SARE collection in full use.

All low-scoring caught spam at my company gets caught in a net for my IT staff to review; the rare false positives get forwarded to the intended recipients and sa-learn'ed as ham, and the offending scores get reviewed. A good 20-50% of the low-scoring caught spam is caught only thanks to our custom filters and adjusted scores (note, these numbers are with SA 2.63; our upgrade to 3.1.7 is scheduled for before Thanksgiving while I work out the kinks).

-Adam
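P.S. For a sense of the shape these take, a typical local.cf entry looks something like the following (this particular rule name, pattern, and score are made up for illustration, not one of our real filters):

body     LOCAL_HERBAL_PITCH  /\bherbal\s+(?:viagra|supplement)s?\b/i
describe LOCAL_HERBAL_PITCH  Pitches herbal viagra or supplements
score    LOCAL_HERBAL_PITCH  1.8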
Re: Default SpamAssassin scores don't make sense
Adam Katz wrote:
> Theo Van Dinter wrote:
> > http://wiki.apache.org/spamassassin/HowScoresAreAssigned
>
> Thanks, that's what I was looking for.
>
> > The short version is that as far as SA and the perceptron (that
> > which generates the scores) are concerned, rules are independent.
> > There is no increase in severity: either a rule hits or it doesn't.
>
> Bayes is a perfect example of this, and is mentioned as such on the
> very page you referenced. Several filters, including those that I
> listed at the top of this thread, are indeed incremental, increasing
> in severity. I am shocked to hear that there is nobody moderating
> the automated scores (an Alan Greenspan of the anti-spam world, so
> to speak).

Nobody said that nobody moderates the scores. I myself spend a considerable amount of time studying them. However, none of us is so rash as to make adjustments just to make the results look better. 99% of the time, investigations into illogical scores turn up real-world evidence that explains them.

Let's take a brief look at your SPF example. You'd expect SPF_FAIL to have a higher score than SPF_SOFTFAIL. However, the real world shows otherwise. Let's rip the results out of STATISTICS-set3.txt:

OVERALL%   SPAM%    HAM%     S/O    RANK   SCORE  NAME
  3.437    4.8942   0.0396   0.992  0.80   1.38   SPF_SOFTFAIL
  2.550    3.5717   0.1676   0.955  0.53   1.14   SPF_FAIL

Look at the S/O for each. This represents the fraction of the mail matched by the rule that is actually spam, where 1.00 means 100% of the matching messages were spam. Notice how the S/O of SPF_FAIL is actually LOWER than SPF_SOFTFAIL's?

Why? Probably because there are more aggressive admins publishing records with -all without thinking about their whole network. The more cautious folks, who have spent a lot of time thinking about their network, are more likely to realize they might have missed something and use ~all (softfail). Human behavior is in no way linear, and SPF here is a result of the behavior of the admin publishing the records. My explanation is a guess, but it makes sense if you think about the general behavior of a cautious admin compared to a rabid one.

Now let's look at DATE_IN_FUTURE:

OVERALL%   SPAM%    HAM%     S/O    RANK   SCORE  NAME
  1.605    2.2815   0.0264   0.989  0.75   1.96   DATE_IN_FUTURE_03_06
  0.926    1.2926   0.0716   0.948  0.56   1.67   DATE_IN_FUTURE_06_12
  1.986    2.8309   0.0151   0.995  0.81   2.77   DATE_IN_FUTURE_12_24
  0.260    0.3676   0.0075   0.980  0.53   2.69   DATE_IN_FUTURE_24_48
  0.089    0.1252   0.0038   0.971  0.40   2.10   DATE_IN_FUTURE_48_96
  0.245    0.3474   0.0075   0.979  0.52   2.40   DATE_IN_FUTURE_96_XX

Here again we see non-linearity in the S/O performance of the real-world data. Note that 06_12 has the lowest S/O of the lot, and, imagine that, it got the lowest score too. There's some degree of non-fit here: DATE_IN_FUTURE_96_XX, for example, scores higher than its S/O alone would suggest. A study of the actual corpus itself would likely show that this rule is more likely to match spam that has very few other rules matching, hence the higher score. This is a case of that interaction-with-other-rules thing from my last message.
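If it helps to see the arithmetic, S/O is just the rule's spam hit-rate divided by its total hit-rate. A quick sketch in Python, using the SPF rows above (this assumes the spam and ham corpora are weighted equally, which is consistent with the published numbers):

def s_over_o(spam_pct, ham_pct):
    # Fraction of the mail this rule matched that was spam.
    return spam_pct / (spam_pct + ham_pct)

print(round(s_over_o(4.8942, 0.0396), 3))  # SPF_SOFTFAIL -> 0.992
print(round(s_over_o(3.5717, 0.1676), 3))  # SPF_FAIL     -> 0.955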
HTML_OBFUSCATE is a bit more complicated:

OVERALL%   SPAM%    HAM%     S/O    RANK   SCORE  NAME
  0.637    0.9048   0.0132   0.986  0.66   1.45   HTML_OBFUSCATE_05_10
  0.921    1.3128   0.0075   0.994  0.74   1.77   HTML_OBFUSCATE_10_20
  0.671    0.9582   0.0000   1.000  0.70   3.40   HTML_OBFUSCATE_20_30
  0.406    0.5801   0.0000   1.000  0.63   2.86   HTML_OBFUSCATE_30_40
  0.198    0.2836   0.0000   1.000  0.51   2.64   HTML_OBFUSCATE_40_50
  0.242    0.3458   0.0000   1.000  0.54   2.03   HTML_OBFUSCATE_50_60
  0.081    0.1155   0.0000   1.000  0.40   1.65   HTML_OBFUSCATE_60_70
  0.055    0.0784   0.0000   1.000  0.38   1.47   HTML_OBFUSCATE_70_80
  0.012    0.0178   0.0000   1.000  0.31   0.98   HTML_OBFUSCATE_80_90
  0.004    0.0057   0.0000   1.000  0.29   0.00   HTML_OBFUSCATE_90_100

Here the S/Os show a clear upward trend. However, the hit-rates at the upper end are very low. That's probably what's suppressing the scores of 60_70 and higher. They just don't hit enough mail to be relevant.
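To put those hit-rates in perspective, here is the back-of-the-envelope arithmetic (the corpus size is hypothetical; the published statistics give only percentages):

# How many spam messages would each rule hit in a 100,000-message corpus?
corpus_spam = 100_000  # hypothetical corpus size
for name, spam_pct in [("HTML_OBFUSCATE_20_30", 0.9582),
                       ("HTML_OBFUSCATE_80_90", 0.0178),
                       ("HTML_OBFUSCATE_90_100", 0.0057)]:
    print(name, round(corpus_spam * spam_pct / 100), "hits")
# -> roughly 958, 18, and 6 messages: the rarest rules give the
#    score generator almost no examples to learn from.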
Re: Default SpamAssassin scores don't make sense
> ... That's where the human tweaking is supposed to happen; if gobs
> of spam flag the 80% meter of some test while no ham does, and the
> 90% meter is almost never hit by anything, the 90% meter should have
> a higher value than the 80% meter does. If the 90% meter has more
> ham than spam despite the 80% meter having more spam than ham, the
> tests need to be looked at closely rather than weighted
> inappropriately.
>
> just my two cents, anyway
> -Adam Katz

Here one of your own examples pops up: SPF_FAIL vs. SPF_SOFTFAIL. In the current state of the world, *most* softfail results are actual forgeries, but *most* hard fail results are administrator or user error. So SPF_SOFTFAIL is a better spam sign than SPF_FAIL. These things often can and do make sense when a rational explanation is sought (though it can be very far from obvious at times).

Hopefully, as administrators learn, things like SPF, DK and/or DKIM will become more useful (publishing ~all and signing only some of your mail are both serious dilutions of what the technologies have to offer).

Paul Shupak
[EMAIL PROTECTED]
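For readers who haven't published SPF records: the difference under discussion is the last mechanism in the record's TXT string. A hard-failing record looks something like this (the domain and address range here are illustrative):

example.com.  IN TXT  "v=spf1 a mx ip4:192.0.2.0/24 -all"

while the cautious variant ends in ~all:

example.com.  IN TXT  "v=spf1 a mx ip4:192.0.2.0/24 ~all"

-all tells receivers "mail from anywhere else is forged; reject it"; ~all only says "mail from anywhere else is suspect," which is what SpamAssassin's SPF_SOFTFAIL picks up.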