Re: A different approach to scoring spamassassin hits
On Friday 29 June 2007, Tom Allison wrote:
> It would be the Bayes process that determines the effective number of points you assign for each HIT based on what it's learned about it from you. So the tags ADVANCE_FEE_1 and ADVANCE_FEE_2 would be represented as tokens of the format:
>
> ADVANCE_FEE_1=YES or NO
> ADVANCE_FEE_2=YES or NO
>
> and each of these tokens would then be evaluated based on your learning process. Sort of like a multiple linear regression analysis, where you simply start dropping terms with low coefficients to simplify the calculation.

Interesting idea. You have a bit of a chicken-and-egg problem at the start, until some learning takes place in the system.

-- 
John Andersen
Re: A different approach to scoring spamassassin hits
On Jun 30, 2007, at 1:20 AM, Marc Perkel wrote:
> Tom Allison wrote:
>> For some years now there has been a lot of effective spam filtering using statistical approaches with variations on Bayesian theory. Some of these are inverse chi-square modifications to Naive Bayes, and CRM114 and other languages have been developed to improve the statistical scoring of spam. For all statistical processes the spamicity is always between 0 and 1.
>>
>> [snip]
>>
>> Many thanks to those of you who have read this far for your patience and consideration.
>
> Tom, I suggested something similar to that years ago and I'd still like to see it tried out. I wonder what would happen if you stripped out the body and ran Bayes just on the headers and the rules and let Bayes figure it out. You do have to have some points to start with to get Bayes pointed in the right direction, but you could use blacklists and whitelists to do Bayes training. It also needs more rules to identify ham, and not just rules to identify spam.

I was under the belief that there were ham-centric tests that would result in negative scores. Ham doesn't try to be evasive, so it's pretty easy to identify. Without SA tagging, much of it falls to 0.5, and whitelisting would capture most of the exceptions.

As for headers-only testing -- the first five lines of stock spam are very telling...

My question about SA concerns PerMsgStatus (I think). Is this the place to retrieve all the rule information? I know that today you can get a list of all the rules that HIT, but is that where you would look to find all the rules that were attempted? Or is there a better place for it?
Re: A different approach to scoring spamassassin hits
On Jun 30, 2007, at 4:46 AM, John Andersen wrote:
> On Friday 29 June 2007, Tom Allison wrote:
>> It would be the Bayes process that determines the effective number of points you assign for each HIT based on what it's learned about it from you. So the tags ADVANCE_FEE_1 and ADVANCE_FEE_2 would be represented as tokens of the format:
>>
>> ADVANCE_FEE_1=YES or NO
>> ADVANCE_FEE_2=YES or NO
>>
>> and each of these tokens would then be evaluated based on your learning process. Sort of like a multiple linear regression analysis, where you simply start dropping terms with low coefficients to simplify the calculation.
>
> Interesting idea. You have a bit of a chicken-and-egg problem at the start, until some learning takes place in the system.

For a purely Bayesian filter this is always the case. But I have found through mailing lists and personal experience that this can be mitigated through a variety of approaches.

The first approach is to implement SA after you have trained it on some past corpus of mail you've captured. Opinions on how many messages you need to be effective vary from tens to thousands. This is strictly a YMMV issue. Personally, I use train-on-error (never auto-train or train on everything, but train only the minimum needed to get it right), and with that, 10 emails get me above 90%. But my scoring is a little vague -- I use a ternary Yes, No, Maybe scoring process. If I exclude the Maybe I have 100% success in very short order. Including Maybe I have 98% success after training on ~100 messages. But the worst is over in the first day.

Another method would be to simply seed the data from a SQL script to preload certain tokens and values. Kind of a hack in my opinion, but it would be effective, and any discrepancies would be quickly resolved by training.

In the case of SA, I would seed the rules into the tables for the simplest, yet effective, results.
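The rule-hits-as-tokens idea discussed above can be sketched in a few lines. This is only an illustration under stated assumptions: the `TokenStats` class, its smoothing, and the plain naive-Bayes combination are hypothetical stand-ins, not SpamAssassin's Bayes implementation and not Tom's own engine.

```python
import math
from collections import defaultdict


def rule_tokens(all_rules, hits):
    """Turn rule results into tokens such as 'ADVANCE_FEE_1=YES'."""
    return [f"{rule}={'YES' if rule in hits else 'NO'}" for rule in sorted(all_rules)]


class TokenStats:
    """Hypothetical per-token spam/ham counts with a smoothed probability."""

    def __init__(self):
        self.spam = defaultdict(int)
        self.ham = defaultdict(int)
        self.nspam = 0
        self.nham = 0

    def train(self, tokens, is_spam):
        # Count each distinct token once per message.
        counts = self.spam if is_spam else self.ham
        for tok in set(tokens):
            counts[tok] += 1
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def token_prob(self, tok):
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        p_spam = (self.spam[tok] + 1) / (self.nspam + 2)
        p_ham = (self.ham[tok] + 1) / (self.nham + 2)
        return p_spam / (p_spam + p_ham)

    def spamicity(self, tokens):
        # Naive Bayes combination of the token probabilities, done in log space.
        log_odds = 0.0
        for tok in set(tokens):
            p = self.token_prob(tok)
            log_odds += math.log(p) - math.log(1 - p)
        log_odds = max(-700.0, min(700.0, log_odds))  # avoid overflow in exp()
        return 1 / (1 + math.exp(-log_odds))
```

Seeding from a captured corpus then amounts to calling `train()` once per stored message before the filter goes live; the per-rule "points" fall out of `token_prob()` instead of being set by hand.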
config clarification
For configuration options listed in perldoc Mail::SpamAssassin, can I put the settings into local.cf? Mail::SpamAssassin::Conf says yes, but it doesn't say whether that applies to the args for Mail::SpamAssassin->new().

And what does 'save_pattern_hits' get me that I otherwise wouldn't have?
Re: A different approach to scoring spamassassin hits
> You have a bit of a chicken and egg problem at the start. Until some learning takes place in the system.

Two possibilities. The rules exist and have scores. Assume they are maintained, for whatever reason.

1. Until Bayes has enough info to kick in, classification is done by the scores. Then when Bayes kicks in, the scores turn off (insofar as adding to the message score; they might still show up as tokens in the message that Bayes will process).

2. Divide all the scores by 10 or 20, then leave them on. Pretty soon Bayes will override almost any reasonable score combination.

BTW, while ham rules are possible, SA has almost no ham rules; perhaps two or so. Spammers long ago found they could write their spams to match ham rules and thus bypass SA. Thus, no ham rules, no spammer workarounds. Of course personal or site-specific ham rules will generally still work, since they will not be public knowledge and spammers won't be able to target them.

I suspect you can find all rule names in PerMsgStatus. However, the latest SA versions have implemented a 'check' plugin that actually runs the rules and accumulates the score. The rule running was moved to a plugin so that people could, at least in theory, change the order or the way that rules are run. It sounds like that is what you want to do, so a modified Check plugin may well be the way to go.

I don't understand, though, why you are interested in the names of all rules run; I don't see what it buys you. Currently ALL rules are run, unless short-circuiting is in effect, and by default it mostly isn't. In any case, if a rule doesn't hit on a message, the name of the rule is probably irrelevant. It might have missed because the message is ham, but it even more likely missed because it simply targets a different kind of spam. So assuming that rules not hit === good tokens is unlikely to be the case.

You should be able to get Bayes to scan the rule names hit pretty easily. Bayes is just about the last rule; I think AWL comes after it. You might want to change that order, which I suspect you can do in the Check plugin. You could then modify the Check code to push the rule names into a special header line before calling Bayes. This could probably be done in Check, and could certainly be done by a one-off plugin that you wrote. It would be called by a special rule just before Bayes is called, and again, it would add the current rule names to a special header Bayes could see.

Of course you have to modify Check to drop out the scores for the non-Bayes rules. Either that or rescore all of the rules.

Loren
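The "special header" step Loren describes can be illustrated outside SpamAssassin. The sketch below uses Python's standard `email` package only to show the shape of the idea; the header name `X-Rule-Hits` and the function `add_rule_header` are hypothetical, and a real implementation would live in a SpamAssassin plugin written in Perl.

```python
from email import message_from_string


def add_rule_header(raw_message, rule_hits, header="X-Rule-Hits"):
    """Return the parsed message with the rule names that hit added as one header.

    The header name is an assumption made for this sketch; any name that the
    downstream classifier tokenizes would do.
    """
    msg = message_from_string(raw_message)
    del msg[header]  # drop any copy a sender may have forged; no error if absent
    msg[header] = " ".join(sorted(rule_hits))
    return msg
```

A downstream statistical filter would then see the rule names as ordinary header tokens, for example `add_rule_header(raw, {"HTML_MESSAGE", "DCC_CHECK"})["X-Rule-Hits"]`.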
Confused about which bayes db gets used with spamc?
Hello, I run spamc from my procmail on incoming messages. Does this mean that all messages are using root bayes_db? If so why do the clients have stuff updated in their db in their home directories? I am trying to figure this out so I can do sa-learn correctly. Thanks, CP -- View this message in context: http://www.nabble.com/Confused-about-which-bayes-db-gets-used-with-spamc--tf4004657.html#a11373245 Sent from the SpamAssassin - Users mailing list archive at Nabble.com.
Re: Confused about which bayes db gets used with spamc?
On Sat, Jun 30, 2007 at 05:41:19AM -0700, CptanPanic wrote:
> Hello, I run spamc from my procmail on incoming messages. Does this mean that all messages are using root bayes_db?

No.

> If so why do the clients have stuff updated in their db in their home directories?

Because spamc (actually spamd) does a setuid to the user.

> I am trying to figure this out so I can do sa-learn correctly.

With your setup (same as mine) you should run sa-learn as the user, or use the -u or --username option to set the user.

> Thanks, CP

Cheers,
-- 
Bob McClure, Jr.
Bobcat Open Systems, Inc.
[EMAIL PROTECTED]
http://www.bobcatos.com
The Lord says: These people come near to me with their mouth and honor me with their lips, but their hearts are far from me. Their worship of me is made up only of rules taught by men. Therefore once more I will astound these people with wonder upon wonder; the wisdom of the wise will perish, the intelligence of the intelligent will vanish. Isaiah 29:13-14 (NIV)
Re: A different approach to scoring spamassassin hits
On Jun 30, 2007, at 8:07 AM, Loren Wilton wrote:
>> You have a bit of a chicken and egg problem at the start. Until some learning takes place in the system.
>
> Two possibilities. The rules exist and have scores. Assume they are maintained, for whatever reason.
>
> 1. Until Bayes has enough info to kick in, classification is done by the scores. Then when Bayes kicks in, the scores turn off (insofar as adding to the message score; they might still show up as tokens in the message that Bayes will process).
>
> 2. Divide all the scores by 10 or 20, then leave them on. Pretty soon Bayes will override almost any reasonable score combination.
>
> BTW, while ham rules are possible, SA has almost no ham rules; perhaps two or so. Spammers long ago found they could write their spams to match ham rules and thus bypass SA. Thus, no ham rules, no spammer workarounds. Of course personal or site-specific ham rules will generally still work, since they will not be public knowledge and spammers won't be able to target them.
>
> I suspect you can find all rule names in PerMsgStatus. However, the latest SA versions have implemented a 'check' plugin that actually runs the rules and accumulates the score. The rule running was moved to a plugin so that people could, at least in theory, change the order or the way that rules are run. It sounds like that is what you want to do, so a modified Check plugin may well be the way to go.
>
> I don't understand, though, why you are interested in the names of all rules run; I don't see what it buys you. Currently ALL rules are run, unless short-circuiting is in effect, and by default it mostly isn't. In any case, if a rule doesn't hit on a message, the name of the rule is probably irrelevant. It might have missed because the message is ham, but it even more likely missed because it simply targets a different kind of spam. So assuming that rules not hit === good tokens is unlikely to be the case.

But in Bayes, you can't score on the absence of a token. Just because the email I'm writing does not contain a certain word does not mean it is good. Listing ALL rules run, with a binary YES/NO indication applied to each one, would permit you to accrue points for both the presence and the absence of a specific rule. This would also allow you to start applying pro-ham rules.

But you may have a point that "rules not hit" is sufficient for determining good tokens, in the same manner that viagra is bad and not having viagra permits the email to score on the other tokens available.

To further prove this out, the practice of spammers (who I'm sure are reading this list) is to try to apply enough skew to the Bayes to push it low and to skip enough rules to keep from scoring any hits -- the net effect is to come up with Unsure email (I work in a ternary system). Under pure Bayesian statistics, the cutoff points for ham/spam tend to move pretty quickly from a nominal 0.3/0.7 to 0.3/0.5, giving the entire probability range of 0.500 to 1.00 over to spam and 0.00 to 0.300 (or even lower) specifically to ham, with a belt of uncertainty in the middle.

And after typing all this I'm thinking you might be right. But part of this approach is to run all these rules in YES/NO fashion and see if the probability is significant. For example: if I tested for SOME_TEST=NO and found it was scoring a probability of ~0.500, then it's indisputable that you are right.

The only area of exception to this would be some kind of AWL factor rather than a hard-coded AWL override. Creative regex can handle this by capturing the email addresses in From: and providing a very strong probability for that. Not a whitelist, but an indication. Not sure; I haven't considered it, as I never found AWL to be really useful compared against the impact of Bayes on headers.

As for the start-up effectiveness, there are a variety of ways to do this. I consider this similar to installing Linux. It might be harder to do than buying a computer with Windows installed for you, but the long-term benefits outweigh the short-term gains, and how often do you really install Linux or SpamAssassin? You can always seed the data from captured emails.

Thank you for the information on Check. I will look into that and see if I can come up with something that will do the trick. I have to confess I'm coming into this backwards: I wrote a Bayesian spam filter and then started looking into SpamAssassin, so my Bayes statistical engine is not SpamAssassin's. But the results will be the same for either approach (I hope) if you simply push rules in as metadata tokens into the statistical process.
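The ternary verdict and the "is this token telling me anything" test from the message above can be written down directly. A minimal sketch, assuming the nominal 0.3/0.7 cutoffs mentioned in the text; the function names and the 0.1 margin are illustrative choices, not values from any real filter.

```python
def ternary_verdict(spamicity, ham_cutoff=0.3, spam_cutoff=0.7):
    """Map a spamicity in [0, 1] to 'ham', 'spam' or 'unsure'."""
    if not 0.0 <= spamicity <= 1.0:
        raise ValueError("spamicity must be between 0 and 1")
    if spamicity >= spam_cutoff:
        return "spam"
    if spamicity <= ham_cutoff:
        return "ham"
    return "unsure"


def is_informative(token_probability, margin=0.1):
    """A token whose probability sits near 0.5 carries almost no evidence.

    The margin is an arbitrary illustrative threshold.
    """
    return abs(token_probability - 0.5) >= margin
```

Moving the spam cutoff from 0.7 down to 0.5, as described above, shrinks the "unsure" belt: `ternary_verdict(0.6)` is unsure with the nominal cutoffs but spam with `spam_cutoff=0.5`. A SOME_TEST=NO token scoring about 0.5 would fail `is_informative()`, which is the outcome that would support Loren's point.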
Re: config clarification
On Sat, 2007-06-30 at 07:07 -0400, Tom Allison wrote:
> For configuration options listed in perldoc Mail::SpamAssassin, can I put the settings into local.cf? Mail::SpamAssassin::Conf says yes, but it doesn't say whether that applies to the args for Mail::SpamAssassin->new().

According to the perldoc:

    If none of rules_filename, site_rules_filename, userprefs_filename, or config_text is set, the Mail::SpamAssassin module will search for the configuration files in the usual installed locations using the below variable definitions which can be passed in.

    PREFIX
        Used as the root for certain directory paths such as:
        '__prefix__/etc/mail/spamassassin'
        '__prefix__/etc/spamassassin'
        Defaults to /usr.

    DEF_RULES_DIR
        Location where the default rules are installed. Defaults to /usr/share/spamassassin.

    LOCAL_RULES_DIR
        Location where the local site rules are installed. Defaults to /etc/mail/spamassassin.

If your local.cf is in /etc/mail/spamassassin, then apparently the answer is yes. My understanding is that everything in that directory gets read.

-- 
Lindsay Haisley       | In an open world,  | PGP public key
FMP Computer Services | who needs Windows  | available at
512-259-1190          | or Gates           | http://pubkeys.fmp.com
http://www.fmp.com    |                    |
Re: A different approach to scoring spamassassin hits
Tom Allison wrote:
> On Jun 30, 2007, at 1:20 AM, Marc Perkel wrote:
>> Tom Allison wrote:
>>> For some years now there has been a lot of effective spam filtering using statistical approaches with variations on Bayesian theory. Some of these are inverse chi-square modifications to Naive Bayes, and CRM114 and other languages have been developed to improve the statistical scoring of spam. For all statistical processes the spamicity is always between 0 and 1.
>>>
>>> [snip]
>>>
>>> Many thanks to those of you who have read this far for your patience and consideration.
>>
>> Tom, I suggested something similar to that years ago and I'd still like to see it tried out. I wonder what would happen if you stripped out the body and ran Bayes just on the headers and the rules and let Bayes figure it out. You do have to have some points to start with to get Bayes pointed in the right direction, but you could use blacklists and whitelists to do Bayes training. It also needs more rules to identify ham, and not just rules to identify spam.
>
> I was under the belief that there were ham-centric tests that would result in negative scores. Ham doesn't try to be evasive, so it's pretty easy to identify. Without SA tagging, much of it falls to 0.5, and whitelisting would capture most of the exceptions.
>
> As for headers-only testing -- the first five lines of stock spam are very telling...
>
> My question about SA concerns PerMsgStatus (I think). Is this the place to retrieve all the rule information? I know that today you can get a list of all the rules that HIT, but is that where you would look to find all the rules that were attempted? Or is there a better place for it?

There are some ham tests in SA, but not nearly enough.
Re: A different approach to scoring spamassassin hits
Loren Wilton wrote:
>> You have a bit of a chicken and egg problem at the start. Until some learning takes place in the system.
>
> Two possibilities. The rules exist and have scores. Assume they are maintained, for whatever reason.
>
> 1. Until Bayes has enough info to kick in, classification is done by the scores. Then when Bayes kicks in, the scores turn off (insofar as adding to the message score; they might still show up as tokens in the message that Bayes will process).
>
> 2. Divide all the scores by 10 or 20, then leave them on. Pretty soon Bayes will override almost any reasonable score combination.
>
> BTW, while ham rules are possible, SA has almost no ham rules; perhaps two or so. Spammers long ago found they could write their spams to match ham rules and thus bypass SA. Thus, no ham rules, no spammer workarounds. Of course personal or site-specific ham rules will generally still work, since they will not be public knowledge and spammers won't be able to target them.
>
> I suspect you can find all rule names in PerMsgStatus. However, the latest SA versions have implemented a 'check' plugin that actually runs the rules and accumulates the score. The rule running was moved to a plugin so that people could, at least in theory, change the order or the way that rules are run. It sounds like that is what you want to do, so a modified Check plugin may well be the way to go.
>
> I don't understand, though, why you are interested in the names of all rules run; I don't see what it buys you. Currently ALL rules are run, unless short-circuiting is in effect, and by default it mostly isn't. In any case, if a rule doesn't hit on a message, the name of the rule is probably irrelevant. It might have missed because the message is ham, but it even more likely missed because it simply targets a different kind of spam. So assuming that rules not hit === good tokens is unlikely to be the case.
>
> You should be able to get Bayes to scan the rule names hit pretty easily. Bayes is just about the last rule; I think AWL comes after it. You might want to change that order, which I suspect you can do in the Check plugin. You could then modify the Check code to push the rule names into a special header line before calling Bayes. This could probably be done in Check, and could certainly be done by a one-off plugin that you wrote. It would be called by a special rule just before Bayes is called, and again, it would add the current rule names to a special header Bayes could see.
>
> Of course you have to modify Check to drop out the scores for the non-Bayes rules. Either that or rescore all of the rules.

Just a thought - what if we had some central servers where the SA rule hits and scores were reported in real time for some sort of live scoring, analysis, or dynamic adjusting? Just thinking out loud here.
plugins
What is the best way to check what plugins SA is using?
Re: user_prefs
On Fri, 29 Jun 2007 at 19:43 -0400, [EMAIL PROTECTED] confabulated:
> OK, thanks. I'm not using spamassassin or spamd. I'm using Mail::SpamAssassin in a perl script. What does '-x' do for Mail::SpamAssassin?

Nothing, since you are calling SA directly from Perl. You should set dont_copy_prefs to 1 in your call to:

$t = Mail::SpamAssassin->new();

Taken from: http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin.html

...
dont_copy_prefs
    If set to 1, the user preferences file will not be created if it doesn't already exist. (default: 0)
Re: A different approach to scoring spamassassin hits
On 6/29/07, Tom Allison [EMAIL PROTECTED] wrote:
> The thought I had, and have been working on for a while, is changing how the scoring is done. Rather than making Bayes a part of the scoring process, make the scoring process a part of the Bayes statistical engine. As an example, you would simply feed into the Bayesian process, as tokens, the indications of scoring hits (binary yes/no), which would be examined next to the other tokens in the message.

There are a few problems with this.

(1) It assumes that Bayesian (or similar) classification is more accurate than SA's scoring system. Either that, or you're willing to give up accuracy in the name of removing all those confusing knobs you don't want to touch, but it would seem to me to be better to have the knobs and just not touch them.

(2) For many SA rules you would be, in effect, double-counting some tokens. An SA scoring rule that matches a phrase, for example, is effectively matching a collection of tokens that are also being fed individually to the Bayes engine. In theory, you should not second-guess the system by passing such compound tokens to Bayes; instead it should be allowed to learn what combinations of tokens are meaningful when they appear together. (It might be worthwhile, though, to add tokens that are not otherwise present in the message, such as the results of network tests.)

(3) It introduces a bootstrapping problem, as has already been noted. Everyone has to train the engine and re-train it when new rules are developed.

I've thought of a few more, but they all have to do with the benefits of having all those knobs, and if you've already adopted the basic premise that they should be removed, there doesn't seem to be any reason to argue that part.

To summarize my opinion: if what you want is to have a Bayesian-type engine make all the decisions, then you should install a Bayesian engine and work on ways to feed it the right tokens; you should not install SpamAssassin and then work on ways to remove the scoring.
DNS list service to detect the registrar barrier
OK - tell me if this is useful. I created a DNS list that you can pass a host name to and get information about where the registrar barrier is. You can use it as follows:

dig host.rb.junkemailfilter.com

Examples:

dig perkel.com.rb.junkemailfilter.com - returns 127.0.0.1
dig perkel.co.uk.rb.junkemailfilter.com - returns 127.0.0.2

A single-level domain returns 127.0.0.1, a two-level domain returns 127.0.0.2, and a three-level domain returns 127.0.0.3.

I'm using it for some statistical work, but I'm wondering if anyone else finds this useful. I'm thinking about using it to forward spam to abuse@domain to report spam.
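One way such an answer might be consumed is sketched below. It rests on an interpretation that the post does not spell out: the last octet of the answer is read as the number of labels that sit above the registrar barrier, so the registered domain is that many labels plus one. The function names are hypothetical, and no DNS lookup is performed here; the answer string is passed in.

```python
def query_name(hostname, zone="rb.junkemailfilter.com"):
    """Build the name to look up, e.g. 'perkel.co.uk.rb.junkemailfilter.com'."""
    return f"{hostname.rstrip('.')}.{zone}"


def registered_domain(hostname, answer):
    """Cut a host name down to its registered domain using a 127.0.0.N answer.

    Assumption: N counts the labels above the registrar barrier
    (1 for 'com', 2 for 'co.uk').
    """
    octets = answer.split(".")
    if len(octets) != 4 or octets[:3] != ["127", "0", "0"]:
        raise ValueError(f"unexpected answer: {answer!r}")
    levels = int(octets[3])
    labels = hostname.rstrip(".").lower().split(".")
    if levels < 1 or len(labels) <= levels:
        raise ValueError("host name is too short for the reported level")
    return ".".join(labels[-(levels + 1):])


def abuse_address(hostname, answer):
    """Address to which a spam report for this host could be forwarded."""
    return f"abuse@{registered_domain(hostname, answer)}"
```

With the examples from the post, `registered_domain("www.perkel.co.uk", "127.0.0.2")` gives `perkel.co.uk`, which is the domain one would want for an abuse@ report.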
Re: A different approach to scoring spamassassin hits
On Jun 30, 2007, at 2:55 PM, Bart Schaefer wrote:
> On 6/29/07, Tom Allison [EMAIL PROTECTED] wrote:
>> The thought I had, and have been working on for a while, is changing how the scoring is done. Rather than making Bayes a part of the scoring process, make the scoring process a part of the Bayes statistical engine. As an example, you would simply feed into the Bayesian process, as tokens, the indications of scoring hits (binary yes/no), which would be examined next to the other tokens in the message.
>
> There are a few problems with this.
>
> (1) It assumes that Bayesian (or similar) classification is more accurate than SA's scoring system. Either that, or you're willing to give up accuracy in the name of removing all those confusing knobs you don't want to touch, but it would seem to me to be better to have the knobs and just not touch them.

I know that without SA you can have 99.9% accuracy with pure Bayesian classification. But there are specific non-Bayes things made visible through SpamAssassin rules that a typical Bayes process can't catch (very well or at all).

The whole issue of knobs is moot under a statistical approach, because each user's scoring will determine the real importance of each particular rule hit.

> (2) For many SA rules you would be, in effect, double-counting some tokens. An SA scoring rule that matches a phrase, for example, is effectively matching a collection of tokens that are also being fed individually to the Bayes engine. In theory, you should not second-guess the system by passing such compound tokens to Bayes; instead it should be allowed to learn what combinations of tokens are meaningful when they appear together.

Bayes does not match a phrase, only words. At least that is what most Bayes filters do. There are some approaches that use multiple words, but not a phrase. Therefore I think the intersection of Bayes and SpamAssassin rules is going to be small.

> (It might be worthwhile, though, to add tokens that are not otherwise present in the message, such as the results of network tests.)

This is what I'm interested in and mentioned in paragraph one. There are a lot of things you can do with SpamAssassin that Bayes alone will never do. It is exactly this type of work that I think would be most interesting to pursue.

> (3) It introduces a bootstrapping problem, as has already been noted. Everyone has to train the engine and re-train it when new rules are developed.
>
> I've thought of a few more, but they all have to do with the benefits of having all those knobs, and if you've already adopted the basic premise that they should be removed, there doesn't seem to be any reason to argue that part.
>
> To summarize my opinion: if what you want is to have a Bayesian-type engine make all the decisions, then you should install a Bayesian engine and work on ways to feed it the right tokens; you should not install SpamAssassin and then work on ways to remove the scoring.

It makes sense to take this approach. However, it would not make sense to try to reinvent the fantastic amount of useful work that has come from SpamAssassin. That would take a very long time. SpamAssassin has some really great ways of finding the right tokens. Why would I consider trying to duplicate all that effort?
Re: A different approach to scoring spamassassin hits
> And after typing all this I'm thinking you might be right. But part of this approach is to run all these rules in YES/NO fashion and see if the probability is significant. For example: if I tested for SOME_TEST=NO and found it was scoring a probability of ~0.500, then it's indisputable that you are right.

Well, this still doesn't make any real sense to me; it seems equivalent to the attempts at Bayes poison that spammers stick into their spams: a bunch of words totally unrelated to the mail, in the hope of outweighing the useful terms. Now their trick works as a good spam indication, because the words they pick aren't common in my ham mails, so it is really a good spam indication rather than poison. I'm not immediately convinced that will hold for the usage you intend. Maybe. Maybe not.

However, if you want to do this, remember that Bayes works on tokens and has a tokenizer. So SOME_RULE=YES is probably either two or three tokens, and you will end up scoring on the probability of YES and NO, along with the frequency of the rule names, which will be 1. So you probably want to use NO_SOME_RULE and YES_OTHER_RULE or the like when you build the insert list.

Again, though, I'm not sure I see the point in the yes and no factors; the presence or absence of a word in the mail seems like a pretty good yes/no indication to me. Were I doing it, I'd try it both ways and see if there is any difference in results.

Loren
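Loren's tokenizer warning is easy to demonstrate. The sketch below uses a deliberately simple, hypothetical tokenizer that splits on anything other than letters, digits and underscores; SpamAssassin's real Bayes tokenizer is more involved, so this only shows why the choice of token format matters.

```python
import re


def naive_tokenize(text):
    """Hypothetical tokenizer: split on every character that is not a letter, digit or underscore."""
    return [tok for tok in re.split(r"[^A-Za-z0-9_]+", text) if tok]


def rule_result_token(rule, hit):
    """Encode a rule result as one unbroken token, e.g. 'NO_SOME_RULE'."""
    return f"{'YES' if hit else 'NO'}_{rule}"
```

Under this tokenizer `SOME_RULE=YES` splits into `SOME_RULE` and `YES`, so the classifier would learn the frequency of the bare word `YES` across all rules, while `NO_SOME_RULE` survives as a single token tied to one rule.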
Re: A different approach to scoring spamassassin hits
> Just a thought - what if we had some central servers where the SA rule hits and scores were reported in real time for some sort of live scoring, analysis, or dynamic adjusting? Just thinking out loud here.

Something I've wanted to see for about 4 years now; i.e., as long as I've been using SA. You could think of it as a super mass-check in real time.

There are arguments that large hosting companies wouldn't let the data out because it would compromise their mail stream. That would of course be true if they sent the mail. If they just send the cumulative scores over the last hour or whatever, I don't see that being true, although doubtless some would still consider that to be the case and wouldn't send it.

However, I'd bet that enough info would arrive from all parts of the globe to be able to do rescoring runs weekly, or maybe even every few hours, and publish new scores, pretty much like the virus guys publish new signatures pretty quickly.

There is the question of how to integrate the new scores with local rescoring, and even with local rules that were scored based on the original score of the stock rules. I think there are a half-dozen solutions to this that would be moderately easy to implement. The most obvious would be sending score updates in the form of either a multiplier or an adder to the original rule score rather than as a raw score; this would preserve local overrides while still adjusting the score to match daily hit rates. (Don't bother me with the obvious point of adjusting zeroed scores off of zero. That is an exception that simply has to be handled in the score readjustment; it isn't a concept-breaker.)

If the rescoring client at a site wanted to be fancy, it could even send an optional email to the mail admin telling him that some local rule is bad for his health, or that some zeroed rule has now become useful and should be unzeroed. Or the like.

Loren
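The multiplier form of a score update can be sketched as follows. Everything here is hypothetical: the function name, the dictionaries, and the choice to leave zeroed rules at zero are illustrations of the idea above, not an existing SpamAssassin mechanism.

```python
def apply_score_update(stock_scores, local_overrides, multipliers):
    """Apply published multipliers on top of whatever score is locally in effect.

    A local override replaces the stock score as the base, so the site's
    relative weighting is preserved. Rules whose base score is zero stay
    disabled; re-enabling them is treated as a separate, explicit decision.
    """
    effective = {}
    for rule, stock in stock_scores.items():
        base = local_overrides.get(rule, stock)
        if base == 0:
            effective[rule] = 0.0
        else:
            effective[rule] = round(base * multipliers.get(rule, 1.0), 3)
    return effective
```

An adder would work the same way with `base + delta` in place of the product. Either form leaves the zeroed-rule case to be handled explicitly, as the message notes.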
Re: Spam PDF
arni wrote:
> [snip snap]
> I looked for the lowest-scoring email of the past 2 days (I don't save them longer); this is the one:
>
> X-Spam-Status: Yes, score=10.7 required=5.0 tests=BAYES_99,DCC_CHECK,
>     DKIM_POLICY_SIGNSOME,HTML_MESSAGE,LOGINHASH1,LOGINHASH2,MIME_HTML_MOSTLY
>     autolearn=no version=3.2.0
> X-Spam-Report:
>     * 5.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
>     *     [score: 1.]
>     * 0.0 DKIM_POLICY_SIGNSOME Domain Keys Identified Mail: policy says domain
>     *     signs some mails
>     * 0.0 MIME_HTML_MOSTLY BODY: Multipart message mostly text/html MIME
>     * 0.0 HTML_MESSAGE BODY: HTML included in message
>     * 1.5 LOGINHASH2 BODY: mail has been classified as spam @ unknown company,
>     *     Germany
>     * 1.5 LOGINHASH1 BODY: mail has been classified as spam @ LogInSolutions
>     *     AG, Germany
>     * 2.2 DCC_CHECK Listed in DCC (http://rhyolite.com/anti-spam/dcc/)
>
> Note that a well-trained BAYES can already take these mails out on its own on my system.

Bayes is good if it's well trained.

> If you find your Bayes scores to be really accurate, then it's a good idea to increase the scores. For me, Bayes is fed from 2 spamtrap addresses with around 50 pieces of the finest spam every day. Doing this, Bayes scores BAYES_99 on 99.5% of my remaining spam - I hardly ever see it score below BAYES_80, and that's just great.

I'm kind of new to spam ... and especially to how people use Bayes. So how many ham mails do you get per day? I'm wondering if I could do something to my system so Bayes may score higher. I have read somewhere that the number of spam mails in Bayes should be a lot higher than the number of ham mails ... is that true? Because I'm doing spam scans for multiple domains ..

> So maybe training Bayes better or increasing the score will put an end to this for you.
>
> arni

Any additional reading on Bayes is welcome ...

// Mikael Syska
Re: Spam PDF
Mikael Syska wrote:
> I'm kind of new to spam ... and especially to how people use Bayes. So how many ham mails do you get per day? I'm wondering if I could do something to my system so Bayes may score higher. I have read somewhere that the number of spam mails in Bayes should be a lot higher than the number of ham mails ... is that true? Because I'm doing spam scans for multiple domains ..

My mail volume isn't high; I do it only for myself and some friends. Some stats on my Bayes db:

0.000          0       4556          0  non-token data: nspam
0.000          0       1356          0  non-token data: nham
0.000          0     280877          0  non-token data: ntokens

I get about 20 ham and 150 spam mails per day (on my personal box) - Bayes is only trained by spamtraps and autolearn.

arni
URIBL_BLACK matching on messages with no URLs in them...
Note: yes, uribl has their own mailing list. That server has been down for quite some time, so I gave up and posted it here in case someone is dual listed and can fix it.

There's no URL in this message. What is it mis-matching against?

Begin forwarded message:

From: *snip*
Date: June 29, 2007 9:44:01 AM PDT
To: [EMAIL PROTECTED]
Subject: [Fwd: Cron [EMAIL PROTECTED] /etc/webmin/time/sync.pl]
Return-Path: *snip*
Received: from kininvie.sv.svcolo.com ([unix socket]) by svcolo.com (Cyrus v2.3.7) with LMTPA; Fri, 29 Jun 2007 09:44:09 -0700
Received: *snip*
X-Sieve: CMU Sieve 2.3
X-Sasl-Enc: MVo3NfRHq5jjBkzoJvK9LGyw0IT35eGmQjh72kfveVrb 1183135440
Message-Id: [EMAIL PROTECTED]
User-Agent: Thunderbird 2.0.0.4 (Windows/20070604)
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary=000507030901020505050806
X-Bayes-Prob: 0.0001 (Score 0)
X-Spam-Flag: YES
X-Spam-Score: 5.00 (*) [Tag at 3.50] URIBL_BLACK,SPF(none,0)
X-Canitpro-Stream: support (inherits from default)
X-Canit-Stats-Id: 117735 - 63721d93a4a2
X-Scanned-By: CanIt (www . roaringpenguin . com) on 64.13.135.12

Something change with the ntp server?

From: [EMAIL PROTECTED] (Cron Daemon)
Date: June 29, 2007 9:00:06 AM PDT
To: [EMAIL PROTECTED]
Subject: Cron [EMAIL PROTECTED] /etc/webmin/time/sync.pl

Failed to connect to ntp0.svcolo.com:37 : Connection refused

--
Jo Rhett
senior geek
Silicon Valley Colocation
Support Phone: 408-400-0550
Re: plugins
On Sat, Jun 30, 2007 at 11:22:36AM -0700, JP Kelly wrote: What is the best way to check what plugins SA is using? Same as everything else, run spamassassin -D --lint. :) -- Randomly Selected Tagline: Internet exceeded user level, please wait until a user logs off before attempting to log back on. - Today's BOFH Excuse
Re: URIBL_BLACK matching on messages with no URLs in them...
On Sat, Jun 30, 2007 at 12:07:04PM -0700, Jo Rhett wrote:

There's no URL in this message. What is it mis-matching against?

When in doubt, run through spamassassin -D:

[9710] dbg: uridnsbl: domains to query: sync.pl svcolo.com

SA doesn't just look for full URLs; it also looks for things that could be hostnames, à la "copy www.example.com into your browser".

--
Randomly Selected Tagline: If all the girls who attended the Harvard-Yale game were laid end to end, I wouldn't be surprised. - Dorothy Parker
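To illustrate the point about hostname-like strings, here is a rough Python sketch. It is not SpamAssassin's extraction code: the regex, the tiny `KNOWN_TLDS` set, and the `candidate_domains` helper are all assumptions made for illustration. It only shows why a path ending in `sync.pl` can be read as a domain under the `.pl` TLD.

```python
import re

# Hypothetical, deliberately tiny TLD list; a real scanner would use a full one.
KNOWN_TLDS = {"com", "net", "org", "pl", "de"}

# One or more dot-terminated labels followed by an alphabetic final label.
HOSTLIKE = re.compile(r"\b((?:[a-z0-9-]+\.)+([a-z]{2,}))\b", re.IGNORECASE)

def candidate_domains(text):
    """Return the last two labels of every hostname-like string in text."""
    found = []
    for match in HOSTLIKE.finditer(text):
        host = match.group(1).lower()
        tld = match.group(2).lower()
        if tld not in KNOWN_TLDS:
            continue  # final label is not a TLD we know, so skip it
        domain = ".".join(host.split(".")[-2:])
        if domain not in found:
            found.append(domain)
    return found

body = ("Subject: Cron /etc/webmin/time/sync.pl\n"
        "Failed to connect to ntp0.svcolo.com:37 : Connection refused")
print(candidate_domains(body))
```

On the forwarded cron message this naive scan finds the same two candidates that the debug line lists, which suggests the script name in the subject is what got looked up.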
Re: A different approach to scoring spamassassin hits
On Jun 30, 2007, at 6:29 PM, Loren Wilton wrote:

And after typing all this I'm thinking you might be right. But part of this approach is to run all these rules in YES/NO fashion and see if the probability is significant. For example: If I tested for SOME_TEST=NO and found it was scoring a probability of ~0.500 then it's indisputable that you are right.

Well, this still doesn't make any real sense to me; it seems equivalent to the attempts at bayes poison that spammers stick into their spams: a bunch of words totally unrelated to the mail in the hopes of outweighing the useful terms. Now their trick works as a good spam indication because the words they pick aren't common to my ham mails, so it is really a good spam indication rather than poison. I'm not immediately convinced that will hold for the usage you intend. Maybe. Maybe not.

However, if you want to do this, remember that bayes works on tokens and has a tokenizer. So SOME_RULE=YES is probably either two or three tokens, and you will end up scoring on the probability of YES and NO, along with the frequency of the rule names, which will be 1. So you probably want to do NO_SOME_RULE and YES_OTHER_RULE or the like when you build the insert list.

Again though I'm not sure I see the point in the yes and no factors; the presence or absence of a word in the mail seems like a pretty good yes/no indication to me. Were I doing it I'd try it both ways and see if there is any difference in results.

I agree with you that it's probably not going to be very effective to use a binary token (e.g. SOME_RULE=YES vs SOME_RULE=NO) compared to the presence of the rule (SOME_RULE exists implies SOME_RULE=YES). So the method: $list = $status->get_names_of_tests_hit() may cover everything that is required to evaluate this approach.

Unfortunately I'm not on the SpamAssassin Bayes modules -- I wrote my own Bayes Engine because I wanted to do that and then thought about including the Rules results from SpamAssassin.
I don't know where this might be going, but it seems to be working extremely well for me based on a training set of just a couple hundred emails in total.
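Loren's tokenizer point can be sketched in a few lines of Python. The `tokenize` function below is a hypothetical stand-in that splits on anything other than letters, digits, and underscores; it is not SpamAssassin's Bayes tokenizer, and `rule_tokens` is likewise an illustrative helper. The sketch only shows how `RULE=YES` falls apart into a rule name plus a shared `YES`/`NO` token, while a prefixed name stays whole.

```python
import re

def tokenize(text):
    """Hypothetical tokenizer: split on any character that is not a letter, digit, or underscore."""
    return [tok for tok in re.split(r"[^A-Za-z0-9_]+", text) if tok]

def rule_tokens(all_rules, hit_rules):
    """Build one self-contained token per rule, prefixed with YES_ or NO_."""
    hits = set(hit_rules)
    return [("YES_" if rule in hits else "NO_") + rule for rule in all_rules]

# RULE=YES style: the rule name and the YES/NO value become separate tokens.
print(tokenize("ADVANCE_FEE_1=YES ADVANCE_FEE_2=NO"))

# Prefixed style: each token carries both the rule name and the outcome.
prefixed = rule_tokens(["ADVANCE_FEE_1", "ADVANCE_FEE_2"], ["ADVANCE_FEE_1"])
print(prefixed)
print(tokenize(" ".join(prefixed)))
```

With the first form, a classifier ends up learning a probability for the bare words `YES` and `NO` across all rules; with the prefixed form each rule outcome gets its own probability.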
Re: URIBL_BLACK matching on messages with no URLs in them...
At 12:07 30-06-2007, Jo Rhett wrote: Note: yes, uribl has their own mailing list. That server has been down for quite some time, so I gave up and posted it here in case someone is dual listed and can fix it. There's no URL in this message. What is it mis-matching against? There was a URL in the message. It's not listed in URIBL. Regards, -sm
Re: A different approach to scoring spamassassin hits
Unfortunately I'm not on the SpamAssassin Bayes modules -- I wrote my own Bayes Engine because I wanted to do that and then thought about including the Rules results from SpamAssassin. I don't know where this might be going, but it seems to be working extremely well for me based on a training set of just a couple hundred emails in total.

Don't see this as a problem. Someone, I forget who, has a Bayes chained to an SA setup. I think the Bayes comes first, but I don't recall. He was claiming good results from chained classifiers using slightly different data and methods. This seems like a reasonably possible contention to me.

If you have a pre-existing Bayes mail filter, and it runs as a filter in a pipe or the like, then basically what you want to do seems very simple to me, at least conceptually. Just run the mail through SA first and then into your classifier. The rule names hit along with their scores will be in the header of the mail you process in your classifier, and thus, as long as you don't ignore header data, the rule names are there to process. No need even to modify SA. In fact you can get a header with just the rule names hit without the scores, so you don't have the score values being scored as tokens.

The only case where you would have to modify SA, in I think either Check or PMS, is if you really did want to bloat every mail with the names of all of the rules in the SA database, rather than just those pertinent to the mail at hand.

I think the trick is simply looking at your mail chain and figuring out how to insert a call to SA before the call to your own Bayes module.

Loren
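A minimal Python sketch of the "SA first, then your own classifier" idea, assuming the mail has already passed through SA and carries an X-Spam-Status header with a `tests=` field like the one arni posted earlier in this thread. The `rule_names_from_status` helper and the sample message are illustrative only; how the resulting names get fed into a particular Bayes engine is left open.

```python
import re
from email import message_from_string

def rule_names_from_status(raw_message):
    """Extract rule names from the tests= field of an X-Spam-Status header."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    # The header may be folded over several lines, so let '.' span newlines
    # and stop at the next key=value field (or the end of the header).
    match = re.search(r"tests=(.*?)(?:\s+autolearn=|\s+version=|$)", status, re.DOTALL)
    if not match:
        return []
    names = re.sub(r"\s+", "", match.group(1)).split(",")
    return [name for name in names if name and name != "none"]

# Sample message built from the header quoted earlier in the thread.
raw = ("X-Spam-Status: Yes, score=10.7 required=5.0 tests=BAYES_99,DCC_CHECK,\n"
       "\tDKIM_POLICY_SIGNSOME,HTML_MESSAGE,LOGINHASH1,LOGINHASH2,MIME_HTML_MOSTLY\n"
       "\tautolearn=no version=3.2.0\n"
       "Subject: test\n"
       "\n"
       "body text\n")

rules = rule_names_from_status(raw)
print(rules)
```

Each name in `rules` can then be handed to the classifier as a single token, without the scores, which matches the suggestion above of not letting score values be scored as tokens.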