Re: Auto-learning ‘considered harmful’: not so much when rejecting spam?
On 1/17/2023 7:33 AM, David Bürgin wrote:
> I have heard it said many times on this list that auto-learning is discouraged, so I decided to finally look into disabling it. But then I realised that I do have a use for auto-learning: In my setup, I use a milter to reject certain spam (score > 10.0). Now, if I turn off auto-learning I lose something. Because, as far as I understand the default spam auto-learning threshold of 12.0 causes incoming high-probability spam to be learned as spam, even though the message is then rejected and not available locally later. Is my understanding correct? Auto-learning of spam can be useful if spam is rejected during the SMTP conversation but after it has been seen – and learned – by SpamAssassin?

On 17.01.23 09:37, Kevin A. McGrail wrote:
> The problem with auto learning I've seen is that it slowly spirals miscategorization errors.

Mostly because there are no really useful indicators of hamminess, and if there are, spammers use them to spread their junk.

After long manual training being occasionally spoiled by autolearn, I have manually set all rules that have negative scores to noautolearn:

tflags RCVD_IN_RP_CERTIFIED noautolearn net nice
tflags RCVD_IN_VALIDITY_CERTIFIED noautolearn net nice
tflags RCVD_IN_RP_SAFE noautolearn net nice
tflags RCVD_IN_VALIDITY_SAFE noautolearn net nice
tflags RCVD_IN_DNSWL_LOW noautolearn net nice
tflags RCVD_IN_DNSWL_MED noautolearn net nice
tflags RCVD_IN_DNSWL_HI noautolearn net nice
tflags RCVD_IN_MSPIKE_H2 noautolearn net nice
tflags RCVD_IN_MSPIKE_H3 noautolearn net nice
tflags RCVD_IN_MSPIKE_H4 noautolearn net nice
tflags RCVD_IN_MSPIKE_H5 noautolearn net nice
tflags RCVD_IN_MSPIKE_WL noautolearn net nice
tflags RCVD_IN_IADB_DK noautolearn net nice
tflags RCVD_IN_IADB_DOPTIN noautolearn net nice
tflags RCVD_IN_IADB_LISTED noautolearn net nice
tflags RCVD_IN_IADB_MI_CPR_MAT noautolearn net nice
tflags RCVD_IN_IADB_ML_DOPTIN noautolearn net nice
tflags RCVD_IN_IADB_OPTIN noautolearn net nice
tflags RCVD_IN_IADB_OPTIN_GT50 noautolearn net nice
tflags RCVD_IN_IADB_RDNS noautolearn net nice
tflags RCVD_IN_IADB_SENDERID noautolearn net nice
tflags RCVD_IN_IADB_SPF noautolearn net nice
tflags RCVD_IN_IADB_UT_CPR_MAT noautolearn net nice
tflags RCVD_IN_IADB_VOUCHED noautolearn net nice
tflags DKIMWL_WL_HIGH noautolearn net nice
tflags DKIMWL_WL_MEDHI noautolearn net nice
tflags DKIMWL_WL_MED noautolearn net nice
tflags DKIM_VALID noautolearn net nice
tflags DKIM_VALID_EF noautolearn net nice

It still needs some training, and in some places you may need to dump the database and re-train from scratch. That's why manual training is great and why you need to keep some spam, but mostly ham.

> The technical term is that it reinforces a bias. A Bayes database should be carefully maintained. It's not very much of a fire and forget technology. And, for example, letting users control it becomes a question of "what is spam?" For example, users might get a very legit mail BUT they are tired of seeing it in their inbox. So they want to train it as spam. If you have per-user implementations, that can be good BUT you need a few hundred samples of good email and bad email to activate Bayes. In short, I don't have a good solution for training Bayes that isn't a lot of work but auto-learning is usually a bad solution. One case where it might be good is if you had a system set up that you fed emails to that were classified. It would then use that good feed to use the auto-learning and add a way of learning without using the command line.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
It's now safe to throw off your computer.
Re: Auto-learning ‘considered harmful’: not so much when rejecting spam?
On 1/17/2023 7:33 AM, David Bürgin wrote:
> I have heard it said many times on this list that auto-learning is discouraged, so I decided to finally look into disabling it. But then I realised that I do have a use for auto-learning: In my setup, I use a milter to reject certain spam (score > 10.0). Now, if I turn off auto-learning I lose something. Because, as far as I understand the default spam auto-learning threshold of 12.0 causes incoming high-probability spam to be learned as spam, even though the message is then rejected and not available locally later. Is my understanding correct? Auto-learning of spam can be useful if spam is rejected during the SMTP conversation but after it has been seen – and learned – by SpamAssassin?

The problem with auto learning I've seen is that it slowly spirals miscategorization errors. The technical term is that it reinforces a bias.

A Bayes database should be carefully maintained. It's not very much of a fire and forget technology. And, for example, letting users control it becomes a question of "what is spam?" For example, users might get a very legit mail BUT they are tired of seeing it in their inbox. So they want to train it as spam.

If you have per-user implementations, that can be good BUT you need a few hundred samples of good email and bad email to activate Bayes.

In short, I don't have a good solution for training Bayes that isn't a lot of work but auto-learning is usually a bad solution. One case where it might be good is if you had a system set up that you fed emails to that were classified. It would then use that good feed to use the auto-learning and add a way of learning without using the command line.

Regards,
KAM

--
Kevin A. McGrail kmcgr...@apache.org
Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171
Auto-learning ‘considered harmful’: not so much when rejecting spam?
I have heard it said many times on this list that auto-learning is discouraged, so I decided to finally look into disabling it. But then I realised that I do have a use for auto-learning: In my setup, I use a milter to reject certain spam (score > 10.0). Now, if I turn off auto-learning I lose something. Because, as far as I understand the default spam auto-learning threshold of 12.0 causes incoming high-probability spam to be learned as spam, even though the message is then rejected and not available locally later. Is my understanding correct? Auto-learning of spam can be useful if spam is rejected during the SMTP conversation but after it has been seen – and learned – by SpamAssassin?
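To make the question concrete, here is a minimal sketch in Python (not SpamAssassin code) of how a reject threshold of 10.0 and the default spam auto-learn threshold of 12.0 would partition incoming mail. It assumes, purely for illustration, that one and the same score is compared against both thresholds and that the auto-learn comparison is inclusive; SpamAssassin computes the auto-learn score separately, so treat this as an approximation. The names `REJECT_THRESHOLD`, `AUTOLEARN_SPAM` and `fate` are made up for this example.

```python
REJECT_THRESHOLD = 10.0   # the milter rejects above this score (from the post)
AUTOLEARN_SPAM = 12.0     # default bayes_auto_learn_threshold_spam (from the post)


def fate(score: float) -> str:
    """Describe what happens to a message with the given score.

    Simplifying assumption: the same score is used for rejecting and
    for auto-learning, and the auto-learn comparison is inclusive.
    """
    if score >= AUTOLEARN_SPAM:
        return "rejected, learned as spam"
    if score > REJECT_THRESHOLD:
        return "rejected, not learned"
    return "accepted"


for s in (3.0, 11.0, 25.0):
    print(s, "->", fate(s))
```

Under these assumptions only the messages at 12.0 and above feed Bayes; anything between the two thresholds is rejected without leaving a trace in the database.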
Re: Question regarding auto-learning
On 03.07.18 12:17, J Doe wrote:
> From reading the documentation, it appears that to train the Bayesian filter I require a minimum of 1,000 pieces of ham and 1,000 pieces of spam.

No. You need at least 200 hams and 200 spams for Bayes to start firing, but you can tune that by setting bayes_min_ham_num and bayes_min_spam_num. Note that training on too few mails can result in false positives/negatives.

> I am currently collecting spam on one of my servers via a spam trap address and slowly reaching that number. I was wondering, though, if I can use auto learning (bayes_auto_learn 1), before training the database ?

Autolearning does the training for you. Manual training is still faster and more precise.

> When autolearn fires on messages at the moment, it is correctly detecting ham and spam based on the default ham and spam thresholds:
>
> bayes_auto_learn_threshold_nonspam 0.1
> bayes_auto_learn_threshold_spam 12.0
>
> Can this be used before training the database or is it more often used to supplement (on an ongoing basis), a database that has already been trained ?

Those don't contradict each other. You can use both manual and automatic learning.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Chernobyl was an Windows 95 beta test site.
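The activation rule mentioned in this reply can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not SpamAssassin's implementation; the function name `bayes_is_active` is hypothetical, and the defaults of 200 mirror the numbers given above for bayes_min_ham_num and bayes_min_spam_num.

```python
def bayes_is_active(nham: int, nspam: int,
                    min_ham: int = 200, min_spam: int = 200) -> bool:
    """Return True once both learned-message counts reach their minimums.

    min_ham and min_spam stand in for bayes_min_ham_num and
    bayes_min_spam_num; 200 is the figure quoted in the reply above.
    """
    return nham >= min_ham and nspam >= min_spam


# A spam trap alone is not enough: ham has to be learned as well.
print(bayes_is_active(nham=0, nspam=950))     # False
print(bayes_is_active(nham=200, nspam=200))   # True
```

The current counts can be read from the nham and nspam lines that `sa-learn --dump magic` prints.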
Question regarding auto-learning
Hello,

I have a question regarding autolearning and Bayes functionality.

From reading the documentation, it appears that to train the Bayesian filter I require a minimum of 1,000 pieces of ham and 1,000 pieces of spam. I am currently collecting spam on one of my servers via a spam trap address and slowly reaching that number.

I was wondering, though, if I can use auto learning (bayes_auto_learn 1), before training the database ?

When autolearn fires on messages at the moment, it is correctly detecting ham and spam based on the default ham and spam thresholds:

bayes_auto_learn_threshold_nonspam 0.1
bayes_auto_learn_threshold_spam 12.0

Can this be used before training the database or is it more often used to supplement (on an ongoing basis), a database that has already been trained ?

Thanks,

- J
Re: Bayes not auto-learning?
On 02/24/2018 01:05 AM, Amir Caspi wrote:
> On Feb 23, 2018, at 11:47 PM, David B Funk wrote:
>> It could have 20 points from a whole bunch of body rules but if it only hit 2 points via header rules it still will not auto-learn.
>
> Gotcha. The spam in question that triggered this hit a lot of rules, but hard for me to tell on cursory inspection whether it satisfies sufficient header and body points. But it LOOKS like there should be at least 3 points from header (MISSING_HEADERS, FREEMAIL_FORGED_REPLYTO, among others) and certainly 3 body (MONEY_FRAUD_3 at the very least). The actual spam report is this:
>
> * 0.0 FSL_CTYPE_WIN1251 Content-Type only seen in 419 spam
> * 0.0 NSL_RCVD_FROM_USER Received from User
> * 1.0 MISSING_HEADERS Missing To: header
> * 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60%
> *     [score: 0.5004]
> * 1.1 DCC_CHECK Detected as bulk mail by DCC (dcc-servers.net)
> * 0.0 FROM_MISSP_MSFT From misspaced + supposed Microsoft tool
> * 0.0 FSL_NEW_HELO_USER Spam's using Helo and User
> * 2.6 MSOE_MID_WRONG_CASE No description available.
> * 0.0 FROM_MISSP_USER From misspaced, from "User"
> * 1.0 RDNS_DYNAMIC Delivered to internal network by host with
> *     dynamic-looking rDNS
> * 0.0 LOTS_OF_MONEY Huge... sums of money
> * 0.0 FROM_MISSP_XPRIO Misspaced FROM + X-Priority
> * 1.6 REPLYTO_WITHOUT_TO_CC No description available.
> * 0.0 AXB_XMAILER_MIMEOLE_OL_024C2 Yet another X header trait
> * 0.0 MSGID_FROM_MTA_HEADER Message-Id was added by a relay
> * 0.0 FSL_BULK_SIG Bulk signature with no Unsubscribe
> * 2.1 FREEMAIL_FORGED_REPLYTO Freemail in Reply-To, but not From
> * 1.0 FREEMAIL_REPLYTO Reply-To/From or Reply-To/body contain different
> *     freemails
> * 0.0 TO_NO_BRKTS_FROM_MSSP Multiple header formatting problems
> * 1.9 FORGED_MUA_OUTLOOK Forged mail pretending to be from MS Outlook
> * 1.6 TO_NO_BRKTS_DYNIP To: lacks brackets and dynamic rDNS
> * 0.0 FILL_THIS_FORM Fill in a form with personal information
> * 2.0 TO_NO_BRKTS_MSFT To: lacks brackets and supposed Microsoft tool
> * 2.0 FILL_THIS_FORM_LONG Fill in a form with personal information
> * 3.1 FROM_MISSP_FREEMAIL From misspaced + freemail provider
> * 3.0 MONEY_FRAUD_3 Lots of money and several fraud phrases
>
> But, it still didn't autolearn. (I can post the entire spample if the above seems like it should have autolearned.)
>
>> Another possible factor, if you have "bayes_auto_learn_on_error" enabled, then autolearn will be skipped if Bayes already agrees with the condition of the message. IE: if the message is already classified as BAYES_99 then it won't bother auto-learning it as yet another high-ranking spam.
>
> I do not have that enabled. Also, as you can see from above, this hit BAYES_50.
>
> Does the above provide an indication as to why it didn't autolearn?
>
> Thanks!
>
> --- Amir

I found the best thing to do is set up a hidden mail server (iRedMail) and split a copy of all mail to it to sort and filter into a Ham and Spam folder based on rule hits and scoring. Then I run a nightly sa-learn on the Ham and Spam folders (in that order).

The few questionable emails that score in the middle stay in the Inbox so I just have to drag-n-drop into the Ham or Spam folder, taking a few minutes a day. Some that are new phishing campaigns or are from compromised accounts are copied into a Spamcop folder that automatically submits them to my Spamcop account.

I also use the Ham and Spam folders for the nightly SA masscheck to help get new rules validated and the new 72_scores.cf updated daily via sa-update.

--
David Jones
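A nightly job along the lines described above could look like the following Python sketch. The folder paths are hypothetical placeholders (the post does not say where the Ham and Spam folders live), and the helper names are invented for this example; only the `sa-learn --ham` and `sa-learn --spam` invocations are real. Ham is learned first, then spam, matching the order given in the post.

```python
import subprocess
from typing import List


def sa_learn_commands(ham_dir: str, spam_dir: str) -> List[List[str]]:
    """Build the two sa-learn invocations, ham first, then spam."""
    return [
        ["sa-learn", "--ham", ham_dir],
        ["sa-learn", "--spam", spam_dir],
    ]


def run_nightly(ham_dir: str, spam_dir: str, dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them in order."""
    for cmd in sa_learn_commands(ham_dir, spam_dir):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True aborts the run if sa-learn exits with an error.
            subprocess.run(cmd, check=True)


# Hypothetical Maildir locations; adjust to the actual mail store.
run_nightly("/var/vmail/sorter/Maildir/.Ham/cur",
            "/var/vmail/sorter/Maildir/.Spam/cur")
```

Calling `run_nightly(..., dry_run=False)` from cron would perform the actual training; the dry run is the default so the sketch can be tried without sa-learn installed.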
Re: Bayes not auto-learning?
On 2/24/2018 2:05 AM, Amir Caspi wrote:
> Does the above provide an indication as to why it didn't autolearn?

No, the above does not help as the autolearning is complicated. I believe a few years ago I added debug output or headers or something that tried to make it clearer.

If it doesn't autolearn, I would not stress. It's not a simplistic, black or white decision based on a single factor.

Off-hand, I can't find the work I did but $status->get_autolearn_points() might help you dig into the code.

Regards,
KAM
Re: Bayes not auto-learning?
On Feb 23, 2018, at 11:47 PM, David B Funk wrote:
> It could have 20 points from a whole bunch of body rules but if it only hit 2
> points via header rules it still will not auto-learn.

Gotcha. The spam in question that triggered this hit a lot of rules, but hard for me to tell on cursory inspection whether it satisfies sufficient header and body points. But it LOOKS like there should be at least 3 points from header (MISSING_HEADERS, FREEMAIL_FORGED_REPLYTO, among others) and certainly 3 body (MONEY_FRAUD_3 at the very least). The actual spam report is this:

* 0.0 FSL_CTYPE_WIN1251 Content-Type only seen in 419 spam
* 0.0 NSL_RCVD_FROM_USER Received from User
* 1.0 MISSING_HEADERS Missing To: header
* 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60%
*     [score: 0.5004]
* 1.1 DCC_CHECK Detected as bulk mail by DCC (dcc-servers.net)
* 0.0 FROM_MISSP_MSFT From misspaced + supposed Microsoft tool
* 0.0 FSL_NEW_HELO_USER Spam's using Helo and User
* 2.6 MSOE_MID_WRONG_CASE No description available.
* 0.0 FROM_MISSP_USER From misspaced, from "User"
* 1.0 RDNS_DYNAMIC Delivered to internal network by host with
*     dynamic-looking rDNS
* 0.0 LOTS_OF_MONEY Huge... sums of money
* 0.0 FROM_MISSP_XPRIO Misspaced FROM + X-Priority
* 1.6 REPLYTO_WITHOUT_TO_CC No description available.
* 0.0 AXB_XMAILER_MIMEOLE_OL_024C2 Yet another X header trait
* 0.0 MSGID_FROM_MTA_HEADER Message-Id was added by a relay
* 0.0 FSL_BULK_SIG Bulk signature with no Unsubscribe
* 2.1 FREEMAIL_FORGED_REPLYTO Freemail in Reply-To, but not From
* 1.0 FREEMAIL_REPLYTO Reply-To/From or Reply-To/body contain different
*     freemails
* 0.0 TO_NO_BRKTS_FROM_MSSP Multiple header formatting problems
* 1.9 FORGED_MUA_OUTLOOK Forged mail pretending to be from MS Outlook
* 1.6 TO_NO_BRKTS_DYNIP To: lacks brackets and dynamic rDNS
* 0.0 FILL_THIS_FORM Fill in a form with personal information
* 2.0 TO_NO_BRKTS_MSFT To: lacks brackets and supposed Microsoft tool
* 2.0 FILL_THIS_FORM_LONG Fill in a form with personal information
* 3.1 FROM_MISSP_FREEMAIL From misspaced + freemail provider
* 3.0 MONEY_FRAUD_3 Lots of money and several fraud phrases

But, it still didn't autolearn. (I can post the entire spample if the above seems like it should have autolearned.)

> Another possible factor, if you have "bayes_auto_learn_on_error" enabled,
> then autolearn will be skipped if Bayes already agrees with the condition of
> the message. IE: if the message is already classified as BAYES_99 then it
> won't bother auto-learning it as yet another high-ranking spam.

I do not have that enabled. Also, as you can see from above, this hit BAYES_50.

Does the above provide an indication as to why it didn't autolearn?

Thanks!

--- Amir
Re: Bayes not auto-learning?
On 2018-02-23 22:32, Amir Caspi wrote:
> So, I've been trying to tweak my setup and noticed that VERY few of my
> emails are being autolearned as spam, even when their spam score
> is far above the autolearn threshold. The threshold is set to 12; I
> just saw a spam with score >25 not being autolearned.

Sigh. This really is a FAQ, and I did ask it myself (maybe more than once). Read the fine documentation.

In short: the score that is compared to the threshold for autolearning is _not_ the normal score that determines spam/ham.

Despite the fact that it is documented, I find the algorithm to be too opaque to feel in control.

--
Please don't Cc: me privately on mailing lists and Usenet, if you also post the followup to the list or newsgroup. To reply privately _only_ on Usenet and on broken lists which rewrite From, fetch the TXT record for no-use.mooo.com.
Re: Bayes not auto-learning?
On Fri, 23 Feb 2018, Amir Caspi wrote:
> Hi all,
>
> So, I've been trying to tweak my setup and noticed that VERY few of my emails are being autolearned as spam, even when their spam score is far above the autolearn threshold. The threshold is set to 12; I just saw a spam with score >25 not being autolearned.
>
> Are there rules that prevent autolearning? If so, why? If a spam scores really high because it hits (let's say) 10 or more rules, but just one of those rules is enough to prevent autolearning, that seems overly restrictive, no?
>
> For example, for one of my users, out of about 650 spams received in the last month, only 10 have been autolearned. For another user, only 12 of nearly 1400. That seems like a very low percentage, and clearly some high-scoring spams are not being auto-learned.
>
> Any explanation is appreciated!
>
> Thanks!
>
> --- Amir

If you read the SpamAssassin documentation about Bayes auto-learning you will see that there are several conditions that must be satisfied.

For example, there are some types of rules which aren't considered at all when computing the auto-learning threshold score (such as white/black list scores, rules tagged with the noautolearn tflag, or the actual Bayes score itself).

Of the types of rules which are allowed, at least 3 of those points must come from header type rules and at least 3 of those points must come from body type rules.

So a spam can have 100 points from a blacklist and not auto-learn. It could have 20 points from a whole bunch of body rules but if it only hit 2 points via header rules it still will not auto-learn.

Another possible factor, if you have "bayes_auto_learn_on_error" enabled, then autolearn will be skipped if Bayes already agrees with the condition of the message. IE: if the message is already classified as BAYES_99 then it won't bother auto-learning it as yet another high-ranking spam.

What I usually see in auto-learned spam is that they hit a number of network RBL rules (spamhaus, SORBS, etc) and a number of body rules such as RAZOR, URIBLS, etc.

--
Dave Funk University of Iowa College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include
Better is not better, 'standard' is better. B{
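The conditions described in this reply can be sketched as a small Python function. This is a simplified model written for illustration, not SpamAssassin's code: the `Hit` type, the `excluded` flag (standing in for white/black list rules, noautolearn rules and the Bayes rules) and the function name are all assumptions, and the documentation lists further conditions that are not modelled here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Hit:
    name: str
    score: float
    kind: str               # "header" or "body"; assumed known for each rule
    excluded: bool = False  # True for list, noautolearn or Bayes rules


def would_autolearn_spam(hits: Iterable[Hit], threshold: float = 12.0,
                         min_header: float = 3.0,
                         min_body: float = 3.0) -> bool:
    """Apply the three checks from the explanation above."""
    counted = [h for h in hits if not h.excluded]
    total = sum(h.score for h in counted)
    header = sum(h.score for h in counted if h.kind == "header")
    body = sum(h.score for h in counted if h.kind == "body")
    return total >= threshold and header >= min_header and body >= min_body


# 100 points from a blacklist alone: excluded, so nothing is learned.
print(would_autolearn_spam([Hit("SOME_BLACKLIST", 100.0, "header", True)]))
# 20 body points but only 2 header points: still not learned.
print(would_autolearn_spam([Hit("BODY_RULES", 20.0, "body"),
                            Hit("HEADER_RULE", 2.0, "header")]))
# Enough points of both kinds and above the threshold: learned.
print(would_autolearn_spam([Hit("BODY_RULES", 10.0, "body"),
                            Hit("HEADER_RULES", 4.0, "header")]))
```

The rule names in the examples are placeholders; classifying the rules of a real report as header or body type requires looking at each rule's definition.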
Bayes not auto-learning?
Hi all,

So, I've been trying to tweak my setup and noticed that VERY few of my emails are being autolearned as spam, even when their spam score is far above the autolearn threshold. The threshold is set to 12; I just saw a spam with score >25 not being autolearned.

Are there rules that prevent autolearning? If so, why? If a spam scores really high because it hits (let's say) 10 or more rules, but just one of those rules is enough to prevent autolearning, that seems overly restrictive, no?

For example, for one of my users, out of about 650 spams received in the last month, only 10 have been autolearned. For another user, only 12 of nearly 1400. That seems like a very low percentage, and clearly some high-scoring spams are not being auto-learned.

Any explanation is appreciated!

Thanks!

--- Amir
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 11:12 AM, John Hardin wrote:
> A week or so back they briefly listed some of the MailControl.com MTAs, due to apparent exploits. They were quickly removed, though.

So the message here is that some DNSBL's are better than others about including and removing addresses quickly and responsibly. Perhaps. I take no position on that. But that does not address the issue of collateral damage to users who share an ISP's email server with someone else who happened to get a spam through and reported back to the DNSBL.

Not long ago, I had another client blocked from sending response emails to their on-line customers about their purchases. Turned out one of the users on the hosting provider's system had sent some spam. Now the hosting provider (Webfaction) is quite responsible, very diligent, and has *fantastic* support. (I can recommend them for dynamic language apps with no reservations.) But guess what? The DNSBL's interface for interacting with them was down. For over a week. (We're sorry, but... Please come back when... No guaranty as to...) And emails to the affected customers were blocked for all that time.

I use DNSBL's. But I don't like them. SA is indispensable. I like it. But it's a huge compilation of kluges that happen to mostly work. Expedient. Pragmatic. Not a real solution to the actual problem.

-Steve
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 11:10 AM, Jim Popovitch wrote:
> Just a heads-up... that sort of biting comment is probably not welcome

I'm familiar with adapting to the relative insularities of various lists. But thanks for the heads-up, Jim.

-Steve
Re: Bayes, Manual and Auto Learning Strategies
On Wed, 2 Jul 2014, Axb wrote:
> If a sender's IP is listed @Spamhaus , he has a serious problem reaching many, many destinations. If he's been exploited, you get good evidence and fast delisting processing and I have yet to see a real FP with ZEN.

A week or so back they briefly listed some of the MailControl.com MTAs, due to apparent exploits. They were quickly removed, though.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
There is no better measure of the unthinking contempt of the environmentalist movement for civilization than their call to turn off the lights and sit in the dark. -- Sultan Knish
---
2 days until the 238th anniversary of the Declaration of Independence
Re: Bayes, Manual and Auto Learning Strategies
On Wed, Jul 2, 2014 at 11:54 AM, Steve Bergman wrote:
>> I suggest you join the SDLU list where you can discuss anti spam
>> philosophy.
>>
>
> Thanks. I suggest that you consult for an ISP-dependent business someday.
> ;-)
>
> It's an education, too.
>
> -Steve

Just a heads-up... that sort of biting comment is probably not welcome on the SDLU list.

-Jim P.
Re: Bayes, Manual and Auto Learning Strategies
> I suggest you join the SDLU list where you can discuss anti spam philosophy.

Thanks. I suggest that you consult for an ISP-dependent business someday. ;-)

It's an education, too.

-Steve
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 05:39 PM, Steve Bergman wrote:
> On 07/02/2014 09:48 AM, Axb wrote:
>> If an IP is exploited/sends spam and a legitimate msg is rejected then somebody hasn't done due diligence and I see the reject as legitimated.
>
> The legitimate senders and receivers of the good message, neither of whose companies have anything to do with the spam, would not see it that way. And I agree with their perspective.
>
> Some of the perspectives I'm reading here seem really off in the ether. I get the impression that some are so frustrated with SA's limitations that they are willing to resort to desperate measures which normal users would instantly recognize as insane.
>
> No rudeness intended. But some of the things I'm reading here are just bizarre.

I suggest you join the SDLU list where you can discuss anti spam philosophy. It's a great resource for knowledge.

List Guidelines: http://www.new-spam-l.com/admin/faq.html
List Information: https://spammers.dontlike.us/mailman/listinfo/list

The Mailop list is also a good place to lurk and bathe in hundreds of years of mail related experience:
http://chilli.nosignal.org/mailman/listinfo/mailop
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 09:48 AM, Axb wrote:
> If an IP is exploited/sends spam and a legitimate msg is rejected then somebody hasn't done due diligence and I see the reject as legitimated.

The legitimate senders and receivers of the good message, neither of whose companies have anything to do with the spam, would not see it that way. And I agree with their perspective.

Some of the perspectives I'm reading here seem really off in the ether. I get the impression that some are so frustrated with SA's limitations that they are willing to resort to desperate measures which normal users would instantly recognize as insane.

No rudeness intended. But some of the things I'm reading here are just bizarre.

-Steve
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 04:40 PM, Steve Bergman wrote:
>> You are discussing about DNSBLs but not being specific.
>
> I'm specific in that all the DNSBL's blacklist IP addresses or blocks. And that in today's world many, many companies share sets of mail servers with many other companies and individuals.

If an IP is exploited/sends spam and a legitimate msg is rejected then somebody hasn't done due diligence and I see the reject as legitimated.

If I need to open up, I have options as the DNSWL, etc.
Re: Bayes, Manual and Auto Learning Strategies
> You are discussing about DNSBLs but not being specific.

I'm specific in that all the DNSBL's blacklist IP addresses or blocks. And that in today's world many, many companies share sets of mail servers with many other companies and individuals.

> I'll let others sell you this Hoover.

No sale necessary. I continue to recognize the overall expediency of the DNSBL kluge, and continue to use it myself.

I wouldn't buy a Hoover anyway. I'm a Kirby kind of guy. I have a 1969 Dual Sanitronic 80 that my grandmother gave our family new, as a Christmas gift.

https://c1.staticflickr.com/7/6071/6056367963_f06f08c7f6_z.jpg

A 1976 Classic III that I picked up at a garage sale.

http://cdn3.volusion.com/maxg3.xen6j/v/vspfiles/photos/KirbyClassicIII-4.jpg?1329982229

And a really cool model 516, manufactured in 1956 that someone had set out on the curb for garbage pickup, which I rescued and restored.

http://www.1377731.com/kirby/516_5.jpg

All stock photos. Not mine.

-Steve
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 03:54 PM, Steve Bergman wrote:
> On 07/02/2014 06:45 AM, Axb wrote:
>> I'm pretty sure, a huge amount of SA users trust Spamhaus' ZEN at smtp level for outright rejects.
>
> At this point, I'm using the defaults, other than upping BAYES_999 enough to total to 5.0 when added to BAYES_99.
>
>> If a sender's IP is listed @Spamhaus , he has a serious problem reaching many, many destinations.
>
> Many, many destinations? Or a high percentage of destinations? I recently had to explain to the owner of the company why an important email from one of his business associates at another company was blocked. I told him that they were on a couple of spam block lists (which they were) and that contributed to the mail's rejection. I made the same pitch. "This should affect their outgoing mail to many sites, etc.". But I'm not sure I believe it.
>
> When I interact with people who've had their emails rejected (often related to DNSBLs) I've been listening for any mention of other mails of theirs to other companies being blocked. But when the DNSBL rules in SA are the major contributors to the rejecting, it seems that we are the only domain they interact with which is doing so.
>
> Entries in the DNSBL databases do great collateral damage. And of course none of these companies are spammers. They're with this or that ISP who has, at one time, had someone exploit their servers to send spam.
>
> DNSBL's are like a guy with a bazooka trying to play sniper.

You are discussing about DNSBLs but not being specific.

With millions of sessions/day I'm glad Spamhaus keeps my servers from melting.

I'll let others sell you this Hoover.
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 06:45 AM, Axb wrote:
> I'm pretty sure, a huge amount of SA users trust Spamhaus' ZEN at smtp level for outright rejects.

At this point, I'm using the defaults, other than upping BAYES_999 enough to total to 5.0 when added to BAYES_99.

> If a sender's IP is listed @Spamhaus , he has a serious problem reaching many, many destinations.

Many, many destinations? Or a high percentage of destinations? I recently had to explain to the owner of the company why an important email from one of his business associates at another company was blocked. I told him that they were on a couple of spam block lists (which they were) and that contributed to the mail's rejection. I made the same pitch. "This should affect their outgoing mail to many sites, etc.". But I'm not sure I believe it.

When I interact with people who've had their emails rejected (often related to DNSBLs) I've been listening for any mention of other mails of theirs to other companies being blocked. But when the DNSBL rules in SA are the major contributors to the rejecting, it seems that we are the only domain they interact with which is doing so.

Entries in the DNSBL databases do great collateral damage. And of course none of these companies are spammers. They're with this or that ISP who has, at one time, had someone exploit their servers to send spam.

DNSBL's are like a guy with a bazooka trying to play sniper.

-Steve
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 10:47 AM, Steve Bergman wrote:
> The DNSBL's are problematic because so many ISP's mail servers are on them. We get quite a few emails from employees at companies whose ISP's are on Spamhaus lists, or whatever, due to nothing that has anything to do with them.

I'm pretty sure, a huge amount of SA users trust Spamhaus' ZEN at smtp level for outright rejects.

If a sender's IP is listed @Spamhaus , he has a serious problem reaching many, many destinations. If he's been exploited, you get good evidence and fast delisting processing and I have yet to see a real FP with ZEN.

Consider it being better a sender gets a hard reject than having msgs land in some spam folder and remain unseen.

but then...
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 10:47 AM, Steve Bergman wrote:
> But for all the discussion today, we never really had a good talk about postscreen, which is something I'd like to hear someone expound a bit upon.

Probably the wrong list ... review the Postfix list archives.
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 10:47 AM, Steve Bergman wrote:
> I'll add you to the list of people telling me that jumping out of an airplane at 20,000 feet with nothing but a parachute and a pair of underwear is fun.

Yep... it is... though you could catch a cold...
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 03:05 AM, Dave Funk wrote:
> Unless you've explicitly disabled them, the network based rules (razor, pyzor, dcc, DNS based rules, RBLs, URIBLs, etc) constitute an external 'reputation' system to pass judgment on messages.

Actually, DCC is not included in the default due to arbitrary restrictions on request volume for the public servers. 100,000 per day or something. And neither is Pyzor, presumably for similar reasons? Razor2 is in by default.

I use all these, but have reservations about them. DCC, Pyzor and Razor2 are lists of bulk email. Not specifically of *unsolicited* bulk email. Many of my users are on lists of various sorts.

The DNSBL's are problematic because so many ISP's mail servers are on them. We get quite a few emails from employees at companies whose ISP's are on Spamhaus lists, or whatever, due to nothing that has anything to do with them.

> It's not uncommon to take a low-scoring spam and find that it gets a higher score on retest as it has been added to various bad-boy lists.

Except that the "bad-boy" lists flag more ham than spam.

> This is also one way that gray-listing helps.

Review the thread. You don't want to talk to me about greylisting. ;-)

But for all the discussion today, we never really had a good talk about postscreen, which is something I'd like to hear someone expound a bit upon.

> I've used site-wide Bayes with auto-learning at a site with ~3000 users and have had to flush & restart our Bayes database twice in 10 years.

I'll add you to the list of people telling me that jumping out of an airplane at 20,000 feet with nothing but a parachute and a pair of underwear is fun.

-Steve
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 02:39 AM, Dave Funk wrote: Steve, For some reason you seem to be hung-up on Bayes "autolearning". Skip down the thread. I was demonstrated to be wrong. :-) Is it possible that you're confusing it with "Auto-White listing"? (which is now deprecated and has -nothing- to do with Bayes). No. I know the difference. AWL, planned to be replaced with TxRep and all that. (I'd mention that TxRep has problems, but it's too late at night for me to engage in yet another argument.) SA's Bayesian scorer is a system based upon a method that parses a message, extracts 'tokens' from it and uses an algorithm to calculate a score for the message based upon a dictionary of previously seen tokens and their relative merit. Yeah. Bayesian statistics is pretty cool. or via an automated process from within SA as it scores messages (known as 'auto' learning). So regardless of whether manual or auto learning is utilized, tokens are added to the dictionary. See, that's where things stop making sense to me. I would not expect the Bayesian filter to do any better than its training. And if its training is via input from static rules (plus DNSBLs and DCC) I would not expect it to be able to do any better. And it's not hard to imagine pathological behavior developing. But people are telling me different. And I'm open to considering alternative possibilities. It's also possible to employ both auto & manual learning methods in the same installation. That would be the scenario I am considering. There can be one dictionary used for scoring all messages processed (called "site wide Bayes") or many separate dictionaries, one used for each recognized user ("per user Bayes"). Either way, the dictionary(s) need to be updated (and the update process could be either manual, auto, or both). Yes. I've been devoted to individual fileDB's, each individually trained for a particular user's spam^Wemail stream. People are telling me that system-wide databases work well. 
It's been this way for the past 10+ years AFAIK (well, maybe 10 years ago it didn't have as many options for back-end database storage, mostly limited to Berkeley-DB type methods). I think it was around 2003, in SA 2.5(?) that SA got a Bayesian classifier. IIRC, there was a project called dspam (which I think is still around) For a while the dspam guys were pushing the fact that *dspam* was a modern spam filter, and SA was old, clunky, and too outdated to use. Anyway, in the very early versions of SA Bayes, everything was system-wide. Later they added the option to use individual user files. And the only info I've seen that described autolearn and how it worked was a mailing list post from 2004 which specifically stated that it was system-wide, in memory, and was lost upon restart. Maybe that's correct and maybe it's not. But today, it looks to be user-specific, if configured that way. I'm still working out whether I want to use it, and if so, how. -Steve
Re: Bayes, Manual and Auto Learning Strategies
On Wed, 2 Jul 2014, Steve Bergman wrote: Well... I just turned on autolearn for a moment, deleted the bayes_* files on the test account I use, and sent myself a message from my usual outside account. And new bayes_* files were created. So I was wrong, and I win. More options. So now I can proceed to the "what does this mean?" phase. If I leave things as they are, then training is perfect if the users are diligent. But if they are not, then... what? I see plenty of spams getting through with a 0.0 score. IIRC, the autolearn spam threshold is 7? Pretty much everything there is spam. But I'm not sure I quite buy having the static rules of SA training Bayes. Isn't Bayes just learning to emulate the static rules, with all their imperfections? Unless you've explicitly disabled them, the network based rules (razor, pyzor, dcc, DNS based rules, RBLs, URIBLs, etc) constitute an external 'reputation' system to pass judgment on messages. It's not uncommon to take a low-scoring spam and find that it gets a higher score on retest as it has been added to various bad-boy lists. This is also one way that gray-listing helps. If you stiff-arm the first pass of a spam run a later check may hit it more accurately as it's been added to block-lists in the mean-time. If it starts going wrong, doesn't that mean the errors are going to spiral out of control? That is a possible risk of relying solely on auto-learning. The autolearn system has been carefully crafted and tuned over the years to try to prevent a feed-back loop from throwing it into a tail-spin. For example the internal scoring system used to determine if a message is spam or ham WRT the choice for auto-learning explicitly excludes the Bayes score (and other particular kinds of scores such as white/black lists) to try to prevent tail-eating. Occasional judicious manual learning can help to 'tweak' things when Bayes looks like it's not in top shape. (IE manual learning of FPs & FNs). 
I've used site-wide Bayes with auto-learning at a site with ~3000 users and have had to flush & restart our Bayes database twice in 10 years. Dave -- Dave Funk University of Iowa College of Engineering 319/335-5751 FAX: 319/384-0549 1256 Seamans Center Sys_admin/Postmaster/cell_adminIowa City, IA 52242-1527 #include Better is not better, 'standard' is better. B{
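The feedback-loop safeguard Dave describes (the autolearn decision ignores the Bayes score and white/black list scores) can be pictured with a rough sketch like the one below. It is an illustration only, not SpamAssassin's code. The thresholds are the defaults mentioned elsewhere in this archive (0.1 for ham, 12.0 for spam). The set of excluded rule names, the example scores, and the exact boundary handling are assumptions made for the example, and the real plugin applies additional checks that are left out here.

```python
from typing import Dict, Optional

# Default autolearn thresholds as quoted in this archive.
HAM_THRESHOLD = 0.1
SPAM_THRESHOLD = 12.0

# Illustrative set of rules left out of the autolearn score.
# Assumption: SpamAssassin decides this via rule flags, not a fixed name list.
EXCLUDED_RULES = {"BAYES_00", "BAYES_50", "BAYES_99", "USER_IN_WHITELIST"}


def autolearn_decision(rule_scores: Dict[str, float]) -> Optional[str]:
    """Return "spam", "ham", or None (do not learn) for one scanned message."""
    # Sum every rule score except Bayes and white/black list rules, so the
    # classifier is never trained on its own verdict.
    learn_score = sum(
        score for name, score in rule_scores.items()
        if name not in EXCLUDED_RULES
    )
    # Boundary handling (>= and <) is an assumption of this sketch.
    if learn_score >= SPAM_THRESHOLD:
        return "spam"
    if learn_score < HAM_THRESHOLD:
        return "ham"
    return None


if __name__ == "__main__":
    # Hypothetical rule hits and scores, for illustration only.
    message = {"BAYES_99": 3.5, "RAZOR2_CHECK": 1.7, "URIBL_BLACK": 1.7}
    print(autolearn_decision(message))
```

The wide gap between the two thresholds is the point: a message in the gray area between them is scored but never learned from.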
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 02:14 AM, Axb wrote: You don't need to trust me or believe me (I'm not selling anything - just commenting on what works for me) Well, I know you know what I meant. Ever thought of running a newer distro in a VM, only for SA and let spamass-milter use that? That would mean you can play with SA 3.4 without having to redo all your mail infra? I'm pushing to do our ubuntu 14.04 upgrade soon to get the dovecot full text search. And then a memory upgrade. And these days I just max them out on memory. 4GB -> 32GB. Plus adding a 4TB RAID1. So it ought to be able to handle almost anything. And I've just confirmed that SA 3.4 made it into 14.04. That should, at least, avert all those annoying "time to upgrade" responses like I got here earlier. It's very late here. 2:45AM, I see. But it's been a lot of fun arguing with you guys today. And thanks for all the help. Pyzor seems to be functioning fine now. General rules of thumb to keep in mind: Whenever there are inexplicable problems, it's probably selinux causing them. And if not that, regular old POSIX permissions. And if ever there is an article of clothing you need but can't find anywhere in the house, there's usually a dog sleeping on it. Or possibly a cat. -Steve
Re: Bayes, Manual and Auto Learning Strategies
On Wed, 2 Jul 2014, Steve Bergman wrote: On 07/01/2014 11:49 PM, Karsten Bräckelmann wrote: Those do not tell you about using file or SQL based databases? They do. But not specifically with respect to autolearn. You never thought about googling for "spamassassin per user" and friends? You never checked the SA wiki? I have, indeed. No reference to autolearn and persistent storage. The lack of mention is notable. I'd expect people to be lining up to tell me I'm mistaken if I absolutely were. Can you point me to a change log somewhere documenting autolearn moving from in-memory and system-wide to per user and persistent? I don't hold a strong opinion on this. It would be nice if I were wrong. It would open more options. I'm just waiting for evidence that it's the case. My perception is that it's not. -Steve Steve, For some reason you seem to be hung-up on Bayes "autolearning". Is it possible that you're confusing it with "Auto-White listing"? (which is now deprecated and has -nothing- to do with Bayes). SA's Bayesian scorer is a system based upon a method that parses a message, extracts 'tokens' from it and uses an algorithm to calculate a score for the message based upon a dictionary of previously seen tokens and their relative merit. The dictionary is created and updated by a process called 'learning' wherein already-classified messages are tokenized and their tokens are stored in the dictionary along with a merit value derived from their instance count and a factor taken from being classified as spam or ham. This learning process can be either externally driven (known as 'manual' learning) or via an automated process from within SA as it scores messages (known as 'auto' learning). So regardless of whether manual or auto learning is utilized, tokens are added to the dictionary. It's also possible to employ both auto & manual learning methods in the same installation. 
There can be one dictionary used for scoring all messages processed (called "site wide Bayes") or many separate dictionaries, one used for each recognized user ("per user Bayes"). Either way, the dictionary(s) need to be updated (and the update process could be either manual, auto, or both). The Bayes dictionary(s) need to be stored some how, the usual method is via some kind of database. It could be a simple file based DB, some kind of fancy SQL server based system or something else. This is a DBA'ish kind of choice as to what particular technology is used to store the dictionary DB. (usually on disk in some way, could be in some kind of memory resident set of tables, or something else???). So you have a multi-dimensional matrix WRT your Bayes system configuration, and manual VS auto learning is just one factor. It's been this way for the past 10+ years AFAIK (well, maybe 10 years ago it didn't have as many options for back-end database storage, mostly limited to Berkeley-DB type methods). I hope this helps you. -- Dave Funk University of Iowa College of Engineering 319/335-5751 FAX: 319/384-0549 1256 Seamans Center Sys_admin/Postmaster/cell_adminIowa City, IA 52242-1527 #include Better is not better, 'standard' is better. B{
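As a rough illustration of the tokenize, learn, and score cycle Dave describes, here is a toy token dictionary. It is a sketch under simplifying assumptions, not SpamAssassin's implementation: the class name `TokenDictionary`, the regex tokenizer, and the plain averaging of per-token values are all invented for the example, and SA's real tokenizer and probability combining are considerably more involved.

```python
import re
from collections import Counter


class TokenDictionary:
    """Toy Bayes-style dictionary: per-token counts for spam and ham."""

    def __init__(self) -> None:
        self.spam_counts: Counter = Counter()
        self.ham_counts: Counter = Counter()
        self.nspam = 0  # number of messages learned as spam
        self.nham = 0   # number of messages learned as ham

    @staticmethod
    def tokenize(text: str) -> set:
        # Simplified tokenizer: lowercase word-like runs, counted once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def learn(self, text: str, is_spam: bool) -> None:
        # Manual and auto learning both end up here; only the caller differs.
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.nspam += 1
        else:
            self.ham_counts.update(tokens)
            self.nham += 1

    def token_spamminess(self, token: str) -> float:
        # Relative frequency of the token in spam versus ham messages.
        s = self.spam_counts[token] / self.nspam if self.nspam else 0.0
        h = self.ham_counts[token] / self.nham if self.nham else 0.0
        return 0.5 if s + h == 0 else s / (s + h)

    def score(self, text: str) -> float:
        # Average over known tokens; 0.5 means "no opinion".
        known = [t for t in self.tokenize(text)
                 if self.spam_counts[t] or self.ham_counts[t]]
        if not known:
            return 0.5
        return sum(self.token_spamminess(t) for t in known) / len(known)


if __name__ == "__main__":
    site_wide = TokenDictionary()
    site_wide.learn("cheap pills online", is_spam=True)
    site_wide.learn("meeting notes attached", is_spam=False)
    print(site_wide.score("cheap pills"), site_wide.score("meeting notes"))
```

In this picture, "site wide Bayes" is one `TokenDictionary` shared by everyone and "per user Bayes" is one instance per recipient; where the instances are stored is a separate choice.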
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 02:02 AM, Axb wrote: and don't count on that - they may do it the first week, new toy, but for how long? Not new. They'd previously been training SA with Evolution for some years. I have some confidence in many of them doing it right. Also: keep in mind each user's Bayes folder also gets a bayes_seen file which grows and grows and grows and never gets truncated. Well, I have the maximum bayes toks set at 2,000,000. Is bayes_seen likely to become a problem with ~100 users and 4TB of disk space? My largest email volume user has accumulated only 320k of "seen" in 10 days. And I assume that repeat spams don't add to it. Do you really want to spend time watching each user's Bayes? Not really. But I'll do whatever is necessary. -Steve
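A back-of-the-envelope projection of the bayes_seen question above, as a sketch. It assumes that "320k" means 320 KiB, that growth is linear, and (as a deliberately pessimistic upper bound) that all ~100 users grow at the rate of the largest-volume user. The function name is made up for the example.

```python
KIB = 1024


def projected_growth_bytes(observed_bytes: float, observed_days: float,
                           horizon_days: float, users: int = 1) -> float:
    """Linear projection of file growth over horizon_days for a number of users."""
    return observed_bytes / observed_days * horizon_days * users


if __name__ == "__main__":
    # Figures from the message: 320k in 10 days for the busiest user, ~100 users.
    per_user_year = projected_growth_bytes(320 * KIB, 10, 365)
    site_year = projected_growth_bytes(320 * KIB, 10, 365, users=100)
    print(f"busiest user, one year: {per_user_year / KIB**2:.1f} MiB")
    print(f"100 such users, one year: {site_year / KIB**3:.2f} GiB")
```

Under those assumptions the whole site lands on the order of a gigabyte per year, which is small next to 4TB; the never-truncated file is more of a housekeeping concern than a capacity one.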
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 09:01 AM, Steve Bergman wrote: Axb, I'm not sure I quite believe it. And I'm not quite sure I trust you. But you do make an attractive pitch. Excellent spam filtering, system-wide, with no responsibility for training on the part of the users? You don't need to trust me or believe me (I'm not selling anything - just commenting on what works for me) You can try it and after a couple of weeks, see if it works for you and then if necessary come up with new methods for extra training or dump the concept totally. Bayes is yet another scoring mechanism in SA. If you have enough traffic, you can wipe the data any time and it's not like you're switching SA off totally. During the dev/test process of the Redis backend, as stuff changed on a daily basis I was forced to purge the Bayes data several times/week. It even became a running joke (wave Henrik/Marc). This sounds like the kind of "too good to be true" message that I'd expect to receive in a spam mail. :-) But hmm. This is good dream material for tonight. I wonder if our Ubuntu 14.04 upgrade has SA 3.4 with redis built in. I do hear that the redis backend is amazing. Ever thought of running a newer distro in a VM, only for SA and let spamass-milter use that? That would mean you can play with SA 3.4 without having to redo all your mail infra?
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 08:48 AM, Steve Bergman wrote: Someone, please convince me that I should turn it on. autolearn doesn't mean you cannot also train manually... Should I turn it on and take my "train as ham" entry out of .forward? Or should I not? manually training ham from unreviewed data? bad idea. I suppose that largely depends upon my individual users' levels of diligence. and don't count on that - they may do it the first week, new toy, but for how long? Also: keep in mind each user's Bayes folder also gets a bayes_seen file which grows and grows and grows and never gets truncated. Do you really want to spend time watching each user's Bayes?
Re: Bayes, Manual and Auto Learning Strategies
Axb, I'm not sure I quite believe it. And I'm not quite sure I trust you. But you do make an attractive pitch. Excellent spam filtering, system-wide, with no responsibility for training on the part of the users? This sounds like the kind of "too good to be true" message that I'd expect to receive in a spam mail. But hmm. This is good dream material for tonight. I wonder if our Ubuntu 14.04 upgrade has SA 3.4 with redis built in. I do hear that the redis backend is amazing. -Steve
Re: Bayes, Manual and Auto Learning Strategies
Well... I just turned on autolearn for a moment, deleted the bayes_* files on the test account I use, and sent myself a message from my usual outside account. And new bayes_* files were created. So I was wrong, and I win. More options. So now I can proceed to the "what does this mean?" phase. If I leave things as they are, then training is perfect if the users are diligent. But if they are not, then... what? I see plenty of spams getting through with a 0.0 score. IIRC, the autolearn spam threshold is 7? Pretty much everything there is spam. But I'm not sure I quite buy having the static rules of SA training Bayes. Isn't Bayes just learning to emulate the static rules, with all their imperfections? If it starts going wrong, doesn't that mean the errors are going to spiral out of control? Leaving autolearn off puts everything in the hands of the users. And that's where I've left things for now. Someone, please convince me that I should turn it on. Should I turn it on and take my "train as ham" entry out of .forward? Or should I not? I suppose that largely depends upon my individual users' levels of diligence. -Steve
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 08:00 AM, Steve Bergman wrote: On 07/02/2014 12:52 AM, Axb wrote: Site wide bayes works VERY well even under such ugly conditions as traffic with multiple languages, for ham as well as spam. Please tell me more. This goes against Paul Graham's original advice, IIRC. And it goes against intuition. Then again. Bayesian statistics go against intuition. It's hard to let go and trust a system-wide Bayes. But I'm listening... It works, trust me. SA's Bayes implementation is incredibly robust. My site wide Bayes DB is not exactly small. 0.000 0 23850755 0 non-token data: nspam 0.000 0 10702302 0 non-token data: nham Would I run a monster this size if it didn't work? Nope. I waited a long time to be able to use something really 100% site wide (not per server) till we got the ability to use Redis which was FAST, robust and doesn't cause me headaches such as SQL, file permission issues, etc. I can't give you a scientific reason for not using per user Bayes. Site wide works for my +2000 corp domains which include .tr, .ru, .cn, .ua, .es, .fr, .de plus a ton of other major ccTLD domains AND: I only run autolearn. NO manual/scheduled training.
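The two "non-token data" lines quoted above are the kind of output "sa-learn --dump magic" prints. A small sketch for pulling the counters out of such output follows. It assumes the column layout seen in the quoted lines (the value in the third column, the name after "non-token data:"), and the function name is invented for the example.

```python
import re
from typing import Dict

# Matches lines such as: "0.000 0 23850755 0 non-token data: nspam"
MAGIC_LINE = re.compile(
    r"^\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data:\s+(.+?)\s*$"
)


def parse_dump_magic(output: str) -> Dict[str, int]:
    """Map each 'non-token data' name (nspam, nham, ...) to its integer value."""
    values: Dict[str, int] = {}
    for line in output.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            values[match.group(2)] = int(match.group(1))
    return values


if __name__ == "__main__":
    # Sample taken from the message above.
    sample = (
        "0.000 0 23850755 0 non-token data: nspam\n"
        "0.000 0 10702302 0 non-token data: nham\n"
    )
    counts = parse_dump_magic(sample)
    print(counts)
    if counts.get("nham", 0) == 0:
        print("no ham learned yet")
```

A check like the last one would flag the situation Steve reports elsewhere in the thread, where per-user databases showed nspam entries but no nham.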
Re: Bayes, Manual and Auto Learning Strategies
On Wed, 2 Jul 2014, Steve Bergman wrote: On 07/01/2014 11:14 PM, John Hardin wrote: Autolearn trains the bayes database. The bayes data is stored wherever you configured it to be stored, in a DBM database or SQL or redis, and it's per-user if you configure per-user Bayes databases and scan emails using different usernames (vs. a global user like root or amavis). That is interesting. How sure are you of this? Because if you're pretty sure, it's a piece of information I've been keen to confirm for a while. The bayes database is the only thing in SA that can be trained. (I'm excluding submission of the message to pyzor et. al. because that's obviously not local.) Odd, though, that before I set up .forward to train incoming mails as ham and disabled autolearn, no nhams were showing up in "sa-learn --dump magic" for the individual users. Just nspams. That is rather odd. Very-low-scoring hams should be autolearned as ham unless the default thresholds have been changed. -- John Hardin KA7OHZhttp://www.impsec.org/~jhardin/ jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- News flash: Lowest Common Denominator down 50 points --- 3 days until the 238th anniversary of the Declaration of Independence
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 12:52 AM, Axb wrote: Site wide bayes works VERY well even under such ugly conditions as traffic with multiple languages, for ham as well as spam. Please tell me more. This goes against Paul Graham's original advice, IIRC. And it goes against intuition. Then again. Bayesian statistics go against intuition. It's hard to let go and trust a system-wide Bayes. But I'm listening... -Steve
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 07:37 AM, Steve Bergman wrote: Let's turn this around? Can you prove autolearn was ever done to memory? I'm not really interested in proving anything. I'm interested in being convinced that autolearn is individual file-based when spamc is run as the individual user. It's in the code... but yes, autolearn is always file based and respects the per user settings unless you run spamd with -x I'm not quite sure how that would affect my strategy. But it might (or might not) make autolearn useful. More important, you may need to reconsider whether per user Bayes will give you the level of quality you're aiming for, and from experience I can tell you: it won't. Site wide bayes works VERY well even under such ugly conditions as traffic with multiple languages, for ham as well as spam.
Re: Bayes, Manual and Auto Learning Strategies
Let's turn this around? Can you prove autolearn was ever done to memory? I'm not really interested in proving anything. I'm interested in being convinced that autolearn is individual file-based when spamc is run as the individual user. I'm not quite sure how that would affect my strategy. But it might (or might not) make autolearn useful. -Steve
Re: Bayes, Manual and Auto Learning Strategies
On 07/02/2014 07:19 AM, Steve Bergman wrote: On 07/01/2014 11:49 PM, Karsten Bräckelmann wrote: Those do not tell you about using file or SQL based databases? They do. But not specifically with respect to autolearn. You never thought about googling for "spamassassin per user" and friends? You never checked the SA wiki? I have, indeed. No reference to autolearn and persistent storage. The lack of mention is notable. I'd expect people to be lining up to tell me I'm mistaken if I absolutely were. Can you point me to a change log somewhere documenting autolearn moving from in-memory and system-wide to per user and persistent? I don't hold a strong opinion on this. It would be nice if I were wrong. It would open more options. I'm just waiting for evidence that it's the case. My perception is that it's not. Let's turn this around? Can you prove autolearn was ever done to memory? If you mean "autolearn to journal", this is also file based. I've been using SA since before it was an Apache project, when it was developed by McAfee and the sources were on Sourceforge, and back then it was already file based.
Re: Bayes, Manual and Auto Learning Strategies
On 07/01/2014 11:14 PM, John Hardin wrote: Autolearn trains the bayes database. The bayes data is stored wherever you configured it to be stored, in a DBM database or SQL or redis, and it's per-user if you configure per-user Bayes databases and scan emails using different usernames (vs. a global user like root or amavis). That is interesting. How sure are you of this? Because if you're pretty sure, it's a piece of information I've been keen to confirm for a while. Odd, though, that before I set up .forward to train incoming mails as ham and disabled autolearn, no nhams were showing up in "sa-learn --dump magic" for the individual users. Just nspams. -Steve
Re: Bayes, Manual and Auto Learning Strategies
On 07/01/2014 11:49 PM, Karsten Bräckelmann wrote: Those do not tell you about using file or SQL based databases? They do. But not specifically with respect to autolearn. You never thought about googling for "spamassassin per user" and friends? You never checked the SA wiki? I have, indeed. No reference to autolearn and persistent storage. The lack of mention is notable. I'd expect people to be lining up to tell me I'm mistaken if I absolutely were. Can you point me to a change log somewhere documenting autolearn moving from in-memory and system-wide to per user and persistent? I don't hold a strong opinion on this. It would be nice if I were wrong. It would open more options. I'm just waiting for evidence that it's the case. My perception is that It's not. -Steve
Re: Bayes, Manual and Auto Learning Strategies
On Tue, 2014-07-01 at 22:40 -0500, Steve Bergman wrote: > On 07/01/2014 10:21 PM, Karsten Bräckelmann wrote: > > > > http://spamassassin.apache.org/doc/Mail_SpamAssassin_Conf.html > > http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html > > I've read those over and over. It never says anything about where the > data is maintained, or whether it's per-user or not. The *only* solid > claim I have is a ten year old (yes, at the dawn of SA Bayes) post which > specifically says it's in memory, system-wide, and lost upon SA restart. Those do not tell you about using file or SQL based databases? You never thought about googling for "spamassassin per user" and friends? You never checked the SA wiki? FWIW, the links given do NOT refer to in-memory only at all. An in-memory only Bayes database definitely is much more than ten years ago. If it ever existed. No need for me to even check. > > Milter usually means system-wide. (But since you just asked, it is.) > > I'm using spamass-milter. It suid's to the recipient user for most > mails. For aliases it defaults to a particular user who gets an > unbelievable amount of spam at the gate, and whom I know sorts his > ham/spam religiously. So you want to check back with your specific setup and its docs. Suid'ing is pretty likely to be per-user, though the definition of user is not specifically clear in the context of a milter (and the final recipient). In either case, that is not SA specific. (SA happily uses both, per-user or site-wide config AND bayes database, depending on context.) Refer to your milter's docs. > > Irrespective of your feeling -- cheers! /me having a beer > > Whew! After the conversations I've had here, today, I need one, too! ;-) Don't see this as an attack on you. It isn't. Just pointers on helping your understanding of the situation and your issues. Not always gentle, but that also reflects the initial stance. 
-- char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4"; main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
Re: Bayes, Manual and Auto Learning Strategies
On Tue, 2014-07-01 at 22:18 -0500, Steve Bergman wrote: > On 07/01/2014 09:53 PM, Karsten Bräckelmann wrote: > > > Frankly, it appears you don't understand what auto-learning is. > > So please specify, explicitly, what it is. I asked some specific > questions about it. And I'm very interested in the answers. If you want my opinion, please re-phrase your questions. I locally deleted most of this previous (originally unrelated) thread. > Is auto-learn still system-wide? I'd need it to apply to individual > users. Is it in-memory only? Or can I have it update the users' filedb > token databases? SA itself never was system-wide, neither user-specific. It is both, can be either. It depends on the context of calling SA. > If it's now per user and uses the user databases, then I am more than > ready to reconsider my opinion. But I've not been able to get a clear > answer to this. I haven't had an opportunity to test. And I'd want > confirmation from someone in the know anyway, before I changed strategies. It does not depend on SA, but on how you invoke SA. We cannot give you a clear answer. It depends on your system, your SMTP, glue, system wide calling of SA, and possibly per-user invocations even after system-wide. To be clear: SA is a filter. It does nothing itself, other than classification. Being called, and at which point, is outside the scope of SA. Rejecting, deleting, delivering or any other kind of action is outside the scope of SA. That's actions performed by the calling layer, based on the result of SA evaluation. > >> This method shields the user from the worst of the spam, while giving > >> them full control of what gets relearned as spam. > > > > Wrong. It is not "this" (your) method, that shields the user from the > > worst of the spam. That's SA. Not your style of auto-training. > > Mine is not autotraining at all. it's giving the user a way of > explicitly training the backend spam filter. 
Quoting your previous post, you "have a line in the users' default .forward file to train incoming mail as ham". That is auto-training. > > (Besides, you *are* doing auto-learning, which you just claimed to be a > > complete joke.) > > No. The messages are assumed ham until the user classifies it as spam. > It is explicit learning. Under user control, Being "assumed" is not the same as being "treated and automatically reinforced". The latter is what you do. (And btw, Yes. You are auto-learning.) > > At this point I won't get into details. It should suffice to highlight > > that a default ham auto-learning threshold of 0.1 is part of the safety > > concepts. (See the M::SA::Plugin::AutoLearnThreshold man-page for more.) > > I really don't think you understand what it is I'm doing. Anything below > a score of 5.0 goes into their mailbox and learned as ham. If it's ham, > that's great. If it's spam, they move it to Junk and it gets learned as > spam. auto-learn is as brain dead as the defunct AWL. I perfectly understood what you are doing. You didn't understand why that is bad. Failing to explain might be my bad, though I'll leave re-explaining for tomorrow my timezone. Or you carefully re-reading my posts. > > I never checked the TB internal Bayes implementation and auto-learn > > strategy, but I'd be surprised if they do train on black/white, without > > any gray area in between. > > Optimally, I would have an "incoming folder" and then the user could > manually move the messages from there to spam or ham. But considering Which is basically what you came from, using Dovecot antispam plugin with SA, and dedicated folders "where the user could manually move the messages" to. Why didn't you just set that up? (Hint: That's your set-up without auto-learning ham Inbox deliveries.) 
> that this was not even remotely necessary with our old email provider, I > don't feel that I can put my users to that level of extra trouble that > they never even thought about having to deal with before, just because > SA is not performing as well as the spam filter they are used to. The Do initial manual training. Then get back to us. > mail needs to go into the inbox directly. And for SA's bayesian to work, > it needs to be assumed as ham initially. No. It seems your previous "email provider", whatever that might be, had some sort of spam filtering service. Now you're on your own. Which you are, unless you decide to ask for free (as in beer) support by the community providing the software for free (as in speech) to help you weed out the spam. You did ask, which is just fine, but your assumptions are ki
Re: Bayes, Manual and Auto Learning Strategies
On Tue, 1 Jul 2014, Steve Bergman wrote: On 07/01/2014 10:21 PM, Karsten Bräckelmann wrote: http://spamassassin.apache.org/doc/Mail_SpamAssassin_Conf.html http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html I've read those over and over. It never says anything about where the data is maintained, or whether it's per-user or not. The *only* solid claim I have is a ten year old (yes, at the dawn of SA Bayes) post which specifically says it's in memory, system-wide, and lost upon SA restart. Autolearn trains the bayes database. The bayes data is stored wherever you configured it to be stored, in a DBM database or SQL or redis, and it's per-user if you configure per-user Bayes databases and scan emails using different usernames (vs. a global user like root or amavis). -- John Hardin KA7OHZ http://www.impsec.org/~jhardin/ jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- News flash: Lowest Common Denominator down 50 points --- 3 days until the 238th anniversary of the Declaration of Independence
Re: Bayes, Manual and Auto Learning Strategies
On 07/01/2014 10:21 PM, Karsten Bräckelmann wrote: http://spamassassin.apache.org/doc/Mail_SpamAssassin_Conf.html http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html I've read those over and over. It never says anything about where the data is maintained, or whether it's per-user or not. The *only* solid claim I have is a ten year old (yes, at the dawn of SA Bayes) post which specifically says it's in memory, system-wide, and lost upon SA restart. Milter usually means system-wide. (But since you just asked, it is.) I'm using spamass-milter. It suid's to the recipient user for most mails. For aliases it defaults to a particular user who gets an unbelievable amount of spam at the gate, and whom I know sorts his ham/spam religiously. Which, referring to my previous post, also means, a single sloppy user deleting your custom-auto-learned FN ham messages affects all your other users. No. I make sure to keep each user solely responsible for their own email welfare. Irrespective of your feeling -- cheers! /me having a beer Whew! After the conversations I've had here, today, I need one, too! ;-) -Steve
Re: Bayes, Manual and Auto Learning Strategies
On Tue, 2014-07-01 at 20:53 -0500, Steve Bergman wrote: > On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote: > > > That's pretty bad practice. Fundamentally, you are implementing a custom > > auto-learn flavor, overruling the SA configurable auto-learn behavior > > BTW, that reminds me of a question I had been meaning to ask on the > list. Autolearn. There's very little written about it, so far as I am http://spamassassin.apache.org/doc/Mail_SpamAssassin_Conf.html http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html > aware. But from what I have gleaned, from old posts, is that it is > system-wide and in-memory. It depends on how you call SA (SMTP or MDA level). SA itself is a filter, called by your mail-processing chain. Thus, there is no SA default context of system-wide or per-user. It depends on how you call it. > Now, I have Spamass-milter set to run SA 3.3 > as the recipient user, using the filedb backend. So in 3.3, is autolearn > system wide and in memory, or per user and on disk? Milter usually means system-wide. (But since you just asked, it is.) Which, referring to my previous post, also means, a single sloppy user deleting your custom-auto-learned FN ham messages affects all your other users. Or a non-sloppy, but on-vacation-mode user. Moreover, there is no in-memory only, not on-disk mode. Unless you don't have to ask about it. > This makes a difference regarding what Karsten and I are discussing. I > don't suppose I would object to being wrong. But I have a feeling that > I'm right. Irrespective of your feeling -- cheers! /me having a beer -- char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4"; main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
Re: Bayes, Manual and Auto Learning Strategies
On 07/01/2014 09:53 PM, Karsten Bräckelmann wrote: Frankly, it appears you don't understand what auto-learning is. So please specify, explicitly, what it is. I asked some specific questions about it. And I'm very interested in the answers. Is auto-learn still system-wide? I'd need it to apply to individual users. Is it in-memory only? Or can I have it update the users' filedb token databases? If it's now per user and uses the user databases, then I am more than ready to reconsider my opinion. But I've not been able to get a clear answer to this. I haven't had an opportunity to test. And I'd want confirmation from someone in the know anyway, before I changed strategies. This method shields the user from the worst of the spam, while giving them full control of what gets relearned as spam. Wrong. It is not "this" (your) method, that shields the user from the worst of the spam. That's SA. Not your style of auto-training. Mine is not autotraining at all. it's giving the user a way of explicitly training the backend spam filter. And unless you disabled Bayes auto-learning in SA (dunno, might have been mentioned deep in the thread), the user does not have full control of what gets relearned as spam. I have disabled autolearning. I thought I mentioned that to you. (Besides, you *are* doing auto-learning, which you just claimed to be a complete joke.) No. The messages are assumed ham until the user classifies it as spam. It is explicit learning. Under user control, At this point I won't get into details. It should suffice to highlight that a default ham auto-learning threshold of 0.1 is part of the safety concepts. (See the M::SA::Plugin::AutoLearnThreshold man-page for more.) I really don't think you understand what it is I'm doing. Anything below a score of 5.0 goes into their mailbox and learned as ham. If it's ham, that's great. If it's spam, they move it to Junk and it gets learned as spam. auto-learn is as brain dead as the defunct AWL. 
> I never checked the TB internal Bayes implementation and auto-learn strategy, but I'd be surprised if they do train on black/white, without any gray area in between.

Optimally, I would have an "incoming folder" and then the user could manually move the messages from there to spam or ham. But considering that this was not even remotely necessary with our old email provider, I don't feel that I can put my users to that level of extra trouble that they never even thought about having to deal with before, just because SA is not performing as well as the spam filter they are used to.

The mail needs to go into the inbox directly. And for SA's Bayesian filter to work, it needs to be assumed as ham initially.

The only thing I see which might change my view would be explicit details about where autolearn stores its data and how it is used on a per-user basis.

-Steve
Re: Bayes, Manual and Auto Learning Strategies
On Tue, 2014-07-01 at 20:36 -0500, Steve Bergman wrote: > On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote: > > > > That's pretty bad practice. Fundamentally, you are implementing a custom > > auto-learn flavor, overruling the SA configurable auto-learn behavior > > SA's autolearn behavior doesn't make much sense. I have no confidence in it. The auto-learning feature is NOT meant to be a fully automated training system. It's an aid for the user to eliminate the need to care about the extremes, while focusing on the close-calls. There are options to tweak to your specific needs, and there even is no single "SA autolearn behavior" as you stated, but different flavors. And an option to turn it off. Frankly, it appears you don't understand what auto-learning is. > This method shields the user from the worst of the spam, while giving > them full control of what gets relearned as spam. Wrong. It is not "this" (your) method, that shields the user from the worst of the spam. That's SA. Not your style of auto-training. And unless you disabled Bayes auto-learning in SA (dunno, might have been mentioned deep in the thread), the user does not have full control of what gets relearned as spam. > > and ignoring all safety concepts implemented by SA. > > What safety concepts? autolearn is a complete joke. Even the docs > explain that it's only there as a last resort method of kinda sorta > training the spam filter. You are doing (custom) auto-learning as ham of any message with a score less than required_score of 5.0. *That* is a joke. (Besides, you *are* doing auto-learning, which you just claimed to be a complete joke.) At this point I won't get into details. It should suffice to highlight that a default ham auto-learning threshold of 0.1 is part of the safety concepts. (See the M::SA::Plugin::AutoLearnThreshold man-page for more.) > > So if a user in a hurry simply deletes some spam, it will remain ham, as > > far as Bayes is concerned. > > Same as with Thunderbird, I think. 
I never checked the TB internal Bayes implementation and auto-learn strategy, but I'd be surprised if they do train on black/white, without any gray area in between. You stated it. Please back up your claim. > And it's working very well for them. > If they act irresponsibly, they'll get more spam. It takes no longer to > highlight the spam and click "Junk" than it does to highlight the spam > and click "Delete". While I am aware I'm not the average user -- there's a "delete" action key on my keyboard. There's no "junk" equivalent. Yes, I avoid using the mouse if keyboard interaction is more productive... > I've pretty much decided at this point that if the users don't do what I > tell them to do, repeatedly, then what results is not my responsibility. > > And it's not. Do you hate your users or your job? (Sorry, snide-remark I couldn't resist. Feel free to ignore.) > The alternative is to not mark incoming mail as ham, and allow the SA > Bayesian filter to remain inactive forever. No. I can only guess, but it appears there are some mis-interpretations in that conclusion. The SA Bayesian classifier to "remain inactive forever" can only refer to insufficient initial training. Manual training. Of at least 200 ham and spam each (by default, you can lower that to 0). You will easily get that by manual training of existing messages. And even default auto- learning would eventually cross the ham number. Less than forever. More importantly, SA still marks (classifies) incoming mail as ham. Just because its overall score is less than 5.0. It just does not *learn* all of them as ham. Because there's a chance it might not actually be ham, but a FN. That area, between (default) auto-learning as ham and classifying as spam is the gray area, where actual user input is of much value. For both, learning spam AND ham, for that matter. In particular, because generally (and as SA principle), a FP is *much* worse than a FN. 
Your approach of force-learning those as ham is biasing your Bayes DB. At the very least temporarily (unless a fresh spam campaign has been re-trained by your users on Monday). At worst, until you clear it.

Btw, is that per-user, or are you gambling a site-wide Bayes DB?

> I opted to give the users the choice of being responsible for sorting,
> and reaping the benefits of that if they do. And yes, I know that some
> are not going to.
>
> I'd be interested if you have a better solution in mind.

Do not auto-learn as ham every message that scores below required_score.

Introduce train-on-error for your users, with an extended manual training option: specific ham and spam folders, where moving or copying mail into them trains the Bayes classifier. Kind of optional for the user, unless they feel there's too much mis-classification.
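The gray area described above can be sketched in a few lines of Python. This is only an illustration of the three score zones, using the defaults quoted in the thread (required_score 5.0, ham auto-learn threshold 0.1, spam auto-learn threshold 12.0). The function name and the exact boundary comparisons are assumptions made for the example, and the sketch ignores that SpamAssassin computes the auto-learn score from the non-Bayes score set rather than from the final score.

```python
# Defaults quoted in the thread; adjust them to match your own configuration.
REQUIRED_SCORE = 5.0         # at or above this, mail is classified as spam
HAM_LEARN_THRESHOLD = 0.1    # below this, mail is auto-learned as ham
SPAM_LEARN_THRESHOLD = 12.0  # at or above this, mail is auto-learned as spam

GRAY_AREA = "no auto-learning (gray area, manual training is valuable here)"


def describe(score):
    """Return (classification, auto-learn action) for a message score.

    Hypothetical helper: a simplified model, not SpamAssassin's own code.
    """
    verdict = "spam" if score >= REQUIRED_SCORE else "ham"
    if score < HAM_LEARN_THRESHOLD:
        learning = "auto-learn as ham"
    elif score >= SPAM_LEARN_THRESHOLD:
        learning = "auto-learn as spam"
    else:
        learning = GRAY_AREA
    return verdict, learning


if __name__ == "__main__":
    for s in (-1.0, 3.0, 7.0, 15.0):
        print(s, describe(s))
```

A message scoring 3.0 is still delivered as ham, but nothing is learned from it. Learning every message below 5.0 as ham removes exactly that safety margin.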
Re: Bayes, Manual and Auto Learning Strategies
On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote: That's pretty bad practice. Fundamentally, you are implementing a custom auto-learn flavor, overruling the SA configurable auto-learn behavior BTW, that reminds me of a question I had been meaning to ask on the list. Autolearn. There's very little written about it, so far as I am aware. But from what I have gleaned, from old posts, is that it is system-wide and in-memory. Now, I have Spamass-milter set to run SA 3.3 as the recipient user, using the filedb backend. So in 3.3, is autolearn system wide and in memory, or per user and on disk? This makes a difference regarding what Karsten and I are discussing. I don't suppose I would object to being wrong. But I have a feeling that I'm right. -Steve
Re: Bayes, Manual and Auto Learning Strategies
On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote: That's pretty bad practice. Fundamentally, you are implementing a custom auto-learn flavor, overruling the SA configurable auto-learn behavior SA's autolearn behavior doesn't make much sense. I have no confidence in it. This method shields the user from the worst of the spam, while giving them full control of what gets relearned as spam. and ignoring all safety concepts implemented by SA. What safety concepts? autolearn is a complete joke. Even the docs explain that it's only there as a last resort method of kinda sorta training the spam filter. So if a user in a hurry simply deletes some spam, it will remain ham, as far as Bayes is concerned. Same as with Thunderbird, I think. And it's working very well for them. If they act irresponsibly, they'll get more spam. It takes no longer to highlight the spam and click "Junk" than it does to highlight the spam and click "Delete". I've pretty much decided at this point that if the users don't do what I tell them to do, repeatedly, then what results is not my responsibility. And it's not. The alternative is to not mark incoming mail as ham, and allow the SA Bayesian filter to remain inactive forever. I opted to give the users the choice of being responsible for sorting, and reaping the benefits of that if they do. And yes, I know that some are not going to. I'd be interested if you have a better solution in mind. -Steve
Bayes, Manual and Auto Learning Strategies (was: Re: getting tons of SPAM)
On Tue, 2014-07-01 at 18:43 -0500, Steve Bergman wrote: > On 07/01/2014 06:09 PM, RW wrote: > > I'm sceptical about the use of Dovecot-Antispam with Spamassassin. > > The problem is that it trains on SpamAssassin errors rather than Bayes > > errors. It may be possible to get sufficient spam this way, but ham > > is learned very slowly through avoidable FPs. > > We currently (early days for this installation) get plenty of spam for > the users to train by moving it to the junk folder. Ham was the problem. > Dovecot does nothing about training ham. Dovecot (and its antispam plugin) does nothing about training ham, either. It offers target folders and triggers, for easy manual (re-) classification -- and thus training -- of ham and spam. > That's why I have a line in the users' default .forward file to train > incoming mail as ham. That's pretty bad practice. Fundamentally, you are implementing a custom auto-learn flavor, overruling the SA configurable auto-learn behavior and ignoring all safety concepts implemented by SA. There's a reason for the ham and spam learning thresholds, and the ham threshold to be 0.1 by default, *not* equaling required_score's default of 5.0. > Then if they or Thunderbird decide to move the mail to Junk, it gets > re-trained as spam. So if a user in a hurry simply deletes some spam, it will remain ham, as far as Bayes is concerned. > dovecot-antispam is *not* a complete solution, so far as I can see. > > At this early stage, it *is* painful to watch all that spam coming in > over the weekend getting trained as ham. I tell my users to mark it as > spam on Monday morning. And if they don't, I just figure it's not my fault. It is your fault to implement a broken training strategy. > Once the token databases get larger there won't be so much potential > flux back and forth, I guess. 
Re: Bayes auto-learning a bad idea?
On 28.09.11 10:07, Lars Jørgensen wrote:
> Not sure if this is the correct forum, but google couldn't help me (or I
> am too low on caffeine).
>
> I get a lot of spam that would have been flagged as such, but a bayes
> score of -1.9 pulls it down to hammy status.
>
> I train Bayes manually on the borderline cases, but also have
> auto-learning enabled. Is that really a bad idea? Should I disable it,
> delete the bayes-databases and start over on manual-only learning?

Do you run manual learning? Keeping only automatic learning can easily make things go wrong and let people think Bayes is bad.

If you re-train on those that misfired, you should get BAYES hitting properly soon. (Provided you didn't misconfigure e.g. trusted_networks or internal_networks. That could break SA very "effectively".)

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Send this email to 100 your friends - let them see what an idiot you are
Re: Bayes auto-learning a bad idea?
On Wed, 28 Sep 2011 14:30:32 +0200 Lars Jørgensen wrote:
> Looking at
> http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html#learning_options
>
> i see an option called "bayes_use_hapaxes" that promises
> significantly better hit-rates, but also increases database size by a
> factor of 8 to 10.

I've never understood what this is supposed to mean, and I suspect it's just plain wrong.

bayes_use_hapaxes determines whether hapaxes (tokens with a total count of 1) are used in the calculation. It doesn't affect whether they are stored; and it can't, since all tokens start off as hapaxes. It might have a marginal effect through the updating of atimes, but in that case it's expediting the removal of the most useful hapaxes.

> What is the recommendation on this?

I'd leave it on.
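The point about hapaxes can be illustrated with a toy token store. This is a minimal sketch under stated assumptions, not SpamAssassin's Bayes code: the `learn` and `tokens_for_scoring` helpers and the single total count per token are invented for the example (the real database tracks spam and ham counts and access times separately).

```python
from collections import Counter


def learn(db, message_tokens):
    """Record each distinct token of one message; every new token starts at count 1."""
    db.update(set(message_tokens))


def tokens_for_scoring(db, message_tokens, use_hapaxes=True):
    """Pick the known tokens that may take part in the probability calculation.

    With use_hapaxes=False, tokens seen only once are skipped when scoring,
    but they remain in the database either way.
    """
    minimum = 1 if use_hapaxes else 2
    return sorted(t for t in set(message_tokens) if db[t] >= minimum)


if __name__ == "__main__":
    db = Counter()
    learn(db, ["cheap", "pills", "hello"])
    learn(db, ["hello", "meeting"])
    print(len(db))                                          # stored tokens
    print(tokens_for_scoring(db, ["hello", "pills"], True))
    print(tokens_for_scoring(db, ["hello", "pills"], False))
```

In this model the option changes only which stored tokens are consulted, which matches the argument above that it cannot by itself change the database size.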
Re: Bayes auto-learning a bad idea?
On Wed, 28 Sep 2011 14:30:32 +0200, Lars Jørgensen wrote:
> On 28-09-2011 13:20, Benny Pedersen wrote:
>>> I train Bayes manually on the borderline cases, but also have
>>> auto-learning enabled. Is that really a bad idea? Should I disable it,
>>> delete the bayes-databases and start over on manual-only learning?
>> no training is always good
> Are you missing a comma? Do you mean "no, training is always good" or
> "no training is always good"?

No, it's just that my boolean algebra and English are bad :)

>> what score are you learning on? Default is 0.1 and 12.0, I have
>> changed them here to -4 and 14
> Can't find any settings to that effect, so I guess I am using defaults.
> I have entered your settings in my config now.

perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold

> Looking at
> http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html#learning_options
> i see an option called "bayes_use_hapaxes" that promises significantly
> better hit-rates, but also increases database size by a factor of 8 to
> 10. What is the recommendation on this?

I don't know for sure what is best there; I am using the default here.

perldoc Mail::SpamAssassin::Plugin::Bayes
perldoc Mail::SpamAssassin::Conf

For 3.3.1 and above I add this in local.cf:

bayes_auto_learn_on_error 1

This reduces Bayes poisoning and load.

> If throughput is a factor in this decision, we are scanning about 60,000
> to 90,000 mails a day.

That is more than my server handles now.

>> what plugins have you enabled?
> DCC pyzor/razor SpamCop AutoLearnThreshold TextCat MIMEHeader ReplaceTags
> DKIM Check HTTPSMismatch URIDetail Bayes All the EvalTest plugins VBounce
> ImageInfo FreeMail
>> 3rd party rules or just default sa 3.3.2?
> Default and Sought Rules.

Those should be safe enough not to cause any problem for Bayes.

Tip: if you'd like to restart Bayes learning, you can do it like this. Run

sa-learn --dump magic

then set

bayes_min_ham_num (Default: 200)
bayes_min_spam_num (Default: 200)

to 200 more than the counts listed in the dump magic output. This ensures that Bayes goes back into learning mode.
Re: Bayes auto-learning a bad idea?
On 28-09-2011 13:20, Benny Pedersen wrote:
>> I train Bayes manually on the borderline cases, but also have
>> auto-learning enabled. Is that really a bad idea? Should I disable it,
>> delete the bayes-databases and start over on manual-only learning?
> no training is always good

Are you missing a comma? Do you mean "no, training is always good" or "no training is always good"?

> what score are you learning on? Default is 0.1 and 12.0, I have
> changed them here to -4 and 14

Can't find any settings to that effect, so I guess I am using defaults. I have entered your settings in my config now.

Looking at
http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html#learning_options
I see an option called "bayes_use_hapaxes" that promises significantly better hit-rates, but also increases database size by a factor of 8 to 10. What is the recommendation on this?

If throughput is a factor in this decision, we are scanning about 60,000 to 90,000 mails a day.

> what plugins have you enabled?

DCC pyzor/razor SpamCop AutoLearnThreshold TextCat MIMEHeader ReplaceTags DKIM Check HTTPSMismatch URIDetail Bayes All the EvalTest plugins VBounce ImageInfo FreeMail

> 3rd party rules or just default sa 3.3.2?

Default and Sought Rules.

--
Lars
Re: Bayes auto-learning a bad idea?
On Wed, 28 Sep 2011 10:07:55 +0200, Lars Jørgensen wrote:
> Hi,
>
> Not sure if this is the correct forum, but google couldn't help me (or I
> am too low on caffeine).
>
> I get a lot of spam that would have been flagged as such, but a bayes
> score of -1.9 pulls it down to hammy status.
>
> I train Bayes manually on the borderline cases, but also have
> auto-learning enabled. Is that really a bad idea? Should I disable it,
> delete the bayes-databases and start over on manual-only learning?

no training is always good, it's more that Bayes being unsure is the problem. When it auto-learns, it does so on the whole content and headers, so the more headers and content there are to scan, the better Bayes can track what you want as ham/spam.

What score are you learning on? Default is 0.1 and 12.0, I have changed them here to -4 and 14.

What plugins have you enabled?

3rd party rules or just default sa 3.3.2?
Bayes auto-learning a bad idea?
Hi, Not sure if this is the correct forum, but google couldn't help me (or I am too low on caffeine). I get a lot of spam that would have been flagged as such, but a bayes score of -1.9 pulls it down to hammy status. I train Bayes manually on the borderline cases, but also have auto-learning enabled. Is that really a bad idea? Should I disable it, delete the bayes-databases and start over on manual-only learning? -- Lars
Re: prevent rule from being considered for Bayes auto-learning
On 2010/10/21 12:17 PM, Michael Scheidell wrote:
> we decided that we didn't too much care to auto learn as 'not spam',
> emails sent from marketing companies, (because the reverse is true for
> auto learn ham) thus:
>
> aa_scores.cf:tflags RCVD_IN_DNSWL_HI net nice noautolearn
> aa_scores.cf:tflags RCVD_IN_DNSWL_MED net nice noautolearn
> aa_scores.cf:tflags RCVD_IN_DNSWL_LOW net nice noautolearn
> aa_scores.cf:tflags RCVD_IN_RP_SAFE net nice noautolearn
> aa_scores.cf:tflags RCVD_IN_RP_CERTIFIED net nice noautolearn

I actually filed a bug on this...

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

--
/Jason
Re: prevent rule from being considered for Bayes auto-learning
On 21/10/2010 2:17 PM, Karsten Bräckelmann wrote: On Thu, 2010-10-21 at 18:39 +0200, Karsten Bräckelmann wrote: See M::SA::Plugin::AutoLearnThreshold. In a nutshell, (a) there are a few tflags that will prevent a rule's score to be used for auto-learning and (b) the score used is picked from the respective non-bayes score-set. With (a) you can make a rule invisible to the auto-learning decision. And by setting the scores for score-set 0 and 1 both to 0 as per (b), you can effectively disable a rule unless Bayes is enabled. ... *and* have that rule "ignored" for the auto-learning decision, if Bayes and auto-learn is enabled. (Actually not ignored, but adding zero doesn't influence the result. ;) The tflags way is much more straight forward, though. You cannot, however, create a rule to conditionally prevent auto- learning altogether (which, as I understand isn't what you had in mind anyway). Thanks everyone, I have set the rule to noautolearn using the tflags directive (this is what I wanted, for the rule to simply not be considered when auto-learning). - Lawrence
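Point (b) above relies on SpamAssassin's four score sets: a `score` line can carry four values, for set 0 (no Bayes, no network tests), set 1 (no Bayes, network tests), set 2 (Bayes, no network tests) and set 3 (Bayes and network tests). The sketch below models how zeroing sets 0 and 1 makes a rule count only when Bayes is on, while contributing nothing to the auto-learn decision. The helper names and the example rule scores are hypothetical, and the bit arithmetic is a simplified model of the selection, not the plugin's code.

```python
def active_score_set(use_bayes, use_net):
    """Index of the score set used for the final verdict (0 to 3)."""
    return (2 if use_bayes else 0) + (1 if use_net else 0)


def autolearn_score_set(active):
    """The auto-learn decision uses the matching non-Bayes set (0 or 1)."""
    return active & ~2


# Hypothetical rule: scores for sets 0 and 1 are zero, sets 2 and 3 are 3.0.
HYPOTHETICAL_RULE_SCORES = (0.0, 0.0, 3.0, 3.0)

if __name__ == "__main__":
    active = active_score_set(use_bayes=True, use_net=True)
    print("verdict score:", HYPOTHETICAL_RULE_SCORES[active])
    print("auto-learn score:", HYPOTHETICAL_RULE_SCORES[autolearn_score_set(active)])
```

As noted in the thread, `tflags RULE noautolearn` reaches the same goal more directly and keeps the rule active when Bayes is disabled.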
Re: prevent rule from being considered for Bayes auto-learning
On Thu, 2010-10-21 at 18:39 +0200, Karsten Bräckelmann wrote:
> See M::SA::Plugin::AutoLearnThreshold. In a nutshell, (a) there are a
> few tflags that will prevent a rule's score to be used for auto-learning
> and (b) the score used is picked from the respective non-bayes
> score-set.
>
> With (a) you can make a rule invisible to the auto-learning decision.
> And by setting the scores for score-set 0 and 1 both to 0 as per (b),
> you can effectively disable a rule unless Bayes is enabled.

... *and* have that rule "ignored" for the auto-learning decision, if Bayes and auto-learn is enabled. (Actually not ignored, but adding zero doesn't influence the result. ;)

The tflags way is much more straightforward, though.

> You cannot, however, create a rule to conditionally prevent
> auto-learning altogether (which, as I understand isn't what you had in
> mind anyway).
Re: prevent rule from being considered for Bayes auto-learning
On Thu, 2010-10-21 at 13:27 -0230, Lawrence @ Rogers wrote:
> I recall reading somewhere that there is a way to prevent a rule from
> being considered for Bayes auto-learning. I am trying to create a rule
  ^ ^
> that hits upon some obvious spam that I am seeing, yet I want to make
> sure (for now) that any scores it assigns are not used for anything
> Bayes-related. I cannot seem to find any documentation on how to do this
> (Google doesn't help). I think it is something to do with setting a
> tflag, but any guidance would be appreciated.
  ^

Yup, that's correct. Though your google-fu today... The three marked strings from your own description lead to perfect documentation. :)

See M::SA::Plugin::AutoLearnThreshold. In a nutshell, (a) there are a few tflags that will prevent a rule's score to be used for auto-learning and (b) the score used is picked from the respective non-bayes score-set.

With (a) you can make a rule invisible to the auto-learning decision. And by setting the scores for score-set 0 and 1 both to 0 as per (b), you can effectively disable a rule unless Bayes is enabled.

You cannot, however, create a rule to conditionally prevent auto-learning altogether (which, as I understand isn't what you had in mind anyway).
Re: prevent rule from being considered for Bayes auto-learning
On 10/21/10 11:57 AM, Lawrence @ Rogers wrote:
> Hi,
>
> I recall reading somewhere that there is a way to prevent a rule from
> being considered for Bayes auto-learning. I am trying to create a rule
> that hits upon some obvious spam that I am seeing, yet I want to make
> sure (for now) that any scores it assigns are not used for anything
> Bayes-related. I cannot seem to find any documentation on how to do
> this (Google doesn't help). I think it is something to do with setting
> a tflag, but any guidance would be appreciated.

You can prevent your rule from being considered in the DECISION as to whether the tokens in the email are auto-learned, with tflags noautolearn. I don't know of any flag that would prevent the rule itself, so an example:

rule1 hits 15 points
rule2 (bayes) hits 4 points
total is 19 points.

If rule2 has the noautolearn flag, then the 'do we auto learn this' score is only 15 points. If your threshold is > 15.1, then the whole email is not considered for auto learning.

If rule3 hits 2 points, your total score is 21 points, but the 'decision' score is at 17 points now, and the whole email is autolearned as spam.
We decided that we didn't much care to auto learn as 'not spam' emails sent from marketing companies (because the reverse is true for auto learn ham), thus:

aa_scores.cf:tflags RCVD_IN_DNSWL_HI net nice noautolearn
aa_scores.cf:tflags RCVD_IN_DNSWL_MED net nice noautolearn
aa_scores.cf:tflags RCVD_IN_DNSWL_LOW net nice noautolearn
aa_scores.cf:tflags RCVD_IN_RP_SAFE net nice noautolearn
aa_scores.cf:tflags RCVD_IN_RP_CERTIFIED net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_UT_CPR_MAT net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_UT_CPR_30 net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_UT_CPEAR net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_UNVERIFIED_2 net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_UNVERIFIED_1 net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_SPF net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_SENDERID net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_RDNS net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_OPTIN_LT50 net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_OPTIN_GT50 net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_OPTIN net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_ML_DOPTIN net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_MI_CPR_MAT net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_MI_CPR_30 net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_MI_CPEAR net nice noautolearn

> Regards,
> Lawrence Williams
> LCWSoft
> www.lcwsoft.com

--
Michael Scheidell, CTO
o: 561-999-5000 d: 561-948-2259 ISN: 1259*1300
SECNAP Network Security Corporation
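The rule1/rule2/rule3 arithmetic in the message above can be replayed in a short Python sketch. It is only a model of the decision as described there: the function name and the data layout are made up for the example, 15.1 is the illustrative threshold from the message rather than the default, and the real plugin applies further conditions that are left out here.

```python
SPAM_LEARN_THRESHOLD = 15.1  # illustrative value from the message, not the default


def autolearn_points(hits):
    """Return (total score, auto-learn decision score) for a list of rule hits.

    hits: iterable of (rule name, score, set of tflags). Rules flagged
    noautolearn still count towards the total, but not towards the decision.
    """
    total = sum(score for _, score, _ in hits)
    decision = sum(score for _, score, flags in hits if "noautolearn" not in flags)
    return total, decision


if __name__ == "__main__":
    hits = [("rule1", 15.0, set()), ("rule2", 4.0, {"noautolearn"})]
    total, decision = autolearn_points(hits)
    print(total, decision, decision >= SPAM_LEARN_THRESHOLD)

    hits.append(("rule3", 2.0, set()))
    total, decision = autolearn_points(hits)
    print(total, decision, decision >= SPAM_LEARN_THRESHOLD)
```

The first message scores 19 points in total but only 15 for the decision, so it is not learned. Once rule3 adds 2 points, the decision score of 17 crosses the threshold.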
RE: prevent rule from being considered for Bayes auto-learning
Lawrence @ Rogers wrote:
> Hi,
>
> I recall reading somewhere that there is a way to prevent a rule from
> being considered for Bayes auto-learning. I am trying to create a
> rule that hits upon some obvious spam that I am seeing, yet I want to
> make sure (for now) that any scores it assigns are not used for
> anything Bayes-related. I cannot seem to find any documentation on
> how to do this (Google doesn't help). I think it is something to do
> with setting a tflag, but any guidance would be appreciated.

I think you're looking for this:

tflags YOUR_RULENAME noautolearn

HTH...

...Kevin
--
Kevin Miller                Registered Linux User No: 307357
CBJ MIS Dept.               Network Systems Admin., Mail Admin.
155 South Seward Street     ph: (907) 586-0242
Juneau, Alaska 99801        fax: (907) 586-4500
prevent rule from being considered for Bayes auto-learning
Hi, I recall reading somewhere that there is a way to prevent a rule from being considered for Bayes auto-learning. I am trying to create a rule that hits upon some obvious spam that I am seeing, yet I want to make sure (for now) that any scores it assigns are not used for anything Bayes-related. I cannot seem to find any documentation on how to do this (Google doesn't help). I think it is something to do with setting a tflag, but any guidance would be appreciated. Regards, Lawrence Williams LCWSoft www.lcwsoft.com
Re: Mailbox for auto learning
Le mardi 11 août 2009 05:12:05, Cedric Knight a écrit : > Luis Daniel Lucio Quiroz wrote: > > Le lundi 10 août 2009 19:15:15, Cedric Knight a écrit : > >> Stefan wrote: > > [...] > > >>> You have to forward the message as an attachment un unpack it after > >>> receiving. Have a look at: > >>> https://po2.uni-stuttgart.de/~rusjako/sal-wrapper > >> > >> Yes, I find this approach works well. It's the simplest way for me to > >> train Bayes, and most users can cope with it, providing they're not > >> using Outlook 2003/XP which can't forward as an attachment. But > >> Thunderbird, Outlook Express, Squirrelmail and Pine all can easily. > >> It's not as simple as a 'This Is Spam' button perhaps, and that's a > >> *good* thing. Requiring a little bit of thought stops people using it > >> as an alternative to the delete key for 'OK, perhaps I did subscribe to > >> this but I don't want it now'. > > [...] > > > Yes but problem is that 99% of users are about using some kind of outlook > > Well then, tell them not to :) Outlook Express and Windows Mail are > fine. Outlook 2003 supposedly needs a special program like > http://www.olspamcop.org/ to forward properly, although if you select > multiple messages to forward, then it will forward them in some kind of > possibly useful digest format. Outlook 2007 introduces an explicit menu > item called "forward as an attachment" (Ctrl+Alt+F) but still mangles > the headers: > http://forum.spamcop.net/forums/index.php?showtopic=10241&st=0&p=70453ent >ry70453 > > Outlook 2007 also mangles the headers (kind of reconstructing a > misleading semblance of what the original was) when moving between IMAP > folders. Therefore, I wouldn't use spamassassin -r on spam from Outlook > users, but sa-learn to get tokens from the body text may be OK. > > Actually, some users of Outlook 2003 do seem to be able to forward as > intact message/rfc822 attachment. Not exactly sure how. 
>
> Anyway, the 1% using a better e-mail program may be all that's needed to
> train Bayes.
>
> CK

Thanks. I resolved it with an altermime+postfix solution: I look at my X-quarantine header to get the mail_id and then I add that file. Rustic, but it works.

LD
Re: Mailbox for auto learning
Luis Daniel Lucio Quiroz wrote: > Le lundi 10 août 2009 19:15:15, Cedric Knight a écrit : >> Stefan wrote: [...] >>> You have to forward the message as an attachment un unpack it after >>> receiving. Have a look at: >>> https://po2.uni-stuttgart.de/~rusjako/sal-wrapper >> Yes, I find this approach works well. It's the simplest way for me to >> train Bayes, and most users can cope with it, providing they're not >> using Outlook 2003/XP which can't forward as an attachment. But >> Thunderbird, Outlook Express, Squirrelmail and Pine all can easily. >> It's not as simple as a 'This Is Spam' button perhaps, and that's a >> *good* thing. Requiring a little bit of thought stops people using it >> as an alternative to the delete key for 'OK, perhaps I did subscribe to >> this but I don't want it now'. [...] > Yes but problem is that 99% of users are about using some kind of outlook Well then, tell them not to :) Outlook Express and Windows Mail are fine. Outlook 2003 supposedly needs a special program like http://www.olspamcop.org/ to forward properly, although if you select multiple messages to forward, then it will forward them in some kind of possibly useful digest format. Outlook 2007 introduces an explicit menu item called "forward as an attachment" (Ctrl+Alt+F) but still mangles the headers: http://forum.spamcop.net/forums/index.php?showtopic=10241&st=0&p=70453entry70453 Outlook 2007 also mangles the headers (kind of reconstructing a misleading semblance of what the original was) when moving between IMAP folders. Therefore, I wouldn't use spamassassin -r on spam from Outlook users, but sa-learn to get tokens from the body text may be OK. Actually, some users of Outlook 2003 do seem to be able to forward as intact message/rfc822 attachment. Not exactly sure how. Anyway, the 1% using a better e-mail program may be all that's needed to train Bayes. CK
Re: Mailbox for auto learning
Le lundi 10 août 2009 19:15:15, Cedric Knight a écrit : > Stefan wrote: > > Am Sonntag, 9. August 2009 07:36:54 schrieb Luis Daniel Lucio Quiroz: > >> Hi SAs, > >> > >> Well, after reading this link > >> http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still > >> looking for an easy-way to let my mortal users to train our antispam. I > >> was thinking a mailbox such as h...@antispamserver and > >> s...@antispamserver to let users to forward their false positivos or > >> their false netgatives. In isde each box (ham or spam), of course a > >> procmail with sa-learn input will be forwarded. > >> > >> My doubts are nexts: > >> 1. Will forwarded mails be usefull for training, I mean if spam was: > >> From: spa...@example.netTo: u...@mydomain, when forwarding it will > >> be From: mu...@mydomain To: s...@antispamserver. Change of this and > >> forwarding (getting rid of headers because mail-clients) wont change > >> learning? > > > > You have to forward the message as an attachment un unpack it after > > receiving. Have a look at: > > https://po2.uni-stuttgart.de/~rusjako/sal-wrapper > > Yes, I find this approach works well. It's the simplest way for me to > train Bayes, and most users can cope with it, providing they're not > using Outlook 2003/XP which can't forward as an attachment. But > Thunderbird, Outlook Express, Squirrelmail and Pine all can easily. > It's not as simple as a 'This Is Spam' button perhaps, and that's a > *good* thing. Requiring a little bit of thought stops people using it > as an alternative to the delete key for 'OK, perhaps I did subscribe to > this but I don't want it now'. 
> > My script is very similar to sal-wrapper, using Postfix > check_recipient_access to ensure only authenticated users can send to > the reporting address; triggered from procmail; using MIME::Parser to > extract (possibly multiple) message/rfc822 attachments; feed through > sa-learn --ham or spamassassin -r as appropriate and send an > acknowledgement back to the user, to remind them to also send > spam/non-spam to the corresponding address and correct any mistakes. > > One thing I notice from sal-wrapper however is that it pipes the header > and body to sa-learn without passing a file as parameter. I found that > although sa-learn didn't complain, this didn't work at all well, and > quite short ham messages were scoring BAYES_99. You can pipe to > spamassassin -r just like you can to spamassassin in any other mode, but > I think if you pipe to sa-learn, you need to do it as > sa-learn --ham - > with the '-' as parameter, so it reads the standard input. > Alternatively feed it a temporary message file. Or am I misreading > something? > > CK Yes, but the problem is that 99% of users are using some kind of Outlook.
Re: Mailbox for auto learning
Stefan wrote: > On Sunday, 9 August 2009 07:36:54, Luis Daniel Lucio Quiroz wrote: >> Hi SAs, >> >> Well, after reading this link >> http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still >> looking for an easy-way to let my mortal users to train our antispam. I >> was thinking a mailbox such as h...@antispamserver and s...@antispamserver >> to let users to forward their false positivos or their false netgatives. >> In isde each box (ham or spam), of course a procmail with sa-learn input >> will be forwarded. >> >> My doubts are nexts: >> 1. Will forwarded mails be usefull for training, I mean if spam was: From: >> spa...@example.net To: u...@mydomain, when forwarding it will be From: >> mu...@mydomain To: s...@antispamserver. Change of this and forwarding >> (getting rid of headers because mail-clients) wont change learning? > > You have to forward the message as an attachment and unpack it after > receiving. > Have a look at: > https://po2.uni-stuttgart.de/~rusjako/sal-wrapper Yes, I find this approach works well. It's the simplest way for me to train Bayes, and most users can cope with it, providing they're not using Outlook 2003/XP which can't forward as an attachment. But Thunderbird, Outlook Express, Squirrelmail and Pine all can easily. It's not as simple as a 'This Is Spam' button perhaps, and that's a *good* thing. Requiring a little bit of thought stops people using it as an alternative to the delete key for 'OK, perhaps I did subscribe to this but I don't want it now'. My script is very similar to sal-wrapper, using Postfix check_recipient_access to ensure only authenticated users can send to the reporting address; triggered from procmail; using MIME::Parser to extract (possibly multiple) message/rfc822 attachments; feed through sa-learn --ham or spamassassin -r as appropriate and send an acknowledgement back to the user, to remind them to also send spam/non-spam to the corresponding address and correct any mistakes.
One thing I notice from sal-wrapper however is that it pipes the header and body to sa-learn without passing a file as parameter. I found that although sa-learn didn't complain, this didn't work at all well, and quite short ham messages were scoring BAYES_99. You can pipe to spamassassin -r just like you can to spamassassin in any other mode, but I think if you pipe to sa-learn, you need to do it as sa-learn --ham - with the '-' as parameter, so it reads the standard input. Alternatively feed it a temporary message file. Or am I misreading something? CK
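The unpacking step described above could look roughly like the following. This is a minimal Python sketch using only the standard library `email` package; it is not the sal-wrapper script, nor the MIME::Parser script mentioned in this message. The function names `extract_forwarded` and `learn` are made up for illustration, and the `sa-learn --ham -` / `sa-learn --spam -` invocation simply follows the form suggested here (with `-` so that standard input is read).

```python
import subprocess
from email import message_from_bytes


def extract_forwarded(raw: bytes) -> list:
    """Return every message/rfc822 attachment of a report mail as raw bytes.

    Note: walk() also descends into the attached messages, so an original
    that itself carries message/rfc822 attachments yields those as well.
    """
    report = message_from_bytes(raw)
    originals = []
    for part in report.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a list holding
            # exactly one Message object: the forwarded original.
            inner = part.get_payload(0)
            originals.append(inner.as_bytes())
    return originals


def learn(original: bytes, kind: str) -> None:
    """Pipe one extracted message to sa-learn (hypothetical helper).

    Assumes sa-learn is on PATH; the trailing '-' asks it to read stdin,
    as suggested in the message above.
    """
    if kind not in ("ham", "spam"):
        raise ValueError(f"kind must be 'ham' or 'spam', got {kind!r}")
    subprocess.run(["sa-learn", f"--{kind}", "-"], input=original, check=True)
```

A procmail recipe on the reporting address would pipe the incoming report to a small script that calls `extract_forwarded` and then `learn` once per extracted message. Only the attached original reaches Bayes; the headers of the user's forwarding mail are discarded.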
Re: Mailbox for auto learning
> Stefan wrote: > This may not be ideal, but in Thunderbird, you can drag > messages between mailboxes. You could setup each user to > have access to their own account and the two learning > mailboxes. You can then have your users drag the false > positives/negatives to the appropriate box. I have not > testing this 100%, so I don't know if any headers get > re-written or not. This is possible only when using IMAP. Not POP. When using IMAP, it is also possible to use folders, no need for separate mailboxes. But there will be no difference in using mailboxes or folders, it just works. No header modifications take place on a message when dragging it from folder into another, or from mailbox to another. But as the OP thinks about separate mailboxes, I am afraid that is because he has no folders available. That must be because his users are tied to POP3.
Re: Mailbox for auto learning
Stefan wrote: Am Sonntag, 9. August 2009 07:36:54 schrieb Luis Daniel Lucio Quiroz: Hi SAs, Well, after reading this link http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still looking for an easy-way to let my mortal users to train our antispam. I was thinking a mailbox such as h...@antispamserver and s...@antispamserver to let users to forward their false positivos or their false netgatives. In isde each box (ham or spam), of course a procmail with sa-learn input will be forwarded. My doubts are nexts: 1. Will forwarded mails be usefull for training, I mean if spam was: From: spa...@example.netTo: u...@mydomain, when forwarding it will be From: mu...@mydomain To: s...@antispamserver. Change of this and forwarding (getting rid of headers because mail-clients) wont change learning? You have to forward the message as an attachment un unpack it after receiving. Have a look at: https://po2.uni-stuttgart.de/~rusjako/sal-wrappe 2. If technique in question 1 is usless, what other way would be nice to let user to report a false positive/negative for training. This may not be ideal, but in Thunderbird, you can drag messages between mailboxes. You could setup each user to have access to their own account and the two learning mailboxes. You can then have your users drag the false positives/negatives to the appropriate box. I have not testing this 100%, so I don't know if any headers get re-written or not. -- Dan Schaefer Web Developer/Systems Analyst Performance Administration Corp.
Re: Mailbox for auto learning
On Sunday, 9 August 2009 07:36:54, Luis Daniel Lucio Quiroz wrote: > Hi SAs, > > Well, after reading this link > http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still > looking for an easy-way to let my mortal users to train our antispam. I > was thinking a mailbox such as h...@antispamserver and s...@antispamserver > to let users to forward their false positivos or their false netgatives. > In isde each box (ham or spam), of course a procmail with sa-learn input > will be forwarded. > > My doubts are nexts: > 1. Will forwarded mails be usefull for training, I mean if spam was: From: > spa...@example.net To: u...@mydomain, when forwarding it will be From: > mu...@mydomain To: s...@antispamserver. Change of this and forwarding > (getting rid of headers because mail-clients) wont change learning? You have to forward the message as an attachment and unpack it after receiving. Have a look at: https://po2.uni-stuttgart.de/~rusjako/sal-wrapper > 2. If technique in question 1 is usless, what other way would be nice to > let user to report a false positive/negative for training. > > TIA > LD Greetings Stefan
Re: Mailbox for auto learning
On Sunday, 9 August 2009 10:56:59, Benny Pedersen wrote: > On Sun, 9 Aug 2009 00:36:54 -0500, Luis Daniel Lucio Quiroz > > > 1. Will forwarded mails be usefull for training, I mean if spam was: > > From: > > spa...@example.net To: u...@mydomain, when forwarding it will be > > From: > > mu...@mydomain To: s...@antispamserver. Change of this and forwarding > > (getting rid of headers because mail-clients) wont change learning? > > > > 2. If technique in question 1 is usless, what other way would be nice to > > let > > user to report a false positive/negative for training. > > dovecot-antispam solves it with dovecot > > all users need to do is move mail in imap to junk folder, in that task > dovecot-antispam call sa-learn > > this means no junk plugins to windows clients > > and last but not least no header changes > > mail that is moved out of the junk folder is learned as ham, intuitive > like an amiga :) Yes, but I have to plan for the worst-case scenario: POP users with MS Outlook. So I was thinking of using altermime to add something like this to the footer: "If you think this mail is spam, please click here" (and likewise for ham), where "here" is a link containing the message-id (I keep a copy of all mail). So, another question: will altering the mail by adding a footer affect SA learning?
Re: Mailbox for auto learning
On Sun, 9 Aug 2009 00:36:54 -0500, Luis Daniel Lucio Quiroz > 1. Will forwarded mails be usefull for training, I mean if spam was: From: > spa...@example.netTo: u...@mydomain, when forwarding it will be > From: > mu...@mydomain To: s...@antispamserver. Change of this and forwarding > (getting rid of headers because mail-clients) wont change learning? > > 2. If technique in question 1 is usless, what other way would be nice to > let > user to report a false positive/negative for training. dovecot-antispam solves it with dovecot all users need to do is move mail in imap to junk folder, in that task dovecot-antispam call sa-learn this means no junk plugins to windows clients and last but not least no header changes mail that is moved out of the junk folder is learned as ham, intuitive like an amiga :) -- Benny Pedersen
Re: Mailbox for auto learning
On Sun, 9 Aug 2009 00:36:54 -0500 Luis Daniel Lucio Quiroz wrote: > Hi SAs, > > Well, after reading this link > http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still > looking for an easy-way to let my mortal users to train our > antispam. If your users use webmail, imap etc , the most convenient approach is to have folders for learning spam and ham.
Re: Mailbox for auto learning
On Sunday, 9 August 2009 06:52:49, you wrote: > Luis Daniel Lucio Quiroz wrote: > > Hi SAs, > > > > Well, after reading this link > > http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still > > looking for an easy-way to let my mortal users to train our antispam. I > > was thinking a mailbox such as h...@antispamserver and > > s...@antispamserver to let users to forward their false positivos or > > their false netgatives. In isde each box (ham or spam), of course a > > procmail with sa-learn input will be forwarded. > > > > My doubts are nexts: > > 1. Will forwarded mails be usefull for training, I mean if spam was: > > From: spa...@example.net To: u...@mydomain, when forwarding it will > > be From: mu...@mydomain To: s...@antispamserver. Change of this and > > forwarding (getting rid of headers because mail-clients) wont change > > learning? > > Forwarded mails are NOT useful. > > You also neglected to mention the change of Received headers, and pretty > much every header in the message, the re-encoding of the body by your > mail client, etc. > > Since SA's bayes tokenizes headers, that's disastrous. > > > 2. If technique in question 1 is usless, what other way would be nice to > > let user to report a false positive/negative for training. > > In some cases you can have the client forward as attachment, and use a > mailbox that strips attachments and feeds them to sa-learn. As long as > the client being used forwards the entire original message, with > complete headers, this should work fine. > > > TIA > > > > LD I understand. And what if I use altermime to add a link that identifies the email in a quarantine? Will the text added by altermime change learning?
Re: Mailbox for auto learning
Luis Daniel Lucio Quiroz wrote: > Hi SAs, > > Well, after reading this link > http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still looking > for an easy-way to let my mortal users to train our antispam. I was thinking > a mailbox such as h...@antispamserver and s...@antispamserver to let users > to > forward their false positivos or their false netgatives. In isde each box > (ham or spam), of course a procmail with sa-learn input will be forwarded. > > My doubts are nexts: > 1. Will forwarded mails be usefull for training, I mean if spam was: From: > spa...@example.netTo: u...@mydomain, when forwarding it will be From: > mu...@mydomain To: s...@antispamserver. Change of this and forwarding > (getting rid of headers because mail-clients) wont change learning? > Forwarded mails are NOT useful. You also neglected to mention the change of Received headers, and pretty much every header in the message, the re-encoding of the body by your mail client, etc. Since SA's bayes tokenizes headers, that's disastrous. > 2. If technique in question 1 is usless, what other way would be nice to let > user to report a false positive/negative for training. In some cases you can have the client forward as attachment, and use a mailbox that strips attachments and feeds them to sa-learn. As long as the client being used forwards the entire original message, with complete headers, this should work fine. > > > TIA > > LD > > >
Mailbox for auto learning
Hi SAs, Well, after reading this link http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still looking for an easy way to let my mortal users train our antispam. I was thinking of mailboxes such as h...@antispamserver and s...@antispamserver to which users can forward their false positives or false negatives. Inside each box (ham or spam), of course, a procmail recipe would pass the mail to sa-learn. My questions are as follows: 1. Will forwarded mails be useful for training? I mean, if the spam was From: spa...@example.net To: u...@mydomain, when forwarded it will be From: mu...@mydomain To: s...@antispamserver. Won't this change, and the forwarding itself (mail clients getting rid of headers), affect learning? 2. If the technique in question 1 is useless, what other way would be good for letting users report a false positive/negative for training? TIA LD
Re: Spam auto-learning by "message resending"
Jerome Delamarche wrote: Hi, I'm configuring SA and I'm looking for an easy way for the end users to improve their own Bayesian filters. Users do not have interactive account on the Linux servers. They cannot use "sa-learn" or any other Linux tools. It could be fine if they could automatically resend to their own mailbox spams not been filtered by SA. SA could (?) determine it has already analyzed the message and automatically consider it was a previous spam. Then it could use the "auto-learn" feature to add it to the user spam database ? Or is there another way to do it ? If your users can use IMAP, you can create a special folder where they copy spam messages. The Linux server can sa-learn from these folders. Or, you can use a system on the Linux server, such as Maia Mailguard, that temporarily stores all message on the server and provides a web-interface for user training. Another option is to provide a special address that users forward spam messages to. The main problem here is that the message must be forwarded as an attachment in a way that a script on the Linux server can extract the attachment and get something reasonably close to the original spam. Thunderbird does a pretty good job with this. Outlook, not so much. -Stuart
Spam auto-learning by "message resending"
Hi, I'm configuring SA and I'm looking for an easy way for the end users to improve their own Bayesian filters. Users do not have interactive account on the Linux servers. They cannot use "sa-learn" or any other Linux tools. It could be fine if they could automatically resend to their own mailbox spams not been filtered by SA. SA could (?) determine it has already analyzed the message and automatically consider it was a previous spam. Then it could use the "auto-learn" feature to add it to the user spam database ? Or is there another way to do it ? Jerome
RE: Ham not auto-learning?
That sounds about right. I did get those thresholds from somewhere on this list, though, I believe. No biggie. Bayes has been pretty spot on so far (I can post the rules chart if anyone is interested.), so I'm pretty confident in allowing it to continue to learn. Thanks for your help. -- Matthew Yette Senior Engineer - NOC/Operations MA Polce Consulting, Inc. [EMAIL PROTECTED] 315-838-1644 (w) 315-356-0597 (f) AIM/Yahoo: MAPolceNOC MSN: [EMAIL PROTECTED] -Original Message- From: Craig McLean [mailto:[EMAIL PROTECTED] Sent: Friday, August 19, 2005 2:31 PM To: Matthew Yette Cc: users@spamassassin.apache.org Subject: Re: Ham not auto-learning? -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Matthew Yette wrote: | Running the sa-stats.pl version 0.9 that produces a chart with stats | on what rules are hit for spam and ham most frequently, I notice that | of all 13,411 autolearns performed, every one of them was for spam. | Ham has 0 messages autolearned. Wouldn't, for example, a message that | comes in and has been whitelisted (and therefore scoring ~ -100) be autolearned? | My bayes thresholds are set for 12.1 (spam) and -12.0(ham). Matthew, If I recall correctly, bayes learning thresholds are compared against a message score *before* whitelist adjustments are made, so unless a message scores -12 using just the standard rules (unlikely) it will never be learned as ham. Just set the ham threshold to 0 and you'll see any message hitting no positive scoring tests being learned as ham. Regards, Craig. -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.1 (GNU/Linux) iD8DBQFDBiVFMDDagS2VwJ4RAkBVAJ9IHh/KpJ3uZRG+pZYQ7Mo77cPiaQCgvEOw F4d9wRpAt5ZHl2jHGfSE7RQ= =cXb8 -END PGP SIGNATURE-
Re: Ham not auto-learning?
I'm going to guess that whitelist isn't taken into consideration. -12 for autolearning of ham is pretty extreme, I'm not surprised you aren't seeing any autolearning. The default is .1 On Aug 19, 2005, at 1:24 PM, Matthew Yette wrote: Running the sa-stats.pl version 0.9 that produces a chart with stats on what rules are hit for spam and ham most frequently, I notice that of all 13,411 autolearns performed, every one of them was for spam. Ham has 0 messages autolearned. Wouldn't, for example, a message that comes in and has been whitelisted (and therefore scoring ~ -100) be autolearned? My bayes thresholds are set for 12.1 (spam) and -12.0(ham). -- Matthew Yette Senior Engineer - NOC/Operations MA Polce Consulting, Inc. [EMAIL PROTECTED] 315-838-1644 (w) 315-356-0597 (f) AIM/Yahoo: MAPolceNOC MSN: [EMAIL PROTECTED] -- Steve Martin http://www.cheezmo.com/ Smart Calibration, LLC http://www.smartcalibration.com/ The Widescreen Movie Centerhttp://www.widemovies.com/ Letterboxed Movie TV Schedule http://www.widemovies.com/lbx.html
Re: Ham not auto-learning?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Matthew Yette wrote: | Running the sa-stats.pl version 0.9 that produces a chart with stats on | what rules are hit for spam and ham most frequently, I notice that of | all 13,411 autolearns performed, every one of them was for spam. Ham has | 0 messages autolearned. Wouldn't, for example, a message that comes in | and has been whitelisted (and therefore scoring ~ -100) be autolearned? | My bayes thresholds are set for 12.1 (spam) and -12.0(ham). Matthew, If I recall correctly, bayes learning thresholds are compared against a message score *before* whitelist adjustments are made, so unless a message scores -12 using just the standard rules (unlikely) it will never be learned as ham. Just set the ham threshold to 0 and you'll see any message hitting no positive scoring tests being learned as ham. Regards, Craig. -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.1 (GNU/Linux) iD8DBQFDBiVFMDDagS2VwJ4RAkBVAJ9IHh/KpJ3uZRG+pZYQ7Mo77cPiaQCgvEOw F4d9wRpAt5ZHl2jHGfSE7RQ= =cXb8 -END PGP SIGNATURE-
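To make the point about whitelist adjustments concrete, here is a small Python sketch of the decision as described in this reply. It is a simplified model, not SpamAssassin's actual code. It assumes that hits from rules flagged `userconf` (such as whitelist entries), `noautolearn` or `learn` are left out of the autolearn score, and that the comparison is "below the ham threshold" / "at or above the spam threshold"; the `Hit` type and function names are invented for illustration. The real plugin applies further conditions that are omitted here.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One rule hit on a message (illustrative type, not an SA structure)."""
    name: str
    score: float
    tflags: frozenset = frozenset()


# Assumption: rules carrying any of these tflags do not count for autolearn.
EXCLUDED_TFLAGS = frozenset({"userconf", "noautolearn", "learn"})


def autolearn_score(hits) -> float:
    """Sum the scores of the hits that are allowed to influence autolearn."""
    return sum(h.score for h in hits if not (h.tflags & EXCLUDED_TFLAGS))


def autolearn_decision(hits, ham_threshold=0.1, spam_threshold=12.0):
    """Return 'ham', 'spam' or None (do not learn) for a list of hits."""
    score = autolearn_score(hits)
    if score < ham_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return None


# A whitelisted mail scores about -100 overall, but the whitelist hit is
# ignored for autolearn, so a -12.0 ham threshold is never reached.
whitelisted = [Hit("USER_IN_WHITELIST", -100.0, frozenset({"userconf"}))]
print(autolearn_decision(whitelisted, ham_threshold=-12.0, spam_threshold=12.1))
print(autolearn_decision(whitelisted, ham_threshold=0.1, spam_threshold=12.1))
```

Under these assumptions the first call learns nothing and the second learns the message as ham, which matches the behaviour reported in the original question.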
Ham not auto-learning?
Running the sa-stats.pl version 0.9 that produces a chart with stats on what rules are hit for spam and ham most frequently, I notice that of all 13,411 autolearns performed, every one of them was for spam. Ham has 0 messages autolearned. Wouldn't, for example, a message that comes in and has been whitelisted (and therefore scoring ~ -100) be autolearned? My bayes thresholds are set for 12.1 (spam) and -12.0(ham). -- Matthew Yette Senior Engineer - NOC/Operations MA Polce Consulting, Inc. [EMAIL PROTECTED] 315-838-1644 (w) 315-356-0597 (f) AIM/Yahoo: MAPolceNOC MSN: [EMAIL PROTECTED]
Re: Testing Bayes (auto)-learning
Greg Abbas wrote: >Paul Boven chello.nl> writes: > > >>Yes, they're forwarding the messages as attachements, and yes, I'm >>stripping them out of the message/rfc822 attachements before feeding >>them to Bayes. And in all the tests I've done so far this seems to work, >>but now that we've upgraded to SA3.0.2 I can't peek 'under the hood' >>anymore to see if things are still being learned as they should. >> >> > >On a related note, if I grab messages from a maildir after >spamassassin has "quarantined" them ("The original message has >been attached to this so you can view it... yadda yadda") is >sa-learn smart enough to realize that the spam is contained in >the attachment? > > sa-learn is smart enough to undo any changes made by spamassassin itself, so if you use SA to do your tagging, sa-learn will undo it prior to learning. However, if you use a tool like amavis, mimedefang, or mailscanner and use that tool's own encapsulation methods instead of SA's, then sa-learn won't undo it.
Re: Testing Bayes (auto)-learning
Paul Boven chello.nl> writes: > Yes, they're forwarding the messages as attachements, and yes, I'm > stripping them out of the message/rfc822 attachements before feeding > them to Bayes. And in all the tests I've done so far this seems to work, > but now that we've upgraded to SA3.0.2 I can't peek 'under the hood' > anymore to see if things are still being learned as they should. On a related note, if I grab messages from a maildir after spamassassin has "quarantined" them ("The original message has been attached to this so you can view it... yadda yadda") is sa-learn smart enough to realize that the spam is contained in the attachment? Or is this the same situation as a user-forward, where I would need to write something to strip it out? And as an aside, I'm curious about "peeking under the hood" too, but in my case it's because I'm curious how many messages have been trained. (In order to find out how soon the filter is going to think the corpus is large enough to start using its bayes rules.) TIA. -g.
Re: Testing Bayes (auto)-learning
Hi Daryl, everyone, Daryl C. W. O'Shea wrote: Paul Boven wrote: My problem is that I have end-users that are basically claiming 'the more I send to the relearn-address, the lower the Bayes score seems to be getting.' The included headers seem to support that claim, so I really want to dig a bit deeper into the whole setup. That there sounds like your problem. How are your users sending mail to the 'relearn address'? If they're not forwarding messages as an attachment, and you're not stripping out these attached messages then it isn't going to work to your benefit, and you'll see the result you describe. Yes, they're forwarding the messages as attachments, and yes, I'm stripping them out of the message/rfc822 attachments before feeding them to Bayes. And in all the tests I've done so far this seems to work, but now that we've upgraded to SA 3.0.2 I can't peek 'under the hood' anymore to see if things are still being learned as they should. Regards, Paul Boven.
Re: Testing Bayes (auto)-learning
Paul Boven wrote: My problem is that I have end-users that are basically claiming 'the more I send to the relearn-address, the lower the Bayes score seems to be getting.' The included headers seem to support that claim, so I really want to dig a bit deeper into the whole setup. That there sounds like your problem. How are your users sending mail to the 'relearn address'? If they're not forwarding messages as an attachment, and you're not stripping out these attached messages then it isn't going to work to your benefit, and you'll see the result you describe. Daryl
Testing Bayes (auto)-learning
Hi everyone, There seem to be some learning problems with our Bayes database which I'm trying to track down. Given a particular spam message that got auto-trained as ham, then re-trained as spam, I would like to be able to do the following:
1.) Find out whether it's in the Bayes database or not, and whether it is there as ham or as spam. I can use Berkeley's tools to dump the bayes_seen database, but often the Message-ID isn't in there even though the message got learned; probably it is stored with an '@sa-generated' Message-ID. Given the original message, how can I determine which Message-ID Bayes is using to keep track of the message? When will it accept the original Message-ID, and when will it use the generated one? How can I determine the sa-generated Message-ID without running it through the learner again? How sensitive is the generated Message-ID to changes in Received: and other headers that happen when the mail gets returned to the learner?
2.) With the new SpamAssassin 3.0.2, I can no longer see what score a particular token has, because they are hashed. Is there an easy way to generate these hashes, or is there an interface that I can use to check the score for a token?
My problem is that I have end-users that are basically claiming 'the more I send to the relearn-address, the lower the Bayes score seems to be getting.' The included headers seem to support that claim, so I really want to dig a bit deeper into the whole setup. Regards, Paul Boven.
Is auto-learning working?
Hi, I'm new to SpamAssassin. I installed it on a Solaris 9 system, and it works fine. But there is one thing I don't understand: I configured auto-learning, but when I run spamd it doesn't create the bayes_* files. If I run sa-learn, then the files are created. How can I know whether auto-learning is working or not? What did I forget? My configuration:
# spamd --version
SpamAssassin Server version 3.0.2 running on Perl 5.8.3
# cat /etc/mail/spamassassin/local.cf
required_hits 5
rewrite_header Subject SPAM
report_safe 0
skip_rbl_checks 1
# Enable the Bayes system
use_bayes 1
# Enable Bayes auto-learning
bayes_auto_learn 1
bayes_path /etc/mail/spamassassin/bayes
bayes_file_mode 0666
Thanks in advance. Greetings. -- Michel
RE: Potential new auto-learning strategy
For various reasons (some political, some technical) we don't use Bayes here. It can be very frustrating, but I'm sure you guys know what it's like to have your hands tied by corporate wrangling. The reason I proposed a more complex logic than the one you suggest was to handle down-scoring rules that performed poorly as well as up-scoring effective rules. By using a fixed score, you run the risk of either setting it too low, so that the system takes too long to learn, or too high (it has been demonstrated that this can cause chaotic behaviour in some systems). By using a function that calculates X based on the overall score of the message and the other rules hit, and scaled down by the learn rate, the system can quickly cover a large gap, but when the distance between the two scores becomes small, the changes to the score values are appropriately small, tending the system towards stability (assuming spammers don't change tactics). Should two particular rules commonly occur together, this would also have the effect of balancing out score changes across them both, relative to their base values. I'd like to get into doing this, but work is swamped (I don't get to play with spam all day :( ). If there are other people keen on doing this then maybe we can get a collaboration going. R From: Chris Santerre [mailto:[EMAIL PROTECTED] Sent: 02 March 2005 15:16 To: Gray, Richard; users@spamassassin.apache.org Subject: RE: Potential new auto-learning strategy There has been a lot of talk about dynamic scoring. Most people argue that Bayes is a good substitute for it already. But not if you don't use Bayes ;) I think it's a worthy idea for testing. Although the logic could be fairly simple. Like using the top hitting rules script in a cron job, pulling out the N top rules and adding X points to them based on the hits. That's something I've wanted to play with, but had no time.
--Chris -Original Message- From: Gray, Richard [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 02, 2005 7:03 AM To: users@spamassassin.apache.org Subject: Potential new auto-learning strategy I saw an article a while back about some DJs who were using perl as a mixing tool by writing perl code that edited itself while it ran in a loop. I thought this was kind of cool. I studied AI at university, and remember a good bit of discussion regarding feedback systems. So, to combine the two, I was thinking of how to use SA in a similar structure, and propose a dynamic weighting system for SA rules. Consider the scores that a base installation of SA gives to its rules, but when shown messages to learn from, it modifies the score weighting of the rules rather than the Bayes system. I won't start a discussion regarding learning rates and so on, but I can imagine the logic being loosely based on how much influence the rule had on the total score, the distance of the final result from the spam/ham boundary, and the learning rate chosen by the administrator. Any feedback? R --- This email from dns has been validated by dnsMSS Managed Email Security and is free from all known viruses. For further information contact [EMAIL PROTECTED]
RE: Potential new auto-learning strategy
There has been a lot of talk about dynamic scoring. Most people argue that Bayes is a good substitute for it already. But not if you don't use Bayes ;)

I think it's a worthy idea for testing, although the logic could be fairly simple: for example, running the top-hitting-rules script in a cron job, pulling out the top N rules and adding X points to them based on the hits. That's something I've wanted to play with, but had no time.

--Chris

-----Original Message-----
From: Gray, Richard [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 02, 2005 7:03 AM
To: users@spamassassin.apache.org
Subject: Potential new auto-learning strategy

I saw an article a while back about some DJs who were using perl as a mixing tool by writing perl code that edited itself while it ran in a loop. I thought this was kind of cool.

I studied AI at university, and remember a good bit of discussion regarding feedback systems. So, to combine the two, I was thinking of how to use SA in a similar structure, and propose a dynamic weighting system for SA rules.

Consider the scores that a base installation of SA gives to its rules, but when shown messages to learn from, it modifies the score weighting of the rules rather than the bayes system. I'll not throw out a discussion regarding learning rates and so on, but I can imagine the logic being loosely based on how much influence the rule had on the total score, the distance of the final result from the spam/ham boundary, and the learning rate chosen by the administrator.

Any feedback?

R

---
This email from dns has been validated by dnsMSS Managed Email Security and is free from all known viruses. For further information contact [EMAIL PROTECTED]
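Chris's simpler variant (take the top N hitting rules and add X points based on the hits) might look something like the following. It is a rough sketch under stated assumptions: the rule-hit list is taken as already extracted from a spam corpus, `bump_top_rules` and its parameters are hypothetical names, and scaling the bump by each rule's hit count relative to the top rule is just one way to read "based on the hits".

```python
from collections import Counter


def bump_top_rules(scores, spam_hits, top_n=3, max_bump=0.5):
    """Return a copy of `scores` with the top-hitting spam rules raised.

    scores    -- dict mapping rule name to its current score
    spam_hits -- iterable of rule names, one entry per hit on a spam message
    top_n     -- how many of the most frequent rules to adjust
    max_bump  -- points added to the single most frequent rule; the other
                 selected rules get a share scaled by their hit count
    """
    ranked = Counter(spam_hits).most_common(top_n)
    updated = dict(scores)
    if not ranked:
        return updated

    top_count = ranked[0][1]
    for rule, count in ranked:
        updated[rule] = updated.get(rule, 0.0) + max_bump * count / top_count
    return updated


if __name__ == "__main__":
    current = {"RULE_A": 1.0, "RULE_B": 1.0, "RULE_C": 1.0}
    hits = ["RULE_A"] * 4 + ["RULE_B"] * 2 + ["RULE_C"]
    print(bump_top_rules(current, hits, top_n=2))
```

Unlike the error-driven approach discussed in the reply, this version only ever raises scores, so it would need a separate mechanism to down-score rules that perform poorly.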
Potential new auto-learning strategy
I saw an article a while back about some DJs who were using perl as a mixing tool by writing perl code that edited itself while it ran in a loop. I thought this was kind of cool.

I studied AI at university, and remember a good bit of discussion regarding feedback systems. So, to combine the two, I was thinking of how to use SA in a similar structure, and propose a dynamic weighting system for SA rules.

Consider the scores that a base installation of SA gives to its rules, but when shown messages to learn from, it modifies the score weighting of the rules rather than the bayes system. I'll not throw out a discussion regarding learning rates and so on, but I can imagine the logic being loosely based on how much influence the rule had on the total score, the distance of the final result from the spam/ham boundary, and the learning rate chosen by the administrator.

Any feedback?

R

---
This email from dns has been validated by dnsMSS Managed Email Security and is free from all known viruses. For further information contact [EMAIL PROTECTED]
RE: Auto learning
Hi,

required_hits 7
report_safe 0
rewrite_header Subject [SPAM]
bayes_auto_learn 1
skip_rbl_checks 0
use_razor2 1
use_dcc 1
use_pyzor 0
dns_available yes

I think I may have just sussed this. I just found a bayes db in /home/root/.spamassassin, whereas I have been testing things logged in as root and was looking at /root/.spamassassin. It is being updated! I was running things as root, so it was picking up a different database.

So now I have

-rw--- 1 spamd spamd 1.3M Feb 22 15:51 auto-whitelist
-rw--- 1 spamd spamd 3.6K Feb 22 15:51 bayes_journal
-rw--- 1 spamd spamd 652K Feb 22 15:51 bayes_seen
-rw--- 1 spamd spamd 5.3M Feb 22 15:51 bayes_toks

in my /home/spamd/.spamassassin folder.

If I run sa-learn -D --sync --dbpath /home/spamd/.spamassassin I still see

debug: bayes: 25894 tie-ing to DB file R/O /root/.spamassassin/bayes_toks
debug: bayes: 25894 tie-ing to DB file R/O /root/.spamassassin/bayes_seen
debug: bayes: found bayes db version 3
debug: bayes: Not available for scanning, only 0 spam(s) in Bayes DB < 200
debug: bayes: 25894 untie-ing
debug: bayes: 25894 untie-ing db_toks
debug: bayes: 25894 untie-ing db_seen
debug: Score set 0 chosen.
debug: Initialising learner
debug: Syncing Bayes and expiring old tokens...
debug: lock: 25894 created /home/spamd/.spamassassin/bayes.lock.localhost.localdomain.25894
debug: lock: 25894 trying to get lock on /home/spamd/.spamassassin/bayes with 0 retries
debug: lock: 25894 link to /home/spamd/.spamassassin/bayes.lock: link ok
debug: bayes: 25894 tie-ing to DB file R/W /home/spamd/.spamassassin/bayes_toks
debug: bayes: 25894 tie-ing to DB file R/W /home/spamd/.spamassassin/bayes_seen
debug: bayes: found bayes db version 3
debug: refresh: 25894 refresh /home/spamd/.spamassassin/bayes.lock
debug: refresh: 25894 refresh /home/spamd/.spamassassin/bayes.lock
synced Bayes databases from journal in 3 seconds: 1545 unique entries (1940 total entries)
debug: refresh: 25894 refresh /home/spamd/.spamassassin/bayes.lock
debug: refresh: 25894 refresh /home/spamd/.spamassassin/bayes.lock
debug: Syncing complete.
debug: bayes: 25894 untie-ing
debug: bayes: 25894 untie-ing db_toks
debug: bayes: 25894 untie-ing db_seen
debug: bayes: files locked, now unlocking lock
debug: unlock: 25894 unlink /home/spamd/.spamassassin/bayes.lock

I don't understand why, even though I specified the db path, /root/.spamassassin is still mentioned as well. Does it try to use both databases? It seems to see both databases.

I am seeing some bayes scoring now as well.

If I am using sa-learn, can I just add the --dbpath /home/spamd/.spamassassin option and it should update the correct db?

Thanks for all the help!

> -----Original Message-----
> From: Richard Ozer [mailto:[EMAIL PROTECTED]
> Sent: 22 February 2005 15:19
> To: Paul J. Smith
> Cc: users@spamassassin.apache.org
> Subject: Re: Auto learning
>
> Can you post your local.cf?
>
> Paul J. Smith wrote:
> > Still nothing. I set the owner on the bayes dbs to 'spamd', which is
> > the user the process is running under. I also set og+rw. Left
> > overnight, no change. Only 2 hams, despite the autolearn having
> > picked loads of hams out of the feed with 'autolearn=spam/ham'.
> > I've just deleted the databases with 'sa-learn --clear', then a
> > 'sa-learn --sync', and reset the permissions again to spamd. Still
> > nothing is getting added though, and I can't see any error messages,
> > even in debug mode.
> >
> > The output from sa-learn --sync -D is
> >
> > [EMAIL PROTECTED] .spamassassin]# sa-learn -D --sync
> > debug: SpamAssassin version 3.0.2
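To make it easier to see which database directories a debug run touches, and in which mode, the "tie-ing to DB file" lines from output like the above can be grouped by access mode. The following is a small sketch that assumes the debug line format shown in this thread (SpamAssassin 3.0.2); the helper name `bayes_dirs_by_mode` is made up for illustration, and other versions may log differently.

```python
import posixpath
import re
from collections import defaultdict

# Matches lines such as:
#   debug: bayes: 25894 tie-ing to DB file R/O /root/.spamassassin/bayes_toks
# The word boundary keeps "untie-ing" lines from matching.
TIE_LINE = re.compile(r"\btie-ing to DB file (R/O|R/W) (\S+)")


def bayes_dirs_by_mode(debug_output):
    """Map each access mode ('R/O', 'R/W') to the sorted directories opened."""
    dirs = defaultdict(set)
    for line in debug_output.splitlines():
        match = TIE_LINE.search(line)
        if match:
            mode, path = match.groups()
            dirs[mode].add(posixpath.dirname(path))
    return {mode: sorted(paths) for mode, paths in dirs.items()}


SAMPLE = """\
debug: bayes: 25894 tie-ing to DB file R/O /root/.spamassassin/bayes_toks
debug: bayes: 25894 tie-ing to DB file R/O /root/.spamassassin/bayes_seen
debug: bayes: 25894 untie-ing db_toks
debug: bayes: 25894 tie-ing to DB file R/W /home/spamd/.spamassassin/bayes_toks
debug: bayes: 25894 tie-ing to DB file R/W /home/spamd/.spamassassin/bayes_seen
"""

if __name__ == "__main__":
    for mode, paths in bayes_dirs_by_mode(SAMPLE).items():
        print(mode, paths)
```

On the log excerpt from this message, the read-only opens point at /root/.spamassassin while the read/write sync goes to /home/spamd/.spamassassin, which is the mismatch being asked about.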
RE: Auto learning
Thanks. I am running 'sa-learn' as root. But you've given me an idea. Maybe it's looking in /home/spamd for them when running under that user, and in /root/.spamassassin when I'm running as root? I've just specified the path to bayes in local.cf, so we'll see if that makes any difference.

From: Andy Jezierski [mailto:[EMAIL PROTECTED]
Sent: 22 February 2005 15:19
To: users@spamassassin.apache.org
Subject: RE: Auto learning

"Paul J. Smith" <[EMAIL PROTECTED]> wrote on 02/22/2005 01:41:28 AM:

> Still nothing. I set the owner on the bayes dbs to 'spamd', which is
> the user the process is running under. I also set og+rw. Left
> overnight, no change. Only 2 hams, despite the autolearn having
> picked loads of hams out of the feed with 'autolearn=spam/ham'.
> I've just deleted the databases with 'sa-learn --clear', then a
> 'sa-learn --sync', and reset the permissions again to spamd. Still
> nothing is getting added though, and I can't see any error messages,
> even in debug mode.
>
> The output from sa-learn --sync -D is
>
> [EMAIL PROTECTED] .spamassassin]# sa-learn -D --sync

[snip]

> debug: bayes: 25498 tie-ing to DB file R/O /root/.spamassassin/bayes_toks
> debug: bayes: 25498 tie-ing to DB file R/O /root/.spamassassin/bayes_seen
> debug: bayes: found bayes db version 3
> debug: bayes: Not available for scanning, only 0 spam(s) in Bayes DB < 200

[snip]

> Can anyone see anything wrong with this?
>
> I'm starting spamd with "-d -c -m5 -H -i 0.0.0.0 -A 192.168.0.0/24 -s local5"
>
> Can't understand how I got 2 hams in there in the first place!
>
> Thanks.

Are you sure you're using the correct bayes files? In the debug above, it shows the bayes files in /root/.spamassassin, yet you say that you're running sa under the spamd userid. On my system, my bayes files for the spamd userid are in /home/spamd/.spamassassin. May want to check that.

Andy

--
No virus found in this incoming message.
Checked by AVG Anti-Virus.
Version: 7.0.300 / Virus Database: 266.3.0 - Release Date: 21/02/2005
RE: Auto learning
"Paul J. Smith" <[EMAIL PROTECTED]> wrote on 02/22/2005 01:41:28 AM:

> Still nothing. I set the owner on the bayes dbs to 'spamd', which is
> the user the process is running under. I also set og+rw. Left
> overnight, no change. Only 2 hams, despite the autolearn having
> picked loads of hams out of the feed with 'autolearn=spam/ham'.
> I've just deleted the databases with 'sa-learn --clear', then a
> 'sa-learn --sync', and reset the permissions again to spamd. Still
> nothing is getting added though, and I can't see any error messages,
> even in debug mode.
>
> The output from sa-learn --sync -D is
>
> [EMAIL PROTECTED] .spamassassin]# sa-learn -D --sync

[snip]

> debug: bayes: 25498 tie-ing to DB file R/O /root/.spamassassin/bayes_toks
> debug: bayes: 25498 tie-ing to DB file R/O /root/.spamassassin/bayes_seen
> debug: bayes: found bayes db version 3
> debug: bayes: Not available for scanning, only 0 spam(s) in Bayes DB < 200

[snip]

> Can anyone see anything wrong with this?
>
> I'm starting spamd with "-d -c -m5 -H -i 0.0.0.0 -A 192.168.0.0/24 -s local5"
>
> Can't understand how I got 2 hams in there in the first place!
>
> Thanks.

Are you sure you're using the correct bayes files? In the debug above, it shows the bayes files in /root/.spamassassin, yet you say that you're running sa under the spamd userid. On my system, my bayes files for the spamd userid are in /home/spamd/.spamassassin. May want to check that.

Andy
Re: Auto learning
-ing to DB file R/W /root/.spamassassin/bayes_toks
debug: bayes: 25498 tie-ing to DB file R/W /root/.spamassassin/bayes_seen
debug: bayes: found bayes db version 3
debug: refresh: 25498 refresh /root/.spamassassin/bayes.lock
debug: Syncing complete.
debug: bayes: 25498 untie-ing
debug: bayes: 25498 untie-ing db_toks
debug: bayes: 25498 untie-ing db_seen
debug: bayes: files locked, now unlocking lock
debug: unlock: 25498 unlink /root/.spamassassin/bayes.lock

Can anyone see anything wrong with this?

I'm starting spamd with "-d -c -m5 -H -i 0.0.0.0 -A 192.168.0.0/24 -s local5"

Can't understand how I got 2 hams in there in the first place!

Thanks.

-----Original Message-----
From: Richard Ozer [mailto:[EMAIL PROTECTED]
Sent: 21 February 2005 21:58
To: Paul J. Smith
Cc: users@spamassassin.apache.org
Subject: Re: Auto learning

I had a similar issue and noticed that my bayes database files did not have the proper owner or permissions. That prevented auto learning from functioning.

RO

Paul J. Smith wrote:

Still setting up spamassassin. I've got it running and auto learning is enabled. It's been running all yesterday and overnight. I can see it has tried to auto learn a lot of ham/spam and I've fed it a load of spam as well. Bayes doesn't seem to have kicked in though, and if I do a sa-learn --sync -D I can see there are only 2 hams in there:

debug: bayes: 6344 tie-ing to DB file R/O /root/.spamassassin/bayes_toks
debug: bayes: 6344 tie-ing to DB file R/O /root/.spamassassin/bayes_seen
debug: bayes: found bayes db version 3
debug: bayes: Not available for scanning, only 2 ham(s) in Bayes DB < 200
debug: bayes: 6344 untie-ing
debug: bayes: 6344 untie-ing db_toks
debug: bayes: 6344 untie-ing db_seen
debug: Score set 0 chosen.
debug: Initialising learner

It's definitely autolearned far more than this. Does it not show here? Do I just have to wait longer, or are they being stored somewhere waiting for me to sa-learn them? I'm using spamd 3.0.2 remotely.

Thanks.
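Richard Ozer's point about owner and permissions can be checked mechanically. The sketch below is an illustration only: `check_bayes_files` is a hypothetical helper, it looks only at files named `bayes_*` in one directory, and it treats "owned by the expected uid and writable by the owner" as the criterion, which may be stricter or looser than what a given setup needs.

```python
import stat
from pathlib import Path


def check_bayes_files(db_dir, expected_uid):
    """Return a list of human-readable problems found in `db_dir`.

    db_dir       -- directory holding bayes_toks, bayes_seen, bayes_journal
    expected_uid -- numeric uid of the account the scanner runs as
    """
    problems = []
    for path in sorted(Path(db_dir).glob("bayes_*")):
        info = path.stat()
        if info.st_uid != expected_uid:
            problems.append(
                f"{path.name}: owned by uid {info.st_uid}, expected {expected_uid}"
            )
        elif not info.st_mode & stat.S_IWUSR:
            problems.append(f"{path.name}: owner has no write permission")
    return problems


if __name__ == "__main__":
    import os

    # Checks the current directory against the current user as a smoke test.
    print(check_bayes_files(".", os.getuid()))
```

For the setup in this thread one might call it with the directory /home/spamd/.spamassassin and the uid returned by `pwd.getpwnam("spamd").pw_uid`. An empty list means the ownership check found nothing wrong; it does not prove that spamd is actually using that directory.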
RE: Auto learning
s
debug: bayes: 25498 tie-ing to DB file R/W /root/.spamassassin/bayes_seen
debug: bayes: found bayes db version 3
debug: refresh: 25498 refresh /root/.spamassassin/bayes.lock
debug: Syncing complete.
debug: bayes: 25498 untie-ing
debug: bayes: 25498 untie-ing db_toks
debug: bayes: 25498 untie-ing db_seen
debug: bayes: files locked, now unlocking lock
debug: unlock: 25498 unlink /root/.spamassassin/bayes.lock

Can anyone see anything wrong with this?

I'm starting spamd with "-d -c -m5 -H -i 0.0.0.0 -A 192.168.0.0/24 -s local5"

Can't understand how I got 2 hams in there in the first place!

Thanks.

-----Original Message-----
From: Richard Ozer [mailto:[EMAIL PROTECTED]
Sent: 21 February 2005 21:58
To: Paul J. Smith
Cc: users@spamassassin.apache.org
Subject: Re: Auto learning

I had a similar issue and noticed that my bayes database files did not have the proper owner or permissions. That prevented auto learning from functioning.

RO

Paul J. Smith wrote:
> Still setting up spamassassin. I've got it running and auto learning is
> enabled. It's been running all yesterday and overnight. I can see it
> has tried to auto learn a lot of ham/spam and I've fed it a load of spam
> as well. Bayes doesn't seem to have kicked in though, and if I do a
> sa-learn --sync -D I can see there are only 2 hams in there:
>
> debug: bayes: 6344 tie-ing to DB file R/O /root/.spamassassin/bayes_toks
> debug: bayes: 6344 tie-ing to DB file R/O /root/.spamassassin/bayes_seen
> debug: bayes: found bayes db version 3
> debug: bayes: Not available for scanning, only 2 ham(s) in Bayes DB < 200
> debug: bayes: 6344 untie-ing
> debug: bayes: 6344 untie-ing db_toks
> debug: bayes: 6344 untie-ing db_seen
> debug: Score set 0 chosen.
> debug: Initialising learner
>
> It's definitely autolearned far more than this. Does it not show here?
> Do I just have to wait longer, or are they being stored somewhere waiting
> for me to sa-learn them? I'm using spamd 3.0.2 remotely.
>
> Thanks.