Training spam as ham and forwarding
Hi SA users,

I have a few messages in the quarantine that I need to train as ham, because they were incorrectly marked as spam. To do this, I added the following line (without the leading spaces) to the top of each file so it becomes a normal mbox message:

    From DUMMY-LINE Thu Jan 1 00:00:00 1970

Is this correct? I can now access and index the message with pine, whereas before it wasn't recognized as a normal email.

I'd also like to forward it to the intended recipient as an attachment, but the recipient can't read it as a normal email, only as plain text. How can I accomplish this? Are there mail tools (procmail or formail, I believe) that were designed to automate this?

Does anyone request ham from their users to train Bayes, or is autolearning typically the only way (or the only really effective way) to do this?

Also, on another note, how can I have all email destined for a particular user delivered to them, including spam? That's what all_spam_to is for, correct?

Thanks,
Alex
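For what it's worth, the From_-line trick can be scripted; here is a minimal shell sketch (the quarantine filename is just an example, and formail from the procmail suite should be able to do the same job):

```shell
# Prepend a dummy mbox "From " separator to a bare message file, so mail
# tools (pine, mbox-aware sa-learn, ...) treat it as a proper mbox entry.
# The path below is an example, not a real location.
msg="quarantine/12345.eml"
if ! head -n 1 "$msg" | grep -q '^From '; then
    { echo 'From DUMMY-LINE Thu Jan  1 00:00:00 1970'; cat "$msg"; } > "$msg.mbox"
else
    cp "$msg" "$msg.mbox"   # already has a From_ separator; keep it as-is
fi
```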
Re: corpus ham/spam balance
On 26-Aug-2009, at 10:53, Kris Deugau wrote:
> If you're running a sitewide AWL on any kind of scale beyond a few tens
> of domains and a couple hundred accounts, you should probably look at
> putting it in SQL - it's a *lot* easier to maintain there.

Is there a good writeup on doing this?
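(Not a full writeup, but the SQL README shipped with SpamAssassin covers this. The basic local.cf settings look roughly like the sketch below; the DSN, credentials, and table name are placeholders, and you still need to create the awl table from the shipped schema.)

```
auto_whitelist_factory Mail::SpamAssassin::SQLBasedAddrList
user_awl_dsn           DBI:mysql:spamassassin:localhost
user_awl_sql_username  sa_user
user_awl_sql_password  sa_password
user_awl_sql_table     awl
```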
RE: corpus ham/spam balance
On Wed, 2009-08-26 at 13:47 -0600, Savoy, Jim wrote:
> > Karsten wrote:
> > None. As I mentioned earlier today on this list, auto-learning takes
> > neither Bayes nor AWL into account.
>
> Ok thanks Karsten. I guess the change to -3.0 for ham is the only cause
> of my corpus coming into balance. Good to know.

Yup, definitely.

Also, I agree with the post by RW. By lowering the auto-learn ham threshold, you managed to get the ratio more sane. However, if you keep it there you won't really learn any ham, only spam. Raising the threshold back to the default is likely a good idea, with an occasional lowering to get the effect you just observed: bringing the ratio back to somewhat balanced.

--
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){
putchar(t[s]);h=m;s=0; }}}
Re: sa: lottery message scored hammy by bayes: sa-learn --dump magic
On Wed, 2009-08-26 at 17:28 -0400, Dennis German wrote:
> Thanks for the support.

Voluntary support. Please do keep the thread on-list, otherwise I'll stop volunteering. I'm not the only one who can help you.

> I do know enough not to paste spam in a posting.

And yet you did paste the payload of a scam. Hint: the SOUGHT (FRAUD) rule-set, which I feed myself, is designed to catch this sort of bare text.

> I've been trying to get our local support (midphase) to
> handle some of the problems they have created by moving us to
> another server.

Recall my question late last night, if you got it.

> Anyway, I haven't been manually doing training.
>
> Should I be doing training?
>
> Also did you notice the error "could not find site rules directory"?

Yes, I did. Quite odd. Part of the reason I pointed out your setup, and wondered aloud which user is scanning and which one you dumped the sa-learn stats for. Since you have not been training manually, these should be the same.

As I have pointed out a couple of times before, it is rather crucial to manually train on *low* scoring spam, exactly like this one. Such messages are not auto-trained.

> My user_prefs is always available at
> http://www.real-world-systems.com/mail/user_prefs.cgi
>
> Karsten Bräckelmann wrote:
> > On Tue, 2009-08-25 at 21:21 -0400, Dennis German wrote:
> > > sa-learn --dump magic
> > > config: could not find site rules directory
> > > 0.000  0       3       0  non-token data: bayes db version
> > > 0.000  0  262297       0  non-token data: nspam
> > > 0.000  0   24621       0  non-token data: nham
> > > 0.000  0  142776       0  non-token data: ntokens
> >
> > Recalling some fuzzy bits about your system and setup, I wonder if that
> > is the Bayes DB of the *scanning* user -- or a different one you have
> > been manually training.
RE: AWL q?
> I don't let that junk get past envelope stage:
>
> postmap -q "weekendhotdeals.info" mysql:/usr/local/etc/postfix/mysql-from_senders_rhsbl.cf
> 554 RHSBL_DOMAIN

I assume you are running some type of background process that generates the list of senders based upon some criteria. Can you share more?

I also use MySQL lookups for Postfix, though I'm in the process of converting some of the larger ones to memcached (with a preloader), so I can hit memcached first and fall back to the database only if necessary. I'm also looking for better ways to deal with spam.
Re: AWL q?
-- Original Message --
From: Gary Smith
Date: Wed, 26 Aug 2009 12:29:24 -0700

> I've been finding a lot of singletons in the AWL db for domains that are
> all spam. Is there a way to put an entire domain into the AWL, or set it
> up to give an average score for that domain?
>
> Obviously I can put this directly into the config file, but I'm looking
> for a less intrusive way to do this. What might be useful is an
> awl_domain table that manages the average for the domain/ip as well as
> just the single email.
>
> Anyway, is there a way to do this currently?
>
> Example of the database (I think I have like 500 for these guys now from
> this week):
>
> +----------+----------------------------------+-------+-------+----------+
> | username | email                            | ip    | count | totscore |
> +----------+----------------------------------+-------+-------+----------+
> | filter   | ajdiohxo...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
> | filter   | ajuxorpc...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
> | filter   | aqxkopmj...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
> | filter   | atjwoxps...@weekendhotdeals.info | 76.73 |     1 |   11.918 |
> | filter   | bckxiypg...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
> | filter   | beqrikuo...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
> | filter   | bkqrasni...@weekendhotdeals.info | 76.73 |     2 |   13.038 |
> | filter   | blyhovks...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
> | filter   | bsfmogqa...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
> | filter   | bsgjuulc...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
> +----------+----------------------------------+-------+-------+----------+

I don't let that junk get past envelope stage:

postmap -q "weekendhotdeals.info" mysql:/usr/local/etc/postfix/mysql-from_senders_rhsbl.cf
554 RHSBL_DOMAIN

Len
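(For readers wondering what such a map looks like: below is a rough sketch of a Postfix MySQL client file for that kind of sender-domain blocklist. The credentials, database, table, and column names are all hypothetical; the real file and query will differ.)

```
# mysql-from_senders_rhsbl.cf (illustrative only)
user     = postfix
password = secret
hosts    = localhost
dbname   = mail
query    = SELECT '554 RHSBL_DOMAIN' FROM from_senders_rhsbl WHERE domain = '%s'
```

It would then be wired in from main.cf with something like a check_sender_access mysql:/usr/local/etc/postfix/mysql-from_senders_rhsbl.cf restriction, so matching sender domains are rejected at the envelope stage.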
AWL q?
I've been finding a lot of singletons in the AWL db for domains that are all spam. Is there a way to put an entire domain into the AWL, or set it up to give an average score for that domain?

Obviously I can put this directly into the config file, but I'm looking for a less intrusive way to do this. What might be useful is an awl_domain table that manages the average for the domain/ip as well as just the single email.

Anyway, is there a way to do this currently?

Example of the database (I think I have like 500 for these guys now from this week):

+----------+----------------------------------+-------+-------+----------+
| username | email                            | ip    | count | totscore |
+----------+----------------------------------+-------+-------+----------+
| filter   | ajdiohxo...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
| filter   | ajuxorpc...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
| filter   | aqxkopmj...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
| filter   | atjwoxps...@weekendhotdeals.info | 76.73 |     1 |   11.918 |
| filter   | bckxiypg...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
| filter   | beqrikuo...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
| filter   | bkqrasni...@weekendhotdeals.info | 76.73 |     2 |   13.038 |
| filter   | blyhovks...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
| filter   | bsfmogqa...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
| filter   | bsgjuulc...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
+----------+----------------------------------+-------+-------+----------+
RE: corpus ham/spam balance
On Wed, 2009-08-26 at 11:04 -0600, Savoy, Jim wrote:
> I'm not sure how much effect adding AWL to my config helped with my
> corpus coming into balance [...]

None. As I mentioned earlier today on this list, auto-learning takes neither Bayes nor AWL into account.

60_awl.cf:  tflags AWL userconf noautolearn
Re: corpus ham/spam balance
On Wed, 26 Aug 2009 11:04:42 -0600, "Savoy, Jim" wrote:
> I'm not sure how much effect adding AWL to my config helped with my
> corpus coming into balance (perhaps it was only the change to my ham
> threshold that made the difference),

The AWL score isn't counted for autolearning, and neither is the Bayes score or whitelisting rules. So by taking the threshold down to -3.0, you won't be learning much ham at all unless you have a lot of good negative-scoring custom rules.

IMO, trying to find a particular ham threshold that brings the ratio into balance is not a good idea, because you can become far too selective in what you learn and end up learning the least useful candidates. Probably better to periodically push the threshold down to -100 until the numbers balance.
RE: corpus ham/spam balance
> kdg wrote:
> If you're running a sitewide AWL on any kind of scale beyond a few tens
> of domains and a couple hundred accounts, you should probably look at
> putting it in SQL - it's a *lot* easier to maintain there.

It is one domain, with 20,000 accounts. I will see about using SQL. Thanks.

> Well, by deleting the file, you purge all of the history various senders
> have acquired on your system - it may not do any harm, but it may also
> cause a few FPs for senders whose first message after the deletion
> happens to be spammier than usual.

I'm not sure how much effect adding AWL to my config helped with my corpus coming into balance (perhaps it was only the change to my ham threshold that made the difference), but my thinking was that I probably wouldn't get many/any FPs if the corpus is trained well, thus allowing me to just blow that file away every few months and start over.

 - jim -
Re: corpus ham/spam balance
Savoy, Jim wrote:
> I see that my auto-whitelist file is now quite large. In just 4 months,
> it has grown from 0 to 335 megs. Is there a way to pare that file down?

If you're running a sitewide AWL on any kind of scale beyond a few tens of domains and a couple hundred accounts, you should probably look at putting it in SQL - it's a *lot* easier to maintain there.

Beyond that, I've never seen a real solution better than the trim_whitelist script I adapted from the check_whitelist script way back with SA 2.5-something. I still use it on a couple of low-volume legacy domain servers to keep their BDB-based AWL from running away on me. Unfortunately, I've also seen a couple of reports on this list that it tends to hog memory and will probably bog down your mail system if you run it on a large AWL file. :/ I don't recall if there was any workaround posted.

Call it from cron regularly (I find daily works well). Note you'll need up to double the disk space occupied by the current file, as it completely rewrites the whole thing in order to actually reduce the disk usage - simply deleting entries won't shrink the file.

http://www.deepnet.cx/~kdeugau/spamtools/

> Or should I maybe just stop spamd, delete the file, and restart spamd?
> (will that do me any harm?)

Well, by deleting the file, you purge all of the history various senders have acquired on your system - it may not do any harm, but it may also cause a few FPs for senders whose first message after the deletion happens to be spammier than usual.

-kgd
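(The cron invocation mentioned above might look like the crontab entry below; the install path and schedule are assumptions, not the script's documented defaults.)

```
# Trim the sitewide AWL once a day, during the quiet hours.
15 3 * * *   /usr/local/bin/trim_whitelist
```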
corpus ham/spam balance
Hi all,

Back in the spring, someone mentioned that it is good to have your ham/spam ratio close to even. I have a site-wide set-up, and while it seemed to be working perfectly, I did notice that when I did an "sa-learn --dump magic", my ham-to-spam ratio was almost 7::1 (2,200,000 ham to only 330,000 spam).

I made two small changes to my set-up: 1) I changed the threshold for ham from -1.0 to -3.0, and 2) I turned on auto-whitelisting. It seems to have done the trick. On April 17 my ham/spam ratio was 6.68::1, and today (Aug 26) it sits at 1.95::1! (2,900,000 ham to 1,480,000 spam.)

I see that my auto-whitelist file is now quite large. In just 4 months, it has grown from 0 to 335 megs. Is there a way to pare that file down? Or should I maybe just stop spamd, delete the file, and restart spamd? (Will that do me any harm?)

Thanks.

 - jim -
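(That ratio check can be scripted. A rough sketch, assuming the usual "sa-learn --dump magic" output layout where the count is the third field of the nspam/nham lines:)

```shell
# Print the ham::spam ratio from the Bayes DB statistics.
sa-learn --dump magic | awk '
    / nspam$/ { nspam = $3 }
    / nham$/  { nham  = $3 }
    END       { printf "%.2f::1\n", nham / nspam }'
```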
Re: lottery message scored hammy by bayes
On Wed, 26 Aug 2009 15:23:01 CEST, Karsten Bräckelmann wrote:
> > X-Spam-testscores: BAYES_00=-2.599,HTML_MESSAGE=0.001,MISSING_HEADERS=5.7,
> >     SUBJ_ALL_CAPS=3.1,UPPERCASE_75_100=1.528
>
> That MISSING_HEADERS score is custom, and *way* off-base IMHO.

The question is more which header is missing.

-- xpoint
Re: lottery message scored hammy by bayes
On Tue, 2009-08-25 at 20:59 -0400, Dennis German wrote:
> email with this content:

Do *NOT* paste spam samples to the list. Use a pastebin or upload them to your own server and provide a link instead.

> X-Spam-testscores: BAYES_00=-2.599,HTML_MESSAGE=0.001,MISSING_HEADERS=5.7,
>     SUBJ_ALL_CAPS=3.1,UPPERCASE_75_100=1.528

That MISSING_HEADERS score is custom, and *way* off-base IMHO.
Re: sa: lottery message scored hammy by bayes: sa-learn --dump magic
On Tue, 2009-08-25 at 21:21 -0400, Dennis German wrote:
> sa-learn --dump magic
> config: could not find site rules directory
> 0.000  0       3       0  non-token data: bayes db version
> 0.000  0  262297       0  non-token data: nspam
> 0.000  0   24621       0  non-token data: nham
> 0.000  0  142776       0  non-token data: ntokens

Recalling some fuzzy bits about your system and setup, I wonder if that is the Bayes DB of the *scanning* user -- or a different one you have been manually training.
Auto-Learn Thresholds (was: lottery message scored hammy by bayes)
On Tue, 2009-08-25 at 22:13 -0400, Alex wrote:
> > If you're using autolearning, what are your learning thresholds?
>
> What do you recommend for thresholds? I'm considering using
> autolearning, but very concerned about corrupting the database. I
> think I would use something like +15 for spam.

I generally recommend the defaults, unless you *do* know you need something else. That's why they are defaults. That's <= 0.1 for ham and >= 12.0 for spam.

Keep in mind these scores are calculated using a non-Bayes score set, so they generally differ from the overall score of the message. Also, this does not take various specific rules' scores into account, like Bayes and AWL. Plus some more esoteric constraints. See the docs. [1]

> There are FNs on occasion in the 2.x range with low bayes numbers (or
> BAYES_50) that I wouldn't want to be tagged as ham. Should that be a
> concern?

No. Bayes auto-learning is *not* self-feeding. Any overall score of about 2 (with Bayes) is *very* unlikely to cross either threshold when using the respective non-Bayes score-set. Moreover, your concern is with a Bayes probability <= 50%, and thus a negative score for the BAYES hit. This hit is not considered for auto-learning, though, so as a first rule of thumb subtract that score again -- which yields a slightly higher score. Still nowhere close to the thresholds.

> Even mail that has been whitelisted could also contain spam, so would
> a ham threshold of like -100 work, or present the same problem?

60_whitelist.cf:  tflags USER_IN_WHITELIST userconf nice noautolearn

Again, as per the docs [1], whitelisting will not be considered for the decision whether to auto-learn or not.

  guenther

[1] http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html
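(For reference, a sketch of the relevant local.cf lines with those defaults spelled out, per the AutoLearnThreshold plugin documentation; only override them if you know you need to:)

```
bayes_auto_learn                     1
bayes_auto_learn_threshold_nonspam   0.1
bayes_auto_learn_threshold_spam      12.0
```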