bayes,imp and virtual users
Hi,

I'm using SA 3.2.5 with Horde/IMP and Postfix 2.5.5 with virtual users. My config is similar to this one: http://wiki.apache.org/spamassassin/IntegratedSpamdInPostfix

I want to use Bayes (SQL) auto-learning with virtual users, and this works as long as clients send through SMTP with a real email client. When users send through webmail (IMP), spamc does not pick up the correct username and the mail gets learned for the wrong user. It seems that spamc treats the recipient of the email as the sender (Bayes learns spam/ham for the recipient of the mail).

Config files

Postfix master.cf:

    # Incoming e-mail - tux.linuxmail.at (MX)
    xx.xx.xx.xx:25 inet n - n - - smtpd
      -o content_filter=spamassassin
    spamassassin unix - n n - - pipe
      flags=Rq user=vmail argv=/usr/bin/spamc -u ${us...@${domain} -e /usr/sbin/sendmail -oi -f ${sender} ${recipient}

SpamAssassin local.cf:

    use_bayes 1
    bayes_store_module Mail::SpamAssassin::BayesStore::SQL
    bayes_sql_dsn DBI:mysql:spamassassin:localhost
    bayes_sql_username user
    bayes_sql_password pwd
    bayes_auto_learn 1

spamd options (Debian-based, /etc/default/spamassassin):

    OPTIONS="--max-children 5 -d -q -u vmail --nouser-config --virtual-config-dir=/home/vmail/%d/%l"

When a user then sends an email from webmail to a recipient, I see the recipient's email address with spam and ham counts in my Bayes MySQL DB, but the mail should be learned for the user who sent it.

Seba
blacklist_from
Hi all,

I'm trying to blacklist email from 'Vegas Club Casino', which is being sent from different email addresses but always with the same 'From' in the header. I tried putting it in local.cf as

    blacklist_from Vegas Club Casino

but those mails keep coming. How can I filter on just the name in the From header, without using an email address?

Best regards,
Geert

PS: I'm using spamassassin-3.2.5-1.el4.rf with score 3
Re: blacklist_from
On 04.03.09 10:10, Geert Batsleer wrote:
> I'm trying to blacklist email from 'Vegas Club Casino', which is being sent from different email addresses but always with the same 'From' in the header. Tried putting it in local.cf as blacklist_from Vegas Club Casino but those mails keep coming.

If you look at the docs, you'll see that blacklist_* only applies to addresses.

> How can I filter just on the From tag without using an email address but the name.

A rule will be needed.

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
We are but packets in the Internet of life (userfriendly.org)
Re: blacklist_from
On 3/4/2009 10:18 AM, Matus UHLAR - fantomas wrote:
> On 04.03.09 10:10, Geert Batsleer wrote:
>> I'm trying to blacklist email from 'Vegas Club Casino', which is being sent from different email addresses but always with the same 'From' in the header. Tried putting it in local.cf as blacklist_from Vegas Club Casino but those mails keep coming.
> If you look at the docs, you'll see that blacklist_* only applies to addresses.
>> How can I filter just on the From tag without using an email address but the name.
> A rule will be needed.

    header FROM_BLAH From:name =~ /\bBLAH\b/i

should do the trick.
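To illustrate what a `From:name` rule like the one above looks at, here is a minimal Python sketch (not SpamAssassin code). It assumes the display name is the part of the From header outside the angle brackets, and the pattern and sample addresses are made up for the example.

```python
import re
from email.utils import parseaddr

# Hypothetical pattern: the display name the original poster wants to catch.
NAME_PATTERN = re.compile(r"\bVegas Club Casino\b", re.IGNORECASE)


def from_name_matches(from_header: str) -> bool:
    """Return True if the display-name part of a From header matches."""
    # parseaddr splits 'Name <addr>' into (display name, address).
    display_name, _address = parseaddr(from_header)
    return bool(NAME_PATTERN.search(display_name))


# The sending address changes, the display name stays the same.
print(from_name_matches('"Vegas Club Casino" <x1@example.com>'))
print(from_name_matches('"Vegas Club Casino" <other@example.net>'))
# An address that merely resembles the name is not matched.
print(from_name_matches("Geert <vegas.club.casino@example.com>"))
```

This is why an address-based blacklist_from entry cannot catch these mails: only the name part is constant.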
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
On 03/03/2009 17:42, Matus UHLAR - fantomas wrote:
> I have already been thinking about the possibility of combining every two rules and doing a masscheck over them. Then, optionally repeating that again, skipping duplicates. Finally gather all rules that scored >= 0.5 || <= -0.5 - we could have an interesting ruleset here. But that's going to be a HUGE ruleset.

On Mar 3, 2009, at 10:06, John Wilcock <j...@tradoc.fr> wrote:
> Not to mention that different combinations will suit different sites. I wonder about the feasibility of a second Bayesian database, using the same learning mechanism as the current system, but keeping track of rule combinations instead of keywords.

LuKreme wrote:
> It sounds like a really good idea to me, and also like the most reasonable way to manage self-learning meta rules.

On 03.03.09 16:43, Marc Perkel wrote:
> It seems to me that the consensus is that it's worth a try. I don't know if it will work or not but I think there's a good chance this could be a significant advancement in how well SA works.

I should note that some policy rules and rules with manually updated scores (SPF_PASS, BAYES_*) may need to be exempted from this. We don't want SPF_PASS to generate a high positive score, do we?

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
M$ Win's are shit, do not use it !
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
On Wed, Mar 4, 2009 at 00:43, Marc Perkel <m...@perkel.com> wrote:
> LuKreme wrote:
>> On Mar 3, 2009, at 10:06, John Wilcock <j...@tradoc.fr> wrote:
>>> On 03/03/2009 17:42, Matus UHLAR - fantomas wrote:
>>>> I have already been thinking about the possibility of combining every two rules and doing a masscheck over them. Then, optionally repeating that again, skipping duplicates. Finally gather all rules that scored >= 0.5 || <= -0.5 - we could have an interesting ruleset here. But that's going to be a HUGE ruleset.
>>> Not to mention that different combinations will suit different sites. I wonder about the feasibility of a second Bayesian database, using the same learning mechanism as the current system, but keeping track of rule combinations instead of keywords.
>> It sounds like a really good idea to me, and also like the most reasonable way to manage self-learning meta rules.
> It seems to me that the consensus is that it's worth a try. I don't know if it will work or not but I think there's a good chance this could be a significant advancement in how well SA works.

So you're volunteering to code it up, then? ;)

--j.
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
On 04/03/2009 10:38, Matus UHLAR - fantomas wrote:
> I should note that some policy rules and rules with manually updated scores (SPF_PASS, BAYES_*) may need to be exempted from this. We don't want SPF_PASS to generate a high positive score, do we?

It could probably be argued both ways. There might be advantages in letting the postulated system give a positive boost to high-confidence spam indicators even if (or perhaps particularly when) they occur in combination with rules that are low-confidence ham indicators like SPF_PASS.

But I guess these sorts of details would need to be investigated by whoever takes on the task of designing and coding the system. It would no doubt take some fairly complex statistical analysis of different possible strategies to implement this idea. I for one have neither the time nor the expertise, unfortunately, to do much more than express an idea!

John.

-- 
-- Over 3000 webcams from ski resorts around the world - www.snoweye.com
-- Translate your technical documents and web pages - www.tradoc.fr
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
Justin Mason wrote:
> So you're volunteering to code it up, then? ;)

I was planning to do at least some brainstorming and experiments as to which learning methods would seem suitable and how well each method performs, whenever I have time again. Unless someone else did that already?
Re: Bye Bye Bayes
LuKreme wrote on Tue, 3 Mar 2009 19:02:06 -0700: How is it the same? Already read messages in inbox means the user has accepted those messages without trashing them or junking them. and the message may not have been learned by score. If you can make sure that your users *really* delete or move spam to the right places, then it works, yes. But I fear there is a chance that users just walk over spam and let it stay as (depending on the mail client) it may just not be visible anymore which may be good enough for them. So, there's a chance of undesired infection with spam. False junk would get pulled out of .Junk into the inbox and relearned as ham. How? By the user? When? What about vacation? I wouldn't trust too much that users do the right thing. Depends on your user base. Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.conactive.com
Re: Bye Bye Bayes
On Tue, 3 Mar 2009, LuKreme wrote:
> On Mar 3, 2009, at 17:07, John Hardin <jhar...@impsec.org> wrote:
>> On Tue, 3 Mar 2009, LuKreme wrote:
>>> I am considering the following:
>>> Autolearn read mail in the inbox as ham
>>> Autolearn mail in .Junk and .SPAM as spam
>>> This is pretty easy with maildir.
>> How is that different from using the built-in autolearning based on message score?
> How is it the same? Already read messages in inbox means the user has accepted those messages without trashing them or junking them.

Sorry, I didn't register that part. I thought it was just messages in the inbox.

Bear in mind some mail clients will mark a message read if you only highlight the title line. Auto-preview can be annoying that way sometimes.

> .Junk means the user, or the user's MUA, has flagged a message that is not tagged as spam.

Okay, I was assuming that was your SA spam quarantine, not your equivalent of the user's spam training folder.

> False junk would get pulled out of .Junk into the inbox and relearned as ham. Haven't done it, still mulling.

Now that you've explained it in more detail it sounds better.

-- 
John Hardin KA7OHZ  http://www.impsec.org/~jhardin/
jhar...@impsec.org  FALaholic #11174  pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
Failure to plan ahead on someone else's part does not constitute an emergency on my part. -- David W. Barts in a.s.r
---
4 days until Daylight Saving Time begins in U.S. - Spring Forward
Re: Bye Bye Bayes
On Wed, 4 Mar 2009, Kai Schaetzl wrote:
> LuKreme wrote on Tue, 3 Mar 2009 19:02:06 -0700:
>> How is it the same? Already read messages in inbox means the user has accepted those messages without trashing them or junking them.
> If you can make sure that your users *really* delete or move spam to the right places, then it works, yes.

That, of course, is the crux of the biscuit.

I used to have a couple of users who treated their Trash folder as long-term read-message storage. After reading most messages they'd move them to Trash, and _never_ _purge_ _it_. I couldn't break them of this habit, even after purging their Trash folder from the server a couple of times. (Oops! Disk failure! Well, that was trash, you can afford to lose that.)

> But I fear there is a chance that users just walk over spam and let it stay as (depending on the mail client) it may just not be visible anymore which may be good enough for them.

Or delete it rather than moving it to .Junk

I'll modify my earlier comment - it sounds good, assuming you have a high degree of users behaving the way you want them to.

-- 
John Hardin KA7OHZ  http://www.impsec.org/~jhardin/
jhar...@impsec.org  FALaholic #11174  pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
Failure to plan ahead on someone else's part does not constitute an emergency on my part. -- David W. Barts in a.s.r
---
4 days until Daylight Saving Time begins in U.S. - Spring Forward
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
Matus UHLAR - fantomas wrote:
> I should note that some policy rules and rules with manually updated scores (SPF_PASS, BAYES_*) may need to be exempted from this. We don't want SPF_PASS to generate a high positive score, do we?

The idea of all this is that we might discover that things like SPF_PASS, combined with other rules, are useful where SPF_PASS by itself is not. We might find ourselves generating more informational tokens that by themselves don't score but are useful in combination with other rules.
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
Justin Mason wrote:
> On Wed, Mar 4, 2009 at 00:43, Marc Perkel <m...@perkel.com> wrote:
>> LuKreme wrote:
>>> On Mar 3, 2009, at 10:06, John Wilcock <j...@tradoc.fr> wrote:
>>>> On 03/03/2009 17:42, Matus UHLAR - fantomas wrote:
>>>>> I have already been thinking about the possibility of combining every two rules and doing a masscheck over them. Then, optionally repeating that again, skipping duplicates. Finally gather all rules that scored >= 0.5 || <= -0.5 - we could have an interesting ruleset here. But that's going to be a HUGE ruleset.
>>>> Not to mention that different combinations will suit different sites. I wonder about the feasibility of a second Bayesian database, using the same learning mechanism as the current system, but keeping track of rule combinations instead of keywords.
>>> It sounds like a really good idea to me, and also like the most reasonable way to manage self-learning meta rules.
>> It seems to me that the consensus is that it's worth a try. I don't know if it will work or not but I think there's a good chance this could be a significant advancement in how well SA works.
> So you're volunteering to code it up, then? ;)
> --j.

I would if I were any good at perl.
Dealing with low scoring spam - tighter MTA integration [was: 2 + 2 != 4 - Spamassassin needs a new paradigm]
Karsten Bräckelmann <guent...@rudersport.de> wrote:
> On Tue, 2009-03-03 at 08:32 -0800, Marc Perkel wrote:
>> Spamassassin works by adding up points. Rule A is 2 points, Rule B is 2 points therefore the score is 4 points. But is this the best way to score? I don't think so.
>> [...]
>> Anyhow - just throwing this out there for people to chew on and think about.
> Oh, and another problem with this: About 98-99% of my spam in-stream scores so high that any such proposal results in a useless increase of the score.
> The problem lies with the LOW scoring spam. Alas, these do not tend to trigger on a solid subset or meta as you proposed. In particular, RBL hits are quite rare, even more so for multiple hits. The few rules hit by low scorers are quite diverse, which complicates this.

Maybe SpamAssassin should provide a set of tests intended for use before replying to RCPT TO: in the SMTP session? [Tests based on: sending IP address, envelope sender, envelope recipient, and the name in HELO/EHLO.]

Possible recommended actions: accept, temporary reject, permanent reject - with the choice based on spam score *AND* mail source reputation.

A temporary reject in the SMTP session should increase the chances of DNSBL hits by reducing the blind-spot period for newly created spam sources.

-- 
[plen: Andrew] Andrzej Adam Filip : a...@onet.eu
The difference between science and the fuzzy subjects is that science requires reasoning while those other subjects merely require scholarship. -- Robert Heinlein
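The accept / temporary reject / permanent reject choice described above could be sketched roughly as follows. This is only an illustration in Python, not an existing SpamAssassin or MTA interface: the thresholds, the reputation scale, and the function name are all assumptions made for the example.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    TEMPFAIL = "temporary reject"
    REJECT = "permanent reject"


def rcpt_time_action(score: float, reputation: int,
                     reject_at: float = 10.0,
                     defer_at: float = 5.0) -> Action:
    """Pick an SMTP-time action from an envelope-only score and a reputation.

    score:      hypothetical spam score from tests that need only the
                sending IP, envelope sender/recipient and HELO name.
    reputation: hypothetical scale, 0 = unknown source, > 0 = trusted source.
    """
    if score >= reject_at:
        return Action.REJECT
    if reputation > 0:
        # Sources with an established reputation are never deferred here.
        return Action.ACCEPT
    if score >= defer_at:
        # Unknown source with a middling score: ask it to retry later,
        # which gives DNSBLs time to list a newly created spam source.
        return Action.TEMPFAIL
    return Action.ACCEPT


print(rcpt_time_action(6.0, reputation=0))
print(rcpt_time_action(6.0, reputation=1))
```

The point of combining both inputs is that the same middling score leads to a deferral for an unknown source but to acceptance for a source with a known reputation.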
Re: Bye Bye Bayes
John Hardin wrote on Wed, 4 Mar 2009 06:17:16 -0800 (PST): (Oops! Disk failure! Well, that was trash, you can afford to lose that.) thanks for the laugh :-) Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.conactive.com
Re: Dealing with low scoring spam - tighter MTA integration [was: 2 + 2 != 4 - Spamassassin needs a new paradigm]
On Wed, 2009-03-04 at 16:02 +0100, Andrzej Adam Filip wrote:
> Karsten Bräckelmann <guent...@rudersport.de> wrote:
>> About 98-99% of my spam in-stream scores so high that any such proposal results in a useless increase of the score.
>> The problem lies with the LOW scoring spam. Alas, these do not tend to trigger on a solid subset or meta as you proposed. In particular, RBL hits are quite rare, even more so for multiple hits. The few rules hit by low scorers are quite diverse, which complicates this.
> Maybe SpamAssassin should provide a set of tests intended for use before replying to RCPT TO: in the SMTP session? [Tests based on: sending IP address, envelope sender, envelope recipient, and the name in HELO/EHLO.]

This would be an entirely different application, not SA, wouldn't it?

Well, this probably could be done in SA using a multi-level protocol capable of returning values at different stages. However, this seems perfectly suited for a lightweight tool, rather than a hog that is designed to scan and process entire messages. :)

> Possible recommended actions: accept, temporary reject, permanent reject - with the choice based on spam score *AND* mail source reputation.
> A temporary reject in the SMTP session should increase the chances of DNSBL hits by reducing the blind-spot period for newly created spam sources.

Experience with grey-listing, tempfail or whatever varies wildly, given the posts to this list. Some do report that the zombies won't retry anyway after being tempfailed once. So a later DNSBL hit, after the list catching up and DNS propagation, may even be irrelevant.

-- 
char *t=\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4; main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;il;i++){ i%8? c=1: (c=*++x); c128 (s+=h); if (!(h=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
Re: Bye Bye Bayes
I used to have a couple of users who treated their Trash folder as long-term read-message storage. I have a user like that at $DAYJOB. I used to ask him if he kept his car title and other important documents in the wastebasket under his desk at home. -- Dave Pooser Cat-Herder-in-Chief, Pooserville.com Sarcasm Error: Abort, Retry, Bite Me? -Legostar Galactica
Re: Dealing with low scoring spam - tighter MTA integration
Karsten Bräckelmann <guent...@rudersport.de> wrote:
> On Wed, 2009-03-04 at 16:02 +0100, Andrzej Adam Filip wrote:
>> Karsten Bräckelmann <guent...@rudersport.de> wrote:
>>> About 98-99% of my spam in-stream scores so high that any such proposal results in a useless increase of the score.
>>> The problem lies with the LOW scoring spam. Alas, these do not tend to trigger on a solid subset or meta as you proposed. In particular, RBL hits are quite rare, even more so for multiple hits. The few rules hit by low scorers are quite diverse, which complicates this.
>> Maybe SpamAssassin should provide a set of tests intended for use before replying to RCPT TO: in the SMTP session? [Tests based on: sending IP address, envelope sender, envelope recipient, and the name in HELO/EHLO.]
> This would be an entirely different application, not SA, wouldn't it?

It can be developed using the same spam score logic, based on the subset of all tests that require only a subset of the data available during a classic run.

I do think that promoting tools that encourage postmasters to care very much about mail server (IP address) reputation can make a real difference, e.g. caring to be above reputation "none" in DNSWL to avoid grey-listing.

> Well, this probably could be done in SA using a multi-level protocol capable of returning values at different stages. However, this seems perfectly suited for a lightweight tool, rather than a hog that is designed to scan and process entire messages. :)

During initial tests/deployment a *much* simpler implementation can be used, with the recommended action based on spam score. It would require a redesign of the 50_scores.cf structure, e.g. instead of

    score RCVD_IN_DNSWL_HI 0 -8 0 -8

something like

    # N - Network, B - Bayes, nX - no X, R - RCPT TO:
    score RCVD_IN_DNSWL_HI nNnB=0 NnB=-8 nNB=0 NB=-8 R=-8

or shorter

    score RCVD_IN_DNSWL_HI N=-8 R=-8

>> Possible recommended actions: accept, temporary reject, permanent reject - with the choice based on spam score *AND* mail source reputation.
>> A temporary reject in the SMTP session should increase the chances of DNSBL hits by reducing the blind-spot period for newly created spam sources.
> Experience with grey-listing, tempfail or whatever varies wildly, given the posts to this list. Some do report that the zombies won't retry anyway after being tempfailed once. So a later DNSBL hit, after the list catching up and DNS propagation, may even be irrelevant.

There are DUL zombies that effectively do frequent IP address hopping, and static NAT zombies. The former are bigger in number, the latter produce higher spam volume (IMHO).

-- 
[plen: Andrew] Andrzej Adam Filip : a...@onet.eu
All the taxes paid over a lifetime by the average American are spent by the government in less than a second. -- Jim Fiebig
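To make the keyed score syntax proposed in this message more concrete, here is a small Python sketch of how such a line could be parsed. The `KEY=VALUE` syntax is only the poster's proposal, not something SpamAssassin's 50_scores.cf supports, and the function names and the default of 0.0 for a missing key are assumptions made for the example.

```python
def parse_score_line(line: str) -> tuple[str, dict[str, float]]:
    """Parse 'score RULE KEY=VALUE ...' into (rule name, {key: score})."""
    fields = line.split()
    if len(fields) < 3 or fields[0] != "score":
        raise ValueError(f"not a keyed score line: {line!r}")
    rule = fields[1]
    scores: dict[str, float] = {}
    for item in fields[2:]:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        scores[key] = float(value)
    return rule, scores


def score_for(scores: dict[str, float], mode: str, default: float = 0.0) -> float:
    """Look up the score for one mode, e.g. 'N' (network) or 'R' (RCPT TO:)."""
    return scores.get(mode, default)


rule, scores = parse_score_line("score RCVD_IN_DNSWL_HI N=-8 R=-8")
print(rule, score_for(scores, "R"), score_for(scores, "B"))
```

A keyed form like this would let a new mode such as R be added without changing the meaning of existing entries, which is the compatibility concern raised later in the thread.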
Re: Bye Bye Bayes
On Wed, 2009-03-04 at 16:31 +0100, Kai Schaetzl wrote: John Hardin wrote on Wed, 4 Mar 2009 06:17:16 -0800 (PST): (Oops! Disk failure! Well, that was trash, you can afford to lose that.) thanks for the laugh :-) How many of you have seen the BOFH (Bastard Operator From Hell) stories? http://www.theregister.co.uk/odds/bofh/ They may amuse some of you... Martin
Re: Bye Bye Bayes
Karsten Bräckelmann wrote on Wed, 04 Mar 2009 02:25:51 +0100:
> That's bayes_auto_learn_threshold_spam and nonspam respectively, I guess? Keep in mind that threshold is not the actual score, so you aren't learning all spam with a score of 8+ then.

Right, I know. That's where the spam quarantine comes into play. All spam in it (= everything with score 5 or higher) gets learned in the night. That's absolutely necessary as we don't get much spam. 96% of the mail that is accepted is ham (or spam that comes in because the user opted out; there's no distinction because there's no detection). The remainder is either a virus or other bad content or high-scoring spam. Low-scoring spam is almost non-existent.

> Kai, given a nonspam threshold of -2, how exactly do you (manually) learn ham? That would be interesting. And what's the ham/spam ratio?

I just checked and have to admit we must have removed the bayes_auto_learn_threshold_nonspam -2 some time ago, as 0.01 seems to be reliable enough. Only the bayes_auto_learn_threshold_spam 8 is in effect now. But I believe -2 would also deliver enough ham for autolearning.

Score distribution of the last 40,000 or so messages on the same server:

    Score   Messages
    -15            6
     -4        3,364
     -3        4,249
     -2        9,982
     -1        4,760
      0       13,995
      1        1,267
      2          789
      3          387

Bayes from that machine:

    0.000  0    66285  0  non-token data: nspam
    0.000  0    85888  0  non-token data: nham
    0.000  0  1864402  0  non-token data: ntokens

As you see, because of the structure of the incoming mail, the ham exceeds the spam and the gap is probably steadily growing. This is also reflected in the rule hits. The no. 1 rule that hits is BAYES_00 (it hits 99.7% of all ham). BAYES_99 is only at around position 25, but with 100% accuracy and as the no. 1 rule hitting spam (hitting about 50% of spam).

On servers where I get in some spam trap email and let part of it flow through the MTA rejection, the picture is very different. For instance, the server for my own domains has only 25% ham. BAYES_99 is the no. 1 hitting rule with an accuracy of 95.8% (again, not checked if the remainder really was ham), with all the URIBL rules and BAYES_00 (accuracy 99.9%) as runners-up.

So, all in all, Bayes works very well for me. Especially in those cases where no other rule hits (typically some spamvertized site not yet on a URIBL) it's most often the only rule that hits. That's why I moved it to 5.0 a while ago. Works very well. I think if you use DCC or Razor you may get similar results for these rules and may not need to rely so much on Bayes. I do not use *any* network rules except for the URIBL stuff, which isn't shut off by skip_rbl_checks 1.

(Figures are taken from mailwatch rule hits tables.)

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
Marc Perkel wrote:
> Justin Mason wrote:
>> So you're volunteering to code it up, then? ;)
>> --j.
> I would if I were any good at perl.

I think we should evaluate whether the suggested technique works and performs better, or is at least of some benefit, before trying to implement it properly as a plugin. Such a test can easily be done offline with spam/ham...

I started writing a script that mines some of my spam and ham, and then I'll evaluate how good the classifiers are that I get.

Cheers,
Chris
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
On 4-Mar-2009, at 02:38, Matus UHLAR - fantomas wrote:
> LuKreme wrote:
>> It sounds like a really good idea to me, and also like the most reasonable way to manage self-learning meta rules.
> On 03.03.09 16:43, Marc Perkel wrote:
>> It seems to me that the consensus is that it's worth a try. I don't know if it will work or not but I think there's a good chance this could be a significant advancement in how well SA works.
> I should note that some policy rules and rules with manually updated scores (SPF_PASS, BAYES_*) may need to be exempted from this. We don't want SPF_PASS to generate a high positive score, do we?

You're still thinking linearly. It's not a matter of giving SPF_PASS a high score, it's a matter of looking at ham and noticing that SPF_PASS and BAYES_00 and RANDTEST_08 have a ham index of 4% and a spam index of 0.02%, and modifying mail that hits those three tests with a +2.0 to the overall score.

It seems to me that from a programming standpoint, not much needs to be done. All we need is a second Bayes instance that ONLY looks at the X-Spam-Status line and a new status line, X-Spam-Passed, which SA gets patched to add as well. Let Bayes run over a large corpus of mail looking at those two headers and see if it's useful.

-- 
My mind is going. There is no question about it. I can feel it. I can feel it. I can feel it. I'm... afraid.
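The idea of a second Bayes-style pass over rule hits rather than words can be tried offline with a few lines of Python. The sketch below is an assumption-laden toy, not SpamAssassin's Bayes implementation: it reads rule names from the tests= field of an X-Spam-Status value, counts rule pairs in ham and spam, and combines them naive-Bayes style with add-one smoothing and equal priors. The class and function names are made up, and the proposed X-Spam-Passed header is not modelled.

```python
import math
import re
from collections import Counter
from itertools import combinations


def rules_hit(status_value: str) -> list[str]:
    """Extract rule names from the tests= field of an X-Spam-Status value."""
    match = re.search(r"tests=(.*?)(?:\s+\w+=|$)", status_value, re.S)
    if not match:
        return []
    names = re.split(r"[,\s]+", match.group(1))
    return sorted(n for n in names if n and n != "none")


class RuleComboClassifier:
    """Toy classifier that learns from pairs of rules hitting together."""

    def __init__(self) -> None:
        self.pairs = {"ham": Counter(), "spam": Counter()}
        self.totals = {"ham": 0, "spam": 0}

    @staticmethod
    def _pairs(rules):
        return combinations(sorted(set(rules)), 2)

    def learn(self, rules, label: str) -> None:
        self.totals[label] += 1
        self.pairs[label].update(self._pairs(rules))

    def spam_probability(self, rules) -> float:
        log_odds = 0.0
        for pair in self._pairs(rules):
            # Add-one smoothing so unseen pairs do not yield zero.
            p_spam = (self.pairs["spam"][pair] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.pairs["ham"][pair] + 1) / (self.totals["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid exp() overflow
        return 1.0 / (1.0 + math.exp(-log_odds))


clf = RuleComboClassifier()
ham_status = "No, score=-2.6 required=5.0 tests=BAYES_00,SPF_PASS autolearn=ham version=3.2.5"
for _ in range(3):
    clf.learn(rules_hit(ham_status), "ham")
    clf.learn(["HTML_MESSAGE", "SPF_PASS"], "spam")
print(clf.spam_probability(["BAYES_00", "SPF_PASS"]))
```

Note that SPF_PASS appears in both classes here; only the pair it forms with another rule carries weight, which is the non-linear effect being argued for in the thread.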
Re: Dealing with low scoring spam - tighter MTA integration
On Wed, 4 Mar 2009, Andrzej Adam Filip wrote:
>> This would be an entirely different application, not SA, wouldn't it?
> It can be developed using the same spam score logic, based on the subset of all tests that require only a subset of the data available during a classic run.

So in other words something like SMTP-time DNSBL tests that score points towards rejection rather than being pass/fail? That sounds like a good idea.

> I do think that promoting tools that encourage postmasters to care very much about mail server (IP address) reputation can make a real difference, e.g. caring to be above reputation "none" in DNSWL to avoid grey-listing.

Agreed. But performing a major redesign of SA to achieve this pre-RCPT is going to be a tough sell.

>> Well, this probably could be done in SA using a multi-level protocol capable of returning values at different stages. However, this seems perfectly suited for a lightweight tool, rather than a hog that is designed to scan and process entire messages. :)
> During initial tests/deployment a *much* simpler implementation can be used, with the recommended action based on spam score. It would require a redesign of the 50_scores.cf structure, e.g. instead of
>     score RCVD_IN_DNSWL_HI 0 -8 0 -8
> something like
>     # N - Network, B - Bayes, nX - no X, R - RCPT TO:
>     score RCVD_IN_DNSWL_HI nNnB=0 NnB=-8 nNB=0 NB=-8 R=-8
> or shorter
>     score RCVD_IN_DNSWL_HI N=-8 R=-8

Why would SA be served by _major_ modifications like this, rather than writing a new milter that focuses on determining the reputation of an IP? Are you really willing to break _all_ existing SA installations for this?

Please don't try to make SA a do-everything tool; you'll likely weaken what it does an outstanding job of today.

-- 
John Hardin KA7OHZ  http://www.impsec.org/~jhardin/
jhar...@impsec.org  FALaholic #11174  pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
Failure to plan ahead on someone else's part does not constitute an emergency on my part. -- David W. Barts in a.s.r
---
4 days until Daylight Saving Time begins in U.S. - Spring Forward
Re: Dealing with low scoring spam - tighter MTA integration
John Hardin <jhar...@impsec.org> wrote:
> On Wed, 4 Mar 2009, Andrzej Adam Filip wrote:
>>> This would be an entirely different application, not SA, wouldn't it?
>> It can be developed using the same spam score logic, based on the subset of all tests that require only a subset of the data available during a classic run.
> So in other words something like SMTP-time DNSBL tests that score points towards rejection rather than being pass/fail? That sounds like a good idea.
>> I do think that promoting tools that encourage postmasters to care very much about mail server (IP address) reputation can make a real difference, e.g. caring to be above reputation "none" in DNSWL to avoid grey-listing.
> Agreed. But performing a major redesign of SA to achieve this pre-RCPT is going to be a tough sell.
>>> Well, this probably could be done in SA using a multi-level protocol capable of returning values at different stages. However, this seems perfectly suited for a lightweight tool, rather than a hog that is designed to scan and process entire messages. :)
>> During initial tests/deployment a *much* simpler implementation can be used, with the recommended action based on spam score. It would require a redesign of the 50_scores.cf structure, e.g. instead of
>>     score RCVD_IN_DNSWL_HI 0 -8 0 -8
>> something like
>>     # N - Network, B - Bayes, nX - no X, R - RCPT TO:
>>     score RCVD_IN_DNSWL_HI nNnB=0 NnB=-8 nNB=0 NB=-8 R=-8
>> or shorter
>>     score RCVD_IN_DNSWL_HI N=-8 R=-8
> Why would SA be served by _major_ modifications like this, rather than writing a new milter that focuses on determining the reputation of an IP? Are you really willing to break _all_ existing SA installations for this?
> Please don't try to make SA a do-everything tool; you'll likely weaken what it does an outstanding job of today.

0) Such a _major_ modification means introducing it in the next _major_ SpamAssassin release, unless it can be made backward compatible, e.g. by using a *separate* score file for the tests run at RCPT TO:. Where there's a will, there's a way.

1) I want milter(s) (MIMEDefang's filtering script in Perl) to use SpamAssassin in such a role. I personally prefer such tools from teams with a well-established maintenance reputation. I also believe that the SA score tuning methodology would fit very well too.

2) Anyway, limiting scores to *only* four cases *SHOULD NOT* stay forever.

-- 
[plen: Andrew] Andrzej Adam Filip : a...@onet.eu
All the people are so happy now, their heads are caving in. I'm glad they are a snowman with protective rubber skin -- They Might Be Giants
Re: Bye Bye Bayes
On 4-Mar-2009, at 07:06, John Hardin wrote:
> On Tue, 3 Mar 2009, LuKreme wrote:
>> On Mar 3, 2009, at 17:07, John Hardin <jhar...@impsec.org> wrote:
>>> On Tue, 3 Mar 2009, LuKreme wrote:
>>>> I am considering the following:
>>>> Autolearn read mail in the inbox as ham
>>>> Autolearn mail in .Junk and .SPAM as spam
>>>> This is pretty easy with maildir.
>>> How is that different from using the built-in autolearning based on message score?
>> How is it the same? Already read messages in inbox means the user has accepted those messages without trashing them or junking them.
> Sorry, I didn't register that part. I thought it was just messages in the inbox.
> Bear in mind some mail clients will mark a message read if you only highlight the title line. Auto-preview can be annoying that way sometimes.

Yep, and I think THAT has caused me to decide against doing this. Instead I have changed to thinking about having it autolearn as ham messages that are read and are NOT in .Junk* .SPAM* /cur /new or .Trash* -- but again, just mulling it over.

>> .Junk means the user, or the user's MUA, has flagged a message that is not tagged as spam.
> Okay, I was assuming that was your SA spam quarantine, not your equivalent of the user's spam training folder.

I believe both the mozilla email programs (Tbird, Netscrape, Postbox) and Apple Mail.app use Junk for messages the MUAs think are spammish, not sure about any other clients. Our SA spam quarantine is .SPAM

>> False junk would get pulled out of .Junk into the inbox and relearned as ham. Haven't done it, still mulling.
> Now that you've explained it in more detail it sounds better.

Better, but not good, perhaps. I've half a mind to simply forget auto-learning for the virtual users completely and make them use sa-ham/sa-spam to manually train, and if they don't? Yeah, too bad.

OTOH, I'm a little tired and cranky today...

-- 
But just because you've seen me on your TV
Doesn't mean I'm any more enlightened than you
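The selection being mulled over here (read messages outside the junk, spam and trash folders) can be sketched in Python. This assumes a Maildir++ layout where subfolders are dot-prefixed directories under the mailbox root and where a read message carries the S flag after ":2," in its filename. The folder prefixes come from the post, while the function names are made up for the example.

```python
from pathlib import Path

# Folder name prefixes that must never be learned as ham (from the post).
EXCLUDED_PREFIXES = (".Junk", ".SPAM", ".Trash")


def is_seen(filename: str) -> bool:
    """Maildir marks read messages with an 'S' among the flags after ':2,'."""
    _base, sep, flags = filename.rpartition(":2,")
    return bool(sep) and "S" in flags


def ham_candidates(maildir: Path):
    """Yield read messages from the inbox and every non-excluded subfolder."""
    subfolders = sorted(
        p for p in maildir.iterdir()
        if p.is_dir()
        and p.name.startswith(".")
        and not p.name.startswith(EXCLUDED_PREFIXES)
    )
    for folder in [maildir, *subfolders]:
        cur = folder / "cur"  # new/ only holds messages not yet seen by a client
        if not cur.is_dir():
            continue
        for message in sorted(cur.iterdir()):
            if message.is_file() and is_seen(message.name):
                yield message
```

The yielded paths could then be handed to `sa-learn --ham`. The caveat raised in the thread still applies: a client that marks messages as read on preview will set the S flag on mail the user never really looked at.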
Please remove li...@billmerriam.com
Could the list-admin please remove li...@billmerriam.com from the list as it bounces all messages back to the original sender? Thanks! (BTW: there's no list-admin address published anywhere.) Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.conactive.com
SpamAssassin Doesn't Appear to be working
I have a FreeBSD 7.0 RC3 server running postfix, amavisd-new, clamavd and SpamAssassin. Having just upgraded ports, I believe they are all current releases. In this setup I am led to believe that amavisd-new handles the SA config, but I did not see a process for spamd, so I enabled it in rc.conf. I am not seeing any X-Spam related headers in the long message header, but the GTUBE test message was discarded, so it appears to be working. My amavisd config for SA is:

$sa_tag_level_deflt = 2.0; # add spam info headers if at, or above that level
$sa_tag2_level_deflt = 6.31; # add 'spam detected' headers at that level
$sa_kill_level_deflt = 4.0; # triggers spam evasive actions
$sa_dsn_cutoff_level = 10; # spam level beyond which a DSN is not sent
# $sa_quarantine_cutoff_level = 20; # spam level beyond which quarantine is off
# $penpals_bonus_score = 5; # (no effect without a @storage_sql_dsn database)
# $penpals_threshold_high = $sa_kill_level_deflt; # don't waste time on hi spam
$sa_mail_body_size_limit = 400*1024; # don't waste time on SA if mail is larger
$sa_local_tests_only = 0; # only tests which do not require internet access?
$sa_spam_subject_tag = '***SPAM*** ';
# SpamAssassin settings
@bypass_banned_checks_maps = ( ['.sanddollarbonaire.com','habitatbonaire.com', '.constantcontact.com','.att.net'] );
# $sa_local_tests_only is passed to Mail::SpamAssassin::new as a value
# of the option local_tests_only. See Mail::SpamAssassin man page.
# If set to 1, no SA tests that require internet access will be performed.
# $sa_local_tests_only = 0; # only tests which do not require internet access?
#$sa_auto_whitelist = 1; # turn on AWL in SA 2.63 or older (irrelevant
# for SA 3.0, its cf option is use_auto_whitelist)

Any thoughts as to what happened to the headers, and should I even care?
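As a reading aid for the three levels in that config, here is a small Python sketch that follows only what the config comments say: info headers at or above the tag level, 'spam detected' headers at the tag2 level, evasive actions at the kill level. It is a simplified model, not amavisd-new code; the function name and the action labels are made up, and using ">=" for all three comparisons is an assumption.

```python
def spam_actions(score, tag_level=2.0, tag2_level=6.31, kill_level=4.0):
    """List which of the three configured levels a score reaches.

    Default values are copied from the posted config; the labels are
    illustrative only.
    """
    actions = []
    if score >= tag_level:
        actions.append("info_headers")
    if score >= tag2_level:
        actions.append("spam_detected_headers")
    if score >= kill_level:
        actions.append("evasive_action")
    return actions


# With kill_level (4.0) below tag2_level (6.31), a score of 5.0 reaches
# the evasive-action level without reaching the 'spam detected' level.
print(spam_actions(5.0))
```

A GTUBE message scores far above all three levels, which fits the observation that it was discarded.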
thanks jason
Re: SpamAssassin Doesn't Appear to be working
Jason, I have a FreeBSD 7.0 RC3 server running postfix, amavisd-new, clamavd and SpamAssassin... Having just upgraded ports I believe they are all current releases. In this setup I am led to believe that amavisd-new handles the SA config, but I did not see a process for spamd, so I enabled it in rc.conf. There is no need for a spamd process in this setup - think of the amavisd process as an equivalent of spamd (in that it calls the SpamAssassin library of Perl modules), but one that speaks a different protocol: amavisd speaks SMTP, spamd speaks the spamc/spamd protocol. I am not seeing any X-Spam related headers in the long message header but the GTUBE test message was discarded so it appears to be working. Any thoughts as to what happened to the headers, and should I even care? The most likely reason for the absence of X-Spam-* header fields is that the recipient was not considered local - check your setting of @local_domains_maps (or %local_domains or @local_domains_acl). X-Spam-* header fields are not inserted for outbound mail (i.e. when the recipient is not considered local). Check the log (possibly at an elevated log level) to make sure. Mark
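To illustrate the "recipient not considered local" point, here is a simplified Python model of a domain lookup. It assumes the common convention that an entry with a leading dot covers the domain itself and its subdomains, while an entry without a dot must match exactly. This is a sketch of the idea only, not how amavisd-new implements its lookups, and example.com stands in for a real domain.

```python
def is_local(recipient, local_domains):
    """Return True if the recipient's domain matches a local-domain entry."""
    domain = recipient.rpartition("@")[2].lower()
    for entry in local_domains:
        entry = entry.lower()
        if entry.startswith("."):
            # '.example.com' covers example.com and anything below it
            if domain == entry[1:] or domain.endswith(entry):
                return True
        elif domain == entry:
            return True
    return False


def headers_expected(recipients, local_domains):
    """X-Spam-* headers are only expected if some recipient is local."""
    return any(is_local(r, local_domains) for r in recipients)
```

If headers_expected() would be False for the test recipient under the configured domain list, missing X-Spam-* fields are the expected outcome rather than a fault.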
Re: More Google group messages
Kai Schaetzl wrote: Albert E. Whale wrote on Sun, 01 Mar 2009 12:51:36 -0500: Our Email is Filtered using the SPAMZapper. More than 50 Million connections filtered and still counting. SPAMZapper stops Spam, Viruses, and other Malware before it reaches your PC. Doesn't this ultimate tool stop it? Kai Thanks, Kai! Yes, it does a VERY good job of stopping them, but like everything else, it requires some maintenance, which is what we provide for our customers. Thank you!
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
decoder wrote: Justin Mason wrote: So you're volunteering to code it up, then? ;) I was planning to do at least some brainstorming and experiments as to what learning methods would seem suitable and how well the method performs, whenever I have time again. Unless someone else did that already? Ok, I did some short experiments: I built an SVM classifier from a large mail corpus (8226 mails: 5414 ham, 2812 spam) and did a 5-fold cross validation. The resulting classifier has an accuracy of over 99%, so it performs as well as the regular system. I then applied it to a set of 202 false negatives that I collected, and 69 of these are recognized as spam by the SVM. As a second test, I pulled 2707 mails from one of my other inboxes and applied the classifier; the accuracy was again over 99% (and this is only ham). From my point of view, the results show that this approach has potential. It is highly accurate with respect to the current system, but additionally outperformed it on several false negatives. There are other advantages that this system has over the common system: it allows everybody to train the whole spam filter (not only Bayes) on the kind of spam that they receive, i.e. it is more adaptive than the common system. Any opinions on this are very welcome. Maybe we should try to come up with a proof of concept plugin for SA? Best regards, Chris
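For readers unfamiliar with 5-fold cross validation, here is a minimal Python sketch of the splitting step only, using the corpus size from the post. The SVM training itself is not shown and no particular library is implied; the helper name is made up, and in practice the corpus would be shuffled before splitting.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists for k roughly equal folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


corpus_size = 5414 + 2812          # ham + spam, as reported in the post
folds = list(k_fold_indices(corpus_size, k=5))

# Share of the collected false negatives that the SVM caught.
recovered_share = 69 / 202
print(len(folds), round(recovered_share * 100, 1))
```

Each mail lands in exactly one test fold, so every mail is classified once by a model that never saw it during training. The 69 of 202 figure works out to roughly a third of the previously missed spam.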
Re: Dealing with low scoring spam - tighter MTA integration [was: 2 + 2 != 4 - Spamassassin needs a new paradigm]
At 07:02 04-03-2009, Andrzej Adam Filip wrote: May be spamassassin should create set of tests intended for use before replying RCPT TO: in SMTP session? [ test based on: sending IP address, envelope sender, envelope recipient, and name in helo/ehlo ] SpamAssassin processes the message and returns the result. The way it is designed, it can be integrated in different environments as it is MTA agnostic. The change you propose could be done by introducing a new command in the protocol to evaluate the envelope information only. It would be easier to do all that through a milter as there is less overhead. The downside is that you will get more false positives. Regards, -sm
Re: Dealing with low scoring spam - tighter MTA integration [was: 2 + 2 != 4 - Spamassassin needs a new paradigm]
--On Wednesday, March 04, 2009 4:02 PM +0100 Andrzej Adam Filip a...@onet.eu wrote: May be spamassassin should create set of tests intended for use before replying RCPT TO: in SMTP session? Check out http://mimedefang.org/ MIMEDefang includes SA integration.
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
decoder wrote: decoder wrote: Justin Mason wrote: So you're volunteering to code it up, then? ;) I was planning to do at least some brainstorming and experiments as to what learning methods would seem suitable and how well the method performs, whenever I have time again. Unless someone else did that already? Ok, I did some short experiments: I built an SVM classifier from a large mail corpus (8226 mails: 5414 ham, 2812 spam) and did a 5-fold cross validation. The resulting classifier has an accuracy of over 99%, so it performs as well as the regular system. I then applied it to a set of 202 false negatives that I collected, and 69 of these are recognized as spam by the SVM. As a second test, I pulled 2707 mails from one of my other inboxes and applied the classifier; the accuracy was again over 99% (and this is only ham). From my point of view, the results show that this approach has potential. It is highly accurate with respect to the current system, but additionally outperformed it on several false negatives. There are other advantages that this system has over the common system: it allows everybody to train the whole spam filter (not only Bayes) on the kind of spam that they receive, i.e. it is more adaptive than the common system. Any opinions on this are very welcome. Maybe we should try to come up with a proof of concept plugin for SA? Good work so far, but it sounds like you need to throw more data at it. Also, even though you indicate over 99% accuracy, can you break that down better? 99.9% is 10 times as accurate as 99%. Also - when it identifies messages, do the numbers on the spam scores go up and ham go down? If so, that makes it more solid and starves the middle. I'm encouraged that the initial results are good. My feeling is that if this works, it will work better if we have more informational tokens. For example: is the From address a freemail address? Does the message contain a freemail address? By themselves these wouldn't score points.
But spam coming from yahoo, hotmail, gmail, etc. is a different kind of spam than spam coming from spambots. Maybe country tokens from the received lines would be useful. Maybe names of banks in the message would be useful. For example Bank of America + Nigeria = spam. I'm really glad you're junking on this. I think it will be a breakthrough. Some of these tokens
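The request to break down the "over 99%" figure can be made concrete with a little arithmetic. The Python sketch below shows why 99.9% is ten times as accurate as 99% in terms of errors, and how a single accuracy number splits into false positive and false negative rates. The confusion-matrix counts in the example are invented purely for illustration and are not results from the experiment; only the corpus sizes (5414 ham, 2812 spam) come from the post.

```python
def expected_errors(n_messages, accuracy):
    """Number of misclassified messages implied by an accuracy figure."""
    return round(n_messages * (1 - accuracy))


def breakdown(tp, fp, tn, fn):
    """Split overall accuracy into per-class error rates.

    tp/fn count spam (caught / missed), tn/fp count ham (passed / flagged).
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),   # ham flagged as spam
        "false_negative_rate": fn / (fn + tp),   # spam that got through
    }


# On the 8226-mail corpus: about 82 errors at 99%, about 8 at 99.9%.
print(expected_errors(8226, 0.99), expected_errors(8226, 0.999))

# Hypothetical counts that add up to 5414 ham and 2812 spam.
print(breakdown(tp=2790, fp=14, tn=5400, fn=22))
```

Two classifiers with the same overall accuracy can differ a lot in false positive rate, which is usually the number mail admins care about most.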
Re: Dealing with low scoring spam - tighter MTA integration
Kenneth Porter sh...@sewingwitch.com wrote: --On Wednesday, March 04, 2009 4:02 PM +0100 Andrzej Adam Filip a...@onet.eu wrote: May be spamassassin should create set of tests intended for use before replying RCPT TO: in SMTP session? Check out http://mimedefang.org/ MIMEDefang includes SA integration. I know MIMEDefang and I use it on one installation. What I would like to see is an option to make SpamAssassin produce weighted scores based on the subset of tests that can work on the subset of the final data available *before* the message headers/body are transferred in the SMTP session. -- [plen: Andrew] Andrzej Adam Filip : a...@onet.eu Treaties are like roses and young girls -- they last while they last. -- Charles DeGaulle
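The idea of scoring with only the tests that can run at a given SMTP stage could look roughly like the Python sketch below. Everything in it is hypothetical: the rule names, weights and predicates are invented for illustration and are not SpamAssassin rules, and this is not an existing SpamAssassin or milter interface.

```python
# SMTP stages in the order the data becomes available.
STAGES = ["connect", "helo", "mail_from", "rcpt_to", "data"]

# (name, earliest stage at which the rule can run, weight, predicate)
RULES = [
    ("HELO_NO_DOT", "helo", 1.0, lambda s: "." not in s["helo"]),
    ("SENDER_EQ_RCPT", "rcpt_to", 0.5, lambda s: s["sender"] == s["recipient"]),
    ("BODY_HAS_URL", "data", 2.0, lambda s: "http://" in s.get("body", "")),
]


def score_at(stage, session):
    """Sum the weights of matching rules runnable at or before `stage`."""
    limit = STAGES.index(stage)
    total = 0.0
    for _name, needs, weight, predicate in RULES:
        if STAGES.index(needs) <= limit and predicate(session):
            total += weight
    return total


session = {
    "client_ip": "192.0.2.10",
    "helo": "localhost",
    "sender": "a@example.com",
    "recipient": "a@example.com",
    "body": "see http://example.net",
}
print(score_at("rcpt_to", session), score_at("data", session))
```

An MTA hook could call score_at("rcpt_to", ...) before answering RCPT TO: and keep the full evaluation for after DATA. As noted earlier in the thread, deciding on envelope data alone is likely to raise the false positive rate.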