Effectiveness of Bayes poisoning (was Re: Spam Pattern)
On Wed, 12 Feb 2014 13:11:19 -0800 (PST) John Hardin jhar...@impsec.org wrote: That only works if your hammy mail stream contains text that looks like the random garbage they put in to try to spoof bayes. Indeed. Just for kicks, I ran the OP's pastebin example through our Bayes database and it scored 99.99% likelihood of spam. The word Wopsle, for example, was a dead giveaway... that never appears in our ham stream, but has appeared in 93 spams in our database. Bayes poisoning, in our experience, is only occasionally effective. Regards, David.
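The "Wopsle" observation (93 spams, zero hams) is exactly the kind of evidence a Bayes classifier thrives on. A minimal sketch, not SpamAssassin's actual implementation, of a Graham/Robinson-style per-token spam probability with a smoothing prior:

```python
# Toy per-token spam probability with a smoothing prior. Corpus sizes and
# parameter values are illustrative, not SpamAssassin's internals.
def token_spam_probability(spam_hits, ham_hits, spam_total, ham_total,
                           strength=1.0, prior=0.5):
    p_spam = spam_hits / spam_total          # token frequency in spam
    p_ham = ham_hits / ham_total             # token frequency in ham
    if p_spam + p_ham == 0:
        raw = prior                          # token never seen: fall back
    else:
        raw = p_spam / (p_spam + p_ham)
    n = spam_hits + ham_hits
    # Shrink toward the prior when the token has few observations:
    return (strength * prior + n * raw) / (strength + n)

# A token like "Wopsle" seen in 93 spams and 0 hams (corpus sizes assumed)
# comes out overwhelmingly spammy:
print(round(token_spam_probability(93, 0, 10000, 10000), 4))  # prints 0.9947
```

Random "poison" words either never reach this kind of lopsided count, or, as David notes, become spam indicators themselves.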
Re: Trouble with bayes poisoning spam
Hi, Actually, that's a Snowshoe IP. Which, on balance, can be a good thing, slaying-wise. :) You mean that it's more likely to be added to the SBL with the other IPs in the same range sooner? Almost four years ago, I posted my approach to snowshoe slaying: http://mail-archives.apache.org/mod_mbox/spamassassin-users/200902.mbox/%3c20090204.0...@iowahoneypot.com%3e It has continued to evolve since then. Both IP block tracking and identity (Subject From.Realname) header token checking are still the two most useful approaches. I read your email from four years ago. How has it evolved? We have created a few scripts that let you paste a phrase from a FN into a text file, which is then turned into a rule. So Olde Brooklyn Lantern in the body would get a score, etc. Combined with ZEN and/or the SBL, and I think this is similar to what you're doing, correct? I see you have hits on RELAYCOUNTRY. If you maintain your own virtual snowshoe nations, and merge them into your real nations, while building a list of snowshoe tokens, you'll have very good success catching these. At one point I hoped I could exclude certain countries, or score some higher than others, but too much legitimate mail is received from all over the world. Got burned too many times. For example, that IP is in root eSolutions space, and they have had a snowshoe problem for at least a year and a half. Here are the ranges I have in my small-scale database: 94.242.192.0 - 94.242.255.255 188.42.0.0 - 188.42.127.255 212.117.160.0 - 212.117.191.255 Do you list them all as class C's, or is there a CIDR mask that matches these? I've found many class C's in 41/8; I'd really like to know what valid companies use this whole class A, or better isolate the class C's to block them. About two years ago, I hit a tipping point with my snowshoe IP data, and can now _VERY_ rapidly identify new blocks. I would really be interested in that, especially if it's beyond what is already available in the SBL.
Both of these phrases are in my snowshoe tokens database: Classic Lantern Incredible Light How do these phrases relate to a snowshoe IP range? And one that isn't already part of the SBL? You would have to at least catch that phrase on two IPs in the same class C before you could consider it a snowshoe, correct? I checked, and one of my best data feeds was hit by the same IP block in your sample. Here are quick dumps of the contents of the identity headers: frequency and contents of Field [Subject], filtered by [all IP w/188.42.11.] A unique christmas gift for the kids A variety of medigap options explained and simplified Perhaps not to the same degree as you do, but I also have these phrases in my local database from which rules are created. Do you have a mechanism to auto-generate them? Shouldn't this be incorporated into Justin's SOUGHT rules? As soon as I've finished a couple of timesink projects, I'll start on those. - Chip Thanks, Alex
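The phrase-to-rule scripts mentioned in this exchange could look something like the following sketch. The rule-name prefix, score, and helper name are hypothetical, not the actual scripts discussed:

```python
import re

# Hypothetical sketch: turn a phrase pasted from a false negative into a
# SpamAssassin body rule, as the thread's "paste a phrase into a text file,
# which is then generated into a rule" workflow describes.
def phrase_to_rule(phrase, prefix="LOCAL_SNOWSHOE", score=3.0):
    # Derive a rule name from the phrase (uppercase, word chars only):
    name = prefix + "_" + re.sub(r"\W+", "_", phrase).strip("_").upper()
    # Match the words with flexible whitespace between them:
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
    return (f"body     {name} /{pattern}/i\n"
            f"describe {name} snowshoe phrase: {phrase}\n"
            f"score    {name} {score}\n")

print(phrase_to_rule("Olde Brooklyn Lantern"))
```

Each harvested phrase then becomes one body rule that can be dropped into a local .cf file.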
re: Trouble with bayes poisoning spam
Hi Alex! Actually, that's a Snowshoe IP. Which, on balance, can be a good thing, slaying-wise. :) Almost four years ago, I posted my approach to snowshoe slaying: http://mail-archives.apache.org/mod_mbox/spamassassin-users/200902.mbox/%3c20090204.0...@iowahoneypot.com%3e It has continued to evolve since then. Both IP block tracking and identity (Subject From.Realname) header token checking are still the two most useful approaches. I see you have hits on RELAYCOUNTRY. If you maintain your own virtual snowshoe nations, and merge them into your real nations, while building a list of snowshoe tokens, you'll have very good success catching these. For example, that IP is in root eSolutions space, and they have had a snowshoe problem for at least a year and a half. Here's their ranges that I have in my small scale database: 94.242.192.0 - 94.242.255.255 188.42.0.0 - 188.42.127.255 212.117.160.0 - 212.117.191.255 About two years ago, I hit a tipping point with my snowshoe IP data, and can now _VERY_ rapidly identify new blocks. Both of these phrases are in my snowshoe tokens database: Classic Lantern Incredible Light I checked, and one of my best data feeds was hit by the same IP block in your sample. Here are quick dumps of the contents of the identity headers: frequency and contents of Field [Subject], filtered by [all IP w/188.42.11.] A unique christmas gift for the kids A variety of medigap options explained and simplified Burn off that belly while you're sleeping Compensation information for those that suffered from mesh patch complications Endless inventory of electronics at 1/5th of what you'd pay for retail Ever wondered what it would be like to fly in a private jet? 
It's time you chopped that home payment in half Learn a new tongue in days Simple solutions for Medicare and Medigap Speak Japanese in two weeks Stop wasting time, start saving on your home payment We have your guide to being prepared in the event of a crisis or natural disaster You can get a Kindle Fire HD for around thirty bucks Your guide to being prepared in the event of a crisis or natural disaster frequency and contents of Field [RealnameFrom], filtered by [all IP w/188.42.11.] Adorable Santa Letters Become Multilingual Better Rates Today Gain Kowledge House Payment Halfer Lose Pounds No Gym MacBooks From 150.00 Medicare Made Simple Medigap/Medicare Explained Mesh Patch Patient Alert Private Jet Share Packages Samsung Galaxy Sold 28.54 Surgical Mesh Patch Patient Alert Your Crisis Preparation Guide When I get the time, AND some volunteers to help, I plan to publish the most statistically significant data from BOTH databases. :) Rob's Invalument data is supposed to be very helpful for snowshoe detection. Eventually, I'll get around to trying it. :) *** John: How practical would it be to create some metas that hinged off a snowshoe nation hit on RelayCountry? We'd have to define some virtual nation codes, but that's easy. I'm using a letter + number combo, since none of the official two digit country codes contain a number. That way, you and others could come up with some very nifty snowshoe focused tests, and they would ONLY trigger if the sender used a known snowshoe negligent host, AND the recipient server chose to use IP-to-Nation tests. Win-win. :) I have the naively optimistic notion that some snowshoe hosts simply do not have anti-spam expertise, and if there was a reliable library of snowshoe patterns, they might test the outgoing mail of new customers. :) This week, I posted a list of proposed 2013 projects to my volunteers, and at the top is exporting our MassCheck data for SA. Also on the list are phish and snowshoe data sharing. 
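The proposed RelayCountry metas might be sketched like this. Everything here is hypothetical: "S1" is a made-up virtual nation code assumed to be assigned to tracked snowshoe ranges, and LOCAL_SNOWSHOE_PHRASE stands in for a locally generated token rule. The X-Relay-Countries pseudo-header is what the RelayCountry plugin exposes to header rules.

```
# Hypothetical sketch -- rule names, the "S1" code, and the score are
# illustrative, not a published ruleset.
header   __RELAY_SNOWSHOE_NATION  X-Relay-Countries =~ /\bS1\b/
meta     LOCAL_SNOWSHOE_COMBO     (__RELAY_SNOWSHOE_NATION && LOCAL_SNOWSHOE_PHRASE)
describe LOCAL_SNOWSHOE_COMBO    relayed via snowshoe-negligent host and hits a snowshoe phrase
score    LOCAL_SNOWSHOE_COMBO    3.5
```

As the post notes, such metas only fire when the receiving server opts into IP-to-nation tests, so sites that don't use RelayCountry are unaffected.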
:) As soon as I've finished a couple of timesink projects, I'll start on those. - Chip
Trouble with bayes poisoning spam
Hi, I have an example of spam that I just can't reliably detect: http://pastebin.com/YuuLuA1x It's basically some HTML with a URL to an ad for Lantern with 9 LED bulbs. I've trained hundreds of these, and they still report BAYES_50. I've just tested it now, a few hours after having first received it, and it's already being flagged by several URIBLs and is hitting BAYES_99 since I've now trained it. I was just wondering if there was something else that could be triggered on in the header to catch these sooner? I'm assuming the sending IP is part of a botnet? I'm using v3.3.2 on fc15 with amavisd. Thanks, Alex
Re: Trouble with bayes poisoning spam
On Thu, 29 Nov 2012, Alex wrote: I have an example of spam that I just can't reliably detect: http://pastebin.com/YuuLuA1x I was just wondering if there was something else that could be triggered on in the header to catch these sooner? I'm assuming the sending IP is part of a botnet? I'm using v3.3.2 on fc15 with amavisd. I'm wondering why this didn't hit any rules: font-size:4px; That's too small to read and should be a good indicator of bayes poison, just like setting the font to white. -- John Hardin KA7OHZ http://www.impsec.org/~jhardin/ jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- Bother, said Pooh as he struggled with /etc/sendmail.cf, it never does quite what I want. I wish Christopher Robin was here. -- Peter da Silva in a.s.r --- 26 days until Christmas
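A local rule for the unreadably small font John mentions might look like the following sketch. The rule name and score are hypothetical; the pattern catches integer or fractional sizes from 0 to 4 pixels:

```
# Hypothetical local rule; name and score are illustrative.
rawbody  LOCAL_TINY_FONT  /font-size\s*:\s*[0-4](?:\.\d+)?px/i
describe LOCAL_TINY_FONT  HTML sets a font size too small to read
score    LOCAL_TINY_FONT  1.5
```

A low score is prudent here, since some legitimate HTML mail uses tiny fonts for spacer elements.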
Bayes Poisoning
One of my users submitted a spam for analysis, and I was amazed at the efforts this troglodyte expended to poison bayes. Is it worth the effort to try to find huge html comments hiding junk like this? Maybe something like rawbody OBFU_HTML_LONG_COMMENT /<!--.{1024,}?-->/ describe OBFU_HTML_LONG_COMMENT contains a ridiculously long html comment -- Daniel J McDonald, CCIE # 2495, CISSP # 78281
Re: Bayes Poisoning
On 10/18/2011 8:53 AM, Daniel McDonald wrote: One of my users submitted a spam for analysis, and I was amazed at the efforts this troglodyte expended to poison bayes. Is it worth the effort to try to find huge html comments hiding junk like this? Maybe something like rawbody OBFU_HTML_LONG_COMMENT /<!--.{1024,}?-->/ describe OBFU_HTML_LONG_COMMENT contains a ridiculously long html comment It may be worthwhile trying to find overly-long comments, but unfortunately, it's not quite as easy as that. The problem is making sure the beginning and ending markers are part of the same comment. Your example would be tripped up if there was a small comment at the beginning of the message and another small comment at the end. It would count characters between the beginning of the first comment and the end of the second one. As far as Bayes Poisoning, I'm not sure there is any such thing. Any random text that a spammer dumps into his emails is unlikely to match the pattern of your normal emails. So just feed it to Bayes and let it do its job. Bayes works amazingly well if trained properly. :) -- Bowie
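Bowie's objection, and a possible fix, can be demonstrated with a tempered pattern that forbids a closing marker inside the matched span. A sketch (the naive pattern is the one proposed above; the tempered one is an assumption, not from the thread):

```python
import re

# Naive: lazily match >=1024 chars between comment markers.
naive = re.compile(r'<!--.{1024,}?-->', re.S)
# Tempered: every character in the body must NOT begin a '-->' sequence,
# so the match can never span from one comment into another.
tempered = re.compile(r'<!--(?:(?!-->).){1024,}-->', re.S)

# Two short comments with 2000 chars of normal text between them:
msg = "<!-- a -->" + "x" * 2000 + "<!-- b -->"
print(bool(naive.search(msg)))      # True: span crosses comment boundaries
print(bool(tempered.search(msg)))   # False: body may not contain '-->'

# One genuinely huge comment still matches the tempered pattern:
one_big = "<!--" + "y" * 2000 + "-->"
print(bool(tempered.search(one_big)))  # True
```

The tempered form is slower on large bodies, which matters for a rawbody rule, so a length sanity check before scanning may be worthwhile.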
Re: Bayes Poisoning
Daniel McDonald dan.mcdon...@austinenergy.com wrote: rawbody OBFU_HTML_LONG_COMMENT /<!--.{1024,}?-->/ describe OBFU_HTML_LONG_COMMENT contains a ridiculously long html comment Tried with exactly that limit, 1 kb. TargetX, which is used by universities in recruiting, uses a long comment in its generated mail (I did not keep a note of how many kb). Travelocity puts a 28 kb comment in confirmation messages. We were scoring 1.0 for it, and we gave up after a few more fp cases, rather than keep whitelisting. It has to do with email generated from scripts written by web designers. They're as good at email as I am at designing web pages :-) Joseph Brennan Lead Email Systems Engineer Columbia University Information Technology
Re: Bayes Poisoning
On Tue, 2011-10-18 at 07:53 -0500, Daniel McDonald wrote: One of my users submitted a spam for analysis, and I was amazed at the efforts this troglodyte expended to poison bayes. Is it worth the effort to try to find huge html comments hiding junk like this? Hmm, wait -- Bayes and HTML comments in the same thought. Are you trying to imply the malicious Bayes tokens are inside the comment? While this kind of attack might work with other Bayesian Classifier implementations out there, it does NOT fool SA. The (body) Bayes tokens SA uses are gathered from the *rendered* body text. All HTML dropped, including comments. If you want to find out why that message has a low Bayes score, you'll have to use Template Tags to extract and investigate the tokens. Pointing at the HTML comment is a red herring.
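The point that SA's Bayes tokenizes the rendered text rather than the raw HTML can be illustrated with a toy tokenizer. This is a deliberate simplification, not SA's actual rendering code:

```python
import re

def rendered_tokens(html):
    # Drop HTML comments first, then remaining tags, then tokenize.
    # Mirrors the claim above: poison hidden in comments never becomes
    # a Bayes token because only rendered text is tokenized.
    text = re.sub(r'<!--.*?-->', ' ', html, flags=re.S)
    text = re.sub(r'<[^>]+>', ' ', text)
    return text.split()

print(rendered_tokens("<p>Buy now<!-- random poison words --></p>"))
```

The comment's "random poison words" never appear in the token list, so they cannot skew the classifier.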
Re: Bayes Poisoning
On 10/18/11 12:12 PM, Karsten Bräckelmann guent...@rudersport.de wrote: On Tue, 2011-10-18 at 07:53 -0500, Daniel McDonald wrote: One of my users submitted a spam for analysis, and I was amazed at the efforts this troglodyte expended to poison bayes. Is it worth the effort to try to find huge html comments hiding junk like this? Hmm, wait -- Bayes and HTML comments in the same thought. Are you trying to imply the malicious Bayes tokens are inside the comment? While this kind of attack might work with other Bayesian Classifier implementations out there, it does NOT fool SA. The (body) Bayes tokens SA uses are gathered from the *rendered* body text. All HTML dropped, including comments. Fair enough. I see that the URLs in this message have been picked up by invaluement and razor, so we probably have enough points to toss it in the quarantine now anyway. -- Daniel J McDonald, CCIE # 2495, CISSP # 78281
ReturnPath and Bayes Poisoning
Are there any internal checks that disable Bayes autolearn when these artificial whitelist rules match? I'd disabled these rules in versions prior to 3.3.0 but, with all the discussion on the matter, I thought I'd leave them in to see the new and improved version. Unfortunately, I'm still seeing false positives and am concerned that they are pushing the scores low enough to poison my Bayes database. /Jason
Re: ReturnPath and Bayes Poisoning
On 2/23/10 9:03 AM, Jason Bertoch wrote: Are there any internal checks that disable Bayes autolearn when these artificial whitelist rules match? I'd disabled these rules in versions prior to 3.3.0 but, with all the discussion on the matter, I thought I'd leave them in to see the new and improved version. Unfortunately, I'm still seeing false positives and am concerned that they are pushing the scores low enough to poison my Bayes database. you can edit the tflags and add noautolearn example: 72_active.cf:tflags __RCVD_IN_DNSWL nice net becomes: 72_active.cf:tflags __RCVD_IN_DNSWL nice net noautolearn -- Michael Scheidell, CTO Phone: 561-999-5000, x 1259 | SECNAP Network Security Corporation * Certified SNORT Integrator * 2008-9 Hot Company Award Winner, World Executive Alliance * Five-Star Partner Program 2009, VARBusiness * Best Anti-Spam Product 2008, Network Products Guide * King of Spam Filters, SC Magazine 2008 __ This email has been scanned and certified safe by SpammerTrap(r). For Information please see http://www.secnap.com/products/spammertrap/ __
Re: ReturnPath and Bayes Poisoning
Michael Scheidell wrote: On 2/23/10 9:03 AM, Jason Bertoch wrote: Are there any internal checks that disable Bayes autolearn when these artificial whitelist rules match? I'd disabled these rules in versions prior to 3.3.0 but, with all the discussion on the matter, I thought I'd leave them in to see the new and improved version. Unfortunately, I'm still seeing false positives and am concerned that they are pushing the scores low enough to poison my Bayes database. you can edit the tflags and add noautolearn example: 72_active.cf:tflags __RCVD_IN_DNSWL nice net becomes: 72_active.cf:tflags __RCVD_IN_DNSWL nice net noautolearn Are these settings cumulative? The man page doesn't specify. If I do this: tflags RULENAME nice net tflags RULENAME noautolearn what happens? Does everything get set, or do I only get 'noautolearn'? -- Bowie
Re: ReturnPath and Bayes Poisoning
On 2/23/2010 9:20 AM, Michael Scheidell wrote: Unfortunately, I'm still seeing false positives and am concerned that they are pushing the scores low enough to poison my Bayes database. you can edit the tflags and add noautolearn example: 72_active.cf:tflags RCVD_IN_RP_CERTIFIED net nice 72_active.cf:tflags RCVD_IN_RP_SAFE net nice becomes: 72_active.cf:tflags RCVD_IN_RP_CERTIFIED net nice noautolearn 72_active.cf:tflags RCVD_IN_RP_SAFE net nice noautolearn Nice, I didn't realize it worked like that. To make this permanent, do I need to set the score to zero and copy the rules to a different name in local.cf, or will a second tflags declaration in local.cf simply override the one in 72_active.cf? /Jason
Re: ReturnPath and Bayes Poisoning
On 2/23/10 9:28 AM, Bowie Bailey wrote: Michael Scheidell wrote: On 2/23/10 9:03 AM, Jason Bertoch wrote: Are there any internal checks that disable Bayes autolearn when these artificial whitelist rules match? I'd disabled these rules in versions prior to 3.3.0 but, with all the discussion on the matter, I thought I'd leave them in to see the new and improved version. Unfortunately, I'm still seeing false positives and am concerned that they are pushing the scores low enough to poison my Bayes database. you can edit the tflags and add noautolearn example: 72_active.cf:tflags __RCVD_IN_DNSWL nice net becomes: 72_active.cf:tflags __RCVD_IN_DNSWL nice net noautolearn Are these settings cumulative? The man page doesn't specify. If I do this: tflags RULENAME nice net tflags RULENAME noautolearn what happens? Does everything get set or do I only get 'noautolearn'? why not just do tflags RULENAME nice net noautolearn (oh, and to find them: grep '^tflags.*RCVD_IN' *.cf) Some interesting ones; not sure why they rate a nice net: RCVD_IN_IADB_OPTOUTONLY net nice? Its describe is: IADB: Scrapes addresses, pure opt-out only. Or: describe RCVD_IN_IADB_NOCONTROL IADB: Has absolutely no mailing controls in place. I would think a POSITIVE score is warranted for someone we know violates federal CAN-SPAM laws (scrapes addresses).
Re: ReturnPath and Bayes Poisoning
On 2/23/2010 9:35 AM, Michael Scheidell wrote: why not just do tflags RULENAME nice net noautolearn (oh, and to find them: grep '^tflags.*RCVD_IN' *.cf) Some interesting ones; not sure why they rate a net nice: Grepping for 'autolearn' turns up the built-in whitelist and blacklist rules. I wonder, why wasn't it applied to the RP and DNSWL rules as well? Perhaps I should request a rule change. Thoughts? /Jason
Re: ReturnPath and Bayes Poisoning
Michael Scheidell wrote: On 2/23/10 9:28 AM, Bowie Bailey wrote: Michael Scheidell wrote: On 2/23/10 9:03 AM, Jason Bertoch wrote: Are there any internal checks that disable Bayes autolearn when these artificial whitelist rules match? I'd disabled these rules in versions prior to 3.3.0 but, with all the discussion on the matter, I thought I'd leave them in to see the new and improved version. Unfortunately, I'm still seeing false positives and am concerned that they are pushing the scores low enough to poison my Bayes database. you can edit the tflags and add noautolearn example: 72_active.cf:tflags __RCVD_IN_DNSWL nice net becomes: 72_active.cf:tflags __RCVD_IN_DNSWL nice net noautolearn Are these settings cumulative? The man page doesn't specify. If I do this: tflags RULENAME nice net tflags RULENAME noautolearn what happens? Does everything get set or do I only get 'noautolearn'? why not just do tflags RULENAME nice net noautolearn (oh, and to find them: grep '^tflags.*RCVD_IN' *.cf) If I can just add 'noautolearn' in my local.cf, then I don't have to worry about what is currently set in the distributed rules. And if an update adds or removes a setting, it will happen automatically without me having to mess with it. -- Bowie
Re: ReturnPath and Bayes Poisoning
On Tue, 2010-02-23 at 09:28 -0500, Bowie Bailey wrote: Michael Scheidell wrote: you can edit the tflags and add noautolearn Are these settings cumulative? The man page doesn't specify. Nope. tflags is of type CONF_TYPE_HASH_KEY_VALUE, so there's exactly one tflags value per rule name. tflags RULENAME nice net tflags RULENAME noautolearn what happens? Does everything get set or do I only get 'noautolearn'? The latter wins and overwrites the former.
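Given that a later tflags line replaces, rather than merges with, an earlier one, a local.cf override must restate every flag the rule still needs. A hedged sketch using the rule names from this thread:

```
# local.cf -- hypothetical override: this line REPLACES the stock tflags
# for each rule, so restate 'net nice' along with the new 'noautolearn'.
tflags RCVD_IN_RP_CERTIFIED net nice noautolearn
tflags RCVD_IN_RP_SAFE      net nice noautolearn
```

Because local.cf is read after the distributed rules, these lines win, with the caveat Bowie raises: if an update changes the stock flags, the override must be kept in sync by hand.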
Solution to Bayes poisoning, high load levels, image spam, and botnet spam
I'm seeing a lot of people saying that bayes isn't working like it used to, that load levels are high, and that they are getting a lot of image and botnet spam. There are a few simple tricks you can do to get rid of 90% of it. First - use dummy MX records. Real mail retries. Botnets and most spammers don't. It's easier for them to try to spam someone else than to fight your filter. MX config is as follows: dummy - 10 real - 20 real-backups - 30 dummy - 40 dummy - 50 dummy - 60 ... All dummy IP addresses are dead IPs. Port 25 closed. Don't do a 4xx on the lowest-numbered IP, because QMail is brain dead and won't retry the higher-numbered servers. The upper MX can return 4xx if you want to log botnet traffic. This will eliminate 75%-90% of your spam with no false positives, just from making this change. Second - use blacklists in a way that blocks the spam, not just scores it. If you use the Spamhaus list you'll get rid of about 1/3 of what's left. Then - you just let SA process the rest. What you'll find is that almost all botnet spam will be gone, and Bayes will start working again. Load levels will drop dramatically. Another thing - I don't know what everyone else uses, but Exim is my MTA and it can easily be configured to do just about anything you can imagine. If you are unhappy with your MTA, Exim is, I think, the right choice. Another solution is to just have me get rid of your spam for you and make the problem go away. If anyone is tired of all this and just wants it done, you can email me privately and I'll set you up.
Re: Solution to Bayes poisoning, high load levels, image spam, and botnet spam
Marc Perkel schrieb: I'm seeing a lot of people saying that bayes isn't working like it used to, that load levels are high, and that they are getting a lot of image and botnet spam. There are a few simple tricks you can do to get rid of 90% of it. 56th reinvention of the square wheel. You might wanna search this list's archive for further comments ... arni
Re: Solution to Bayes poisoning, high load levels, image spam, and botnet spam
Marc Perkel schrieb: I'm seeing a lot of people saying that bayes isn't working like it used to, that load levels are high, and that they are getting a lot of image and botnet spam. There are a few simple tricks you can do to get rid of 90% of it. Ah, nice. Can you tell me how to implement this in SpamAssassin?
Re: Solution to Bayes poisoning, high load levels, image spam, and botnet spam
First - use dummy MX records. Real mail retries. Botnets and most spammers don't. It's easier for them to try to spam someone else than to fight your filter. MX config is as follows: dummy - 10 real - 20 real-backups - 30 dummy - 40 dummy - 50 dummy - 60 Currently I have mail.mydomain.com as 10. Can I just change that to 20 and add mail5.mydomain.com as 10 but not have an IP associated with mail5.mydomain.com, or will that cause trouble? Matt
Re: Solution to Bayes poisoning, high load levels, image spam, and botnet spam
Matt wrote: First - use dummy MX records. Real mail retries. Botnets and most spammers don't. It's easier for them to try to spam someone else than to fight your filter. MX config is as follows: dummy - 10 real - 20 real-backups - 30 dummy - 40 dummy - 50 dummy - 60 Currently I have mail.mydomain.com as 10. Can I just change that to 20 and add mail5.mydomain.com as 10 but not have an IP associated with mail5.mydomain.com, or will that cause trouble? Matt Are you sure about this approach? Most of what hits our backup server, listed at a higher MX record, is spam. I was, and am, under the impression that many spambots are set to fire at higher MXs under the assumption that admins might not spend as much time on the anti-spam set-up of these servers.
Re: Solution to Bayes poisoning, high load levels, image spam, and botnet spam
Craig Carriere wrote: Matt wrote: First - use dummy MX records. Real mail retries. Botnets and most spammers don't. It's easier for them to try to spam someone else than to fight your filter. MX config is as follows: dummy - 10 real - 20 real-backups - 30 dummy - 40 dummy - 50 dummy - 60 Currently I have mail.mydomain.com as 10. Can I just change that to 20 and add mail5.mydomain.com as 10 but not have an IP associated with mail5.mydomain.com, or will that cause trouble? Matt Are you sure about this approach? Most of what hits our backup server, listed at a higher MX record, is spam. I was, and am, under the impression that many spambots are set to fire at higher MXs under the assumption that admins might not spend as much time on the anti-spam set-up of these servers. Yes - the trick works two ways. If the spambots hit the high server then there's nothing there and they go on. If they hit the lowest-numbered server they also get nothing and go on. A real server will hit the lowest-numbered MX, get nothing, and then retry and get the second-lowest one, which is real. The trick relies on the idea that spambots, unlike real servers, won't walk the MX order looking for the real server. If I were a spammer I would think it easier to move on to the next email address than to try to fight a good spam filter.
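The MX layout described in this thread might look like the following zone-file fragment. Hostnames are placeholders, and dummy1-dummy3 are assumed to resolve to dead IPs with port 25 closed:

```
; hypothetical zone fragment for the dummy-MX layout
example.com.  IN  MX  10  dummy1.example.com.   ; dead IP, port 25 closed
example.com.  IN  MX  20  mail.example.com.     ; real primary
example.com.  IN  MX  30  backup.example.com.   ; real backup
example.com.  IN  MX  40  dummy2.example.com.   ; dead IP
example.com.  IN  MX  50  dummy3.example.com.   ; dead IP
```

A compliant MTA that fails on priority 10 will retry the next-lowest priority (20) and deliver; a bot that only fires at the lowest or highest priority hits a dead host.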
Re: bayes poisoning
maillist wrote: I see a few emails every now and then about bayes poisoning, and am wondering what it means. From what I understand, it is some message that gets learned (only through autolearn?) that has certain characteristics that throw the bayes system off. From what I've seen there are generally two ways it is referred to: 1. random text or phrases thrown into spam to make spam and ham look more alike: This is an imagined problem. 2. spam incorrectly learned as ham, or ham incorrectly learned as spam: Enough of these (either from manual or auto learning) and your Bayes database will be useless. -- Chris
Bayes Poisoning
I've been having problems with bayes as of late, with it marking nonspam as spam and spam as nonspam. I think it's the damn gif file spam causing it. Anyone else having this problem? Any solutions?
Bayes poisoning (was Re: your mail)
The messages are simply a random stream of words, with punctuation scattered in them. No HTML, no URLs being advertised, no excessive capitalisation, just meaningless text. Technically, then, it's not spam. Spam requires a commercial message of some sort. :) Yeah, I think I said 'junk' rather than spam. I wonder if such mail has a name? I would agree that it's an attempt to poison your bayes database, assuming that you have autolearn turned on, either by skewing the scores towards ham or by bloating the database. Do you think the perpetrators are poisoning the bayes db with a view to sending spam at a later date? We aren't a big organisation - a few hundred mailboxes - so it seems rather great lengths for a spammer to go to. Another suggestion was that the spammer had intended to attach an image, which hadn't got through. Given the technical competence of many spammers, it seems more likely they screwed up and forgot to attach the image. But I'm just guessing here. Any thoughts on what I can do about these messages? Even with bayes turned off, they would still fail to score more than say 2 or 3. Each message contains a different paragraph of random text, so it's not possible to pick out keywords; and the messages are coming from dialup machines, so blocking IPs isn't going to be very effective. Look for punctuation? A good deal of the random bayes poison at one time was totally without punctuation. I'm cautious about feeding these messages to sa-learn as spam, in case it has a negative impact on genuine messages. The punctuation is pretty good - full stops every dozen words or so, the odd comma. In fact, it's probably better punctuation than most of my users use. :) At the moment I'm just black-listing hosts or netblocks which this junk is coming from. Apologies for not setting a subject in my original mail, by the way. Peter Smith
RE: Bayes poisoning (was Re: your mail)
Peter Smith wrote: The messages are simply a random stream of words, with punctuation scattered in them. No HTML, no URLs being advertised, no excessive capitalisation, just meaningless text. I'm cautious about feeding these messages to sa-learn as spam, in case it has a negative impact on genuine messages. The punctuation is pretty good - full stops every dozen words or so, the odd comma. In fact, it's probably better punctuation than most of my users use:) At the moment I'm just black-listing host or netblocks which this junk is coming from. As long as you learn the messages as spam, they will have no negative impact. The only way these messages could cause problems is if they get autolearned as ham instead of spam. -- Bowie
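Training the saved samples as spam, as suggested above, is a short job; a sketch with hypothetical paths (a maildir or mbox of the collected random-text messages):

```
# hypothetical paths -- train the saved samples as spam, then sync
sa-learn --spam /var/spool/poison-samples/
sa-learn --sync
sa-learn --dump magic    # sanity-check that the nspam counter moved
```

Once learned as spam, the random-word vocabulary counts toward spam rather than diluting the ham side of the database.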
Bayes poisoning ?
Hi We are using Spamassassin + Postfix + Mailscanner on our SMTP servers. Of late I have noticed that a lot of ham mails are getting a high BAYES score. I have overridden bayes with lower scores in order to avoid false positives (and possibly mail loss). How do I de-poison the bayes database, and are there any ways to avoid bayes poisoning? Thanks Ram -- Netcore Solutions Pvt. Ltd. Website: http://www.netcore.co.in Spamtraps: http://cleanmail.netcore.co.in/directory.html --
Re: Bayes poisoning ?
The best thing to do is probably throw the current database away and start over. As you seem to have several users, you should have bayes working again within a very few hours, or less. You should delete the current database, reset the scores to normal (and increase the BAYES_99 score to something around 4 if you aren't using 3.0.4), and then manually train Bayes on a few hundred known ham and spam before letting autolearning take over. The other thing you should do is decrease the bayes autolearn ham threshold to 0, or even -0.1 or so. By default it is too high, and will far too often lead to bayes poisoning if the state of the database isn't watched carefully. You may also want to take the bayes autolearn spam threshold up to a higher value than it has by default, although this usually isn't required. Loren
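Put together, the recovery steps above might look like this. Corpus paths are placeholders; the threshold options are standard SpamAssassin settings, with values as suggested in the reply:

```
# wipe the poisoned database and retrain from known-good corpora
sa-learn --clear
sa-learn --ham  /path/to/known-ham/
sa-learn --spam /path/to/known-spam/

# local.cf -- only autolearn ham from clearly negative scores,
# and require a high score before autolearning spam:
bayes_auto_learn_threshold_nonspam -0.1
bayes_auto_learn_threshold_spam    12.0
```

A negative ham threshold means a message must hit whitelist-style negative-scoring rules before it is autolearned as ham, which guards against borderline spam seeding the database.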