Re: Massive Spam Attack?
Given the level of the traffic, you might look at implementing something like Deny Spammers at the /24 level (rather than the host level). https://sourceforge.net/projects/deny-spammers/

-faisal

On May 13, 2007, at 12:15 AM, Jason Frisvold wrote:

On 5/12/07, Jason Frisvold [EMAIL PROTECTED] wrote:

I installed the botnet plugin today, but it's not going to help anyway. The IPs these are coming from resolve to a variety of different hostnames, all without triggering botnet at all. Here's a sample of the hits I'm getting ... As you can see, it's a bunch of different IPs in various ranges. I've decided to just block the ranges at this point. I have no idea if there's anything legit in there, but I'll take that risk...

baseball142.pamwheeled.com (66.96.245.142)
baseball15.hammersmoky.com (66.96.245.15)
baseball167.pamwheeled.com (66.96.245.167)
baseball168.pamwheeled.com (66.96.245.168)
baseball184.itlivestock.com (66.96.245.184)
baseball20.hammersmoky.com (66.96.245.20)
baseball210.itlivestock.com (66.96.245.210)
baseball237.burmesetow.com (66.96.245.237)
baseball247.burmesetow.com (66.96.245.247)
baseball31.hammersmoky.com (66.96.245.31)
baseball6.hammersmoky.com (66.96.245.6)
baseball75.platenormal.com (66.96.245.75)
crowflies110.yentropical.com (65.111.26.110)
crowflies131.yentropical.com (65.111.26.131)
crowflies15.mowcraving.com (65.111.26.15)
crowflies168.ropepin.com (65.111.26.168)
crowflies176.ropepin.com (65.111.26.176)
crowflies186.ropepin.com (65.111.26.186)
crowflies19.mowcraving.com (65.111.26.19)
crowflies33.mowcraving.com (65.111.26.33)
crowflies42.mowcraving.com (65.111.26.42)
crowflies57.beforefor.com (65.111.26.57)
crowflies63.beforefor.com (65.111.26.63)
lampshade144.acidicbee.com (66.240.249.144)
lampshade153.acidicbee.com (66.240.249.153)
lampshade161.acidicbee.com (66.240.249.161)
lampshade183.acidicbee.com (66.240.249.183)
lampshade213.acidicbee.com (66.240.249.213)
lampshade231.acidicbee.com (66.240.249.231)
lampshade239.acidicbee.com (66.240.249.239)
later112.itbobble.com (216.74.88.112)
later13.divesthow.com (216.74.88.13)
later15.divesthow.com (216.74.88.15)
later189.tarponway.com (216.74.88.189)
later20.divesthow.com (216.74.88.20)
later216.usefulget.com (216.74.88.216)
later217.usefulget.com (216.74.88.217)
later225.usefulget.com (216.74.88.225)
later250.usefulget.com (216.74.88.250)
later69.itbobble.com (216.74.88.69)
mail136.yenram.com (64.191.11.136)
mail237.todinto.com (64.191.11.237)
mail239.todinto.com (64.191.11.239)
mail250.todinto.com (64.191.11.250)
mail91.rangeat.com (64.191.11.91)
movie113.fencingnow.com (216.10.25.113)
movie119.fencingnow.com (216.10.25.119)
movie120.fencingnow.com (216.10.25.120)
movie126.fencingnow.com (216.10.25.126)
movie166.measleit.com (216.10.25.166)
movie184.measleit.com (216.10.25.184)
movie207.fosteris.com (216.10.25.207)
movie78.fencingnow.com (216.10.25.78)
mustang214.pugto.com (72.37.196.214)
mustang242.pugto.com (72.37.196.242)
omega172.dressyoung.com (66.197.254.172)
omega199.dressyoung.com (66.197.254.199)
omega225.dressyoung.com (66.197.254.225)
omega237.dressyoung.com (66.197.254.237)
omega86.byknife.com (66.197.254.86)
pick17.heatscanna.com (64.192.26.17)
pick182.runninghit.com (64.192.26.182)
rainy206.grimacehot.com (66.96.252.206)
rush100.standbot.com (66.96.255.100)
rush101.standbot.com (66.96.255.101)
rush103.standbot.com (66.96.255.103)
rush131.ifweight.com (66.96.255.131)
rush188.whobeak.com (66.96.255.188)
rush206.whobeak.com (66.96.255.206)
rush208.whenpile.com (66.96.255.208)
rush226.whenpile.com (66.96.255.226)
rush232.whenpile.com (66.96.255.232)
rush236.whenpile.com (66.96.255.236)
rush251.whenpile.com (66.96.255.251)
source238.wearisen.com (216.74.120.238)
source244.wearisen.com (216.74.120.244)
teaching200.wordssort.com (64.192.28.200)
teaching33.camelcoat.com (64.192.28.33)

-- Jason 'XenoPhage' Frisvold [EMAIL PROTECTED]
http://blog.godshell.com
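The /24-level blocking suggested above starts with collapsing the hit list into its distinct /24 networks. A minimal sketch of that aggregation step using only the Python standard library (the function name is mine; Deny Spammers itself is its own project and works differently):

```python
# Sketch: collapse a list of offending IPs into the distinct /24 networks
# they fall in, so they can be blocked as ranges rather than as hosts.
import ipaddress

def to_slash24s(ips):
    """Return the sorted, distinct /24 networks covering the given IPs."""
    nets = {ipaddress.ip_network(ip + "/24", strict=False) for ip in ips}
    return sorted(nets)

# A few of the sample hits from the message above:
hits = ["66.96.245.142", "66.96.245.15", "65.111.26.110", "216.74.88.13"]
for net in to_slash24s(hits):
    print(net)
```

The resulting networks map directly onto firewall deny rules, one per /24.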
Re: Massive Spam Attack?
On May 11, 2007, at 10:54 PM, Jason Frisvold wrote:

It appears that each mail is sent by a unique IP, so it doesn't look like a simple firewall rule will stop this.

Is every single message coming from a unique IP, or is it just that they're widely distributed?

-faisal
Re: Any drawbacks of cron-scheduled bayesian learning?
On Apr 25, 2007, at 5:49 AM, Arik Raffael Funke wrote:

I was wondering if it has any negative effects on my Bayes database if I regularly learn all spam/ham messages via a cron job. Sa-learn skips already learned messages. Am I thus right to assume that apart from the relatively high CPU load there are no drawbacks? Or should I keep a separate folder for new spam/ham?

I did this for a while and didn't find any problems. I'm using Maildir, and I only trained on the cur folders, not the new folders. In theory this would prevent me from training on something that had come in mis-filed (so long as I remembered to quit my mail client at night). See here for details and a script to do this: http://www.faisal.com/software/sa-harvest/

Note that this script will also attempt to rebuild your whitelist (all the code after the 'sa-learn --dump magic'). This has some downsides, and turns out to be less useful with modern SpamAssassin, so I'm reworking the script to break out the whitelist code into a separate script.

That said, I keep a rolling 1-month corpus of spam, so it's easy to retrain when I need to. I stopped doing full retrains on cron, and at this point I only retrain on messages that were misfiled. See: http://www.faisal.com/software/sa-harvest/quicktrain.xhtml

If you're doing any of this on a shared system, my one bit of advice is to set up the cron to use 'batch' and 'nice'.

-faisal
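The 'batch' and 'nice' advice might look like this in a crontab (a sketch; the script path and schedule are assumptions -- adjust to your setup):

```
# Run the nightly training at 3:30am. Submitting through batch(1) delays
# the job until system load is low; nice keeps it from competing with
# interactive work once it does run.
30 3 * * *  echo "nice -n 19 $HOME/bin/sa-harvest" | batch
```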
Re: Any drawbacks of cron-scheduled bayesian learning?
On Apr 25, 2007, at 4:30 PM, Arik Raffael Funke wrote:

I am now probably venturing off-topic on my own thread but the point you make is interesting: You train only misfiled messages. What about new but correctly filed messages? You _never_ train on them? Given that bayes is a statistical method, is it really sufficient to only train on the mis-files?

the nightly cron job trained against the spam folder and a subset of the read folders likely to have spam in them (archive, recent working folders, etc.). i'd periodically retrain across the entire mail tree. the retraining only for specific misfiled messages handles both spam and ham. retraining only on misfiles is not as accurate as training on all mail, but is a lot lighter weight, so i can run it every 5 minutes instead of every night.

The proportional spam/ham weight of keywords would in this case not be adjusted in the database if/when they change in your mail traffic, or? Are you not encountering a higher number of mis-files compared to your previous learning practise?

the number of misfiles i get is so low that it's hard to tell if there's a difference. i periodically get floods of new false-negatives, but those typically correct after the first few are retrained. when retraining across the entire mail spool the problem usually corrected itself after the first night.

-faisal
Re: Don't want hatfield.com to send mail to mccoy.com - can /etc/mail/spamassassin/local.cf help?
Out of curiosity, is there a reason to do this in SA vs. at the MTA, firewall, etc?

-faisal
-used to work with a Hatfield and is friends with a McCoy

On Apr 24, 2007, at 12:33 AM, John Schmerold wrote:

SA is protecting 20 domains from evil. I want to keep 2 domains from communicating with one another. I believe local.cf can help resolve this for me, if I can figure out how to do:

score    LOCAL__H_M 50.00
header   LOCAL__H_M From =~ /hatfield\.com/i .and. header LOCAL__H_M To =~ /mccoy\.com/i
describe LOCAL__H_M Hatfield to McCoy
score    LOCAL__M_H 50.00
header   LOCAL__M_H From =~ /mccoy\.com/i .and. header LOCAL__M_H To =~ /hatfield\.com/i
describe LOCAL__M_H McCoy to Hatfield

So, this newbie has 2 questions:
1. Can this be done?
2. How to do it? I suspect the answer lies in the stack of regex information I've been staring at, but can't figure out.

TIA
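For what it's worth, SpamAssassin's config language has no '.and.' operator; the supported way to AND two header tests is a meta rule built from sub-rules. A local.cf sketch (rule names adapted from the question; sub-rules whose names start with a double underscore carry no score of their own):

```
header   __L_FROM_HATFIELD  From =~ /\@hatfield\.com/i
header   __L_TO_MCCOY       To =~ /\@mccoy\.com/i
meta     LOCAL_H_M          (__L_FROM_HATFIELD && __L_TO_MCCOY)
describe LOCAL_H_M          Hatfield to McCoy
score    LOCAL_H_M          50.0

header   __L_FROM_MCCOY     From =~ /\@mccoy\.com/i
header   __L_TO_HATFIELD    To =~ /\@hatfield\.com/i
meta     LOCAL_M_H          (__L_FROM_MCCOY && __L_TO_HATFIELD)
describe LOCAL_M_H          McCoy to Hatfield
score    LOCAL_M_H          50.0
```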
Re: sa-learn: have i seen this before?
On Apr 21, 2007, at 11:49 PM, Matt Kettler wrote:

Try adding a -D to sa-learn. If it's lock contention, you should see a bunch of messages about it waiting for the lock.

i did this earlier (after some mucking about with file tracing tools) and found that most of the wait seems to be in two places:

- loading user_prefs (which in my case has some auto-generated portions that could be substantially trimmed)
- one of the rules files (i'm trying to isolate which rule)

to the point of filtering, i do wonder if i can solve my "what to scan" problem by setting my own custom imap flags. since maildir records that information in the filename, this would allow me to easily call sa-learn once on all files matching that pattern.

-faisal
Re: sa-learn: have i seen this before?
On Apr 22, 2007, at 9:05 AM, Matt Kettler wrote: You don't have sa-blacklist, do you? no, but i had a whitelist with almost 5,000 entries -faisal
Re: sa-learn: have i seen this before?
On Apr 21, 2007, at 1:30 AM, Matt Kettler wrote:

2. which way do i learn it.

Erm, if it's spam, learn it as spam. If it's nonspam, learn it as nonspam. What's the problem here?

i have a program looking through for untrained messages and deciding what to train them as. alternatively, i have a program looking through and training all messages in a folder, deciding how to train on the fly.

What you want to do would reduce efficiency by making SA take two passes. In the first pass, it parses all the headers of every message, and tells you which ones it's learned or not.

a couple issues here:

1. the headers do not necessarily tell the truth -- if you train on a message after it arrives then the headers will still say the same as written at delivery time. and, as you point out, parsing the headers is an ugly way to do it.

2. depending on how fast the "have i trained this message before" lookup is, this could still beat training every message. as it is i'm looking at 19-20 seconds to [not] retrain a previously trained message on a fairly unloaded box. i'm guessing i could write a wrapper script around the sa-learn functions to keep a separate db of what has and hasn't been trained.

Then you use some external sorter Then you call SA to learn the messages that weren't learned. It now has to re-parse the headers from scratch, then parse/tokenize and learn the body.

why call a separate sorter? do something more like:

for my $message (@messages) {
    learn($message) unless already_learned($message);
}

-faisal
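The wrapper-script idea mentioned above -- keep a separate db of what has and hasn't been trained, keyed by Message-ID -- can be sketched like this (my own illustration, not anything shipped with SpamAssassin; the sa-learn invocation is left commented out so the sketch stands alone):

```python
# Sketch: only invoke sa-learn for messages whose Message-ID we haven't
# recorded yet, tracking "seen" state in a small dbm database.
import dbm
import email
import subprocess  # used by the real sa-learn call, commented out below

def train_if_new(path, as_spam, db_path="trained.db"):
    """Train one message unless its Message-ID is already recorded.

    Returns True if the message was (or would be) trained."""
    with open(path, "rb") as fh:
        msg = email.message_from_binary_file(fh)
    msgid = msg.get("Message-ID")
    if msgid is None:
        return False  # no stable key; skip rather than risk double-training
    with dbm.open(db_path, "c") as db:
        if msgid in db:
            return False  # already trained; skip the expensive sa-learn run
        # subprocess.run(["sa-learn", "--spam" if as_spam else "--ham", path],
        #                check=True)
        db[msgid] = b"spam" if as_spam else b"ham"
    return True
```

Recording which *way* a message was trained (the db value) is what makes the later "was it trained the same way i want it trained" check possible.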
Re: sa-learn: have i seen this before?
On Apr 21, 2007, at 11:23 AM, Matt Kettler wrote:

Ok, but how does knowing what SA learned it as help? It doesn't. Figure out what to train as, and train.

it helps in that i can automatically iterate over some or all of my mail folders on a regular basis, selectively retraining *if*:

a) the message has already been trained,
b) it's been trained the same way that i want it trained in the end, and
c) the cost of determining it's already been trained is substantially lower than the cost of just training it.

right now i do this manually: i have a "retrain as spam" folder and a "retrain as ham" folder and i hit them each every 5 minutes. i'd rather get rid of the folders, which lets me then use the client-side junk mail systems to flag messages as spam or ham, which sa would then pick up to retrain.

I never suggested that you should parse the headers. sa-learn does this to extract the message-id and compare that to the bayes_seen database. sa-learn *MUST* do this much to determine if the message has already been learned. There's NO other way.

even so, it should be possible to parse the message, extract the message-id, and compare the results in far less than 20 seconds.

That's a separate sorter. sa-learn already does this internally, so *any* code on your part is a waste.

if sa-learn already does this internally then it's doing it rather inefficiently. 20 seconds to pull a message id and compare it against the db (berkeleydb, fwiw)?

-faisal
Re: sa-learn: have i seen this before?
On Apr 21, 2007, at 1:34 PM, Matt Kettler wrote:

time sa-learn on it, and feed it the WHOLE DIRECTORY at once. Do not iterate messages, do not specify filenames, just give sa-learn the name of the directory.

Doing this on a directory with 6 messages takes about a second more than doing it for a single message, which is promising. That said, it isn't noticeably faster (tenths of a second) the second time (timed using /usr/bin/time).

If it's not, and the first pass did learn messages, you've got a problem.

That's promising (I have a problem, but problems can be found).

The other possibility is you've got write-lock contention. You can avoid a lot of this by using the bayes_learn_to_journal option, at the expense of causing your training to not take effect until the next sync.

For batch scripts I'm pretty comfortable doing everything with --no-sync, with a --sync at the end.

-faisal
Re: sa-learn: have i seen this before?
On Apr 21, 2007, at 2:11 PM, Matt Kettler wrote:

Ok, I just did some testing. Something is *VERY* wrong with your system. Are you running out of ram and swapping?

Hrm. top currently reports 123MB free (out of 2GB physical), with some swapping. sa-learn has a 62MB RSS. This is a shared system with a bunch of activity, so I can't easily isolate all issues, but I'll keep digging.

I repeated the test almost (but not completely) as you described and it still took 20 seconds. File lock contention sounds plausible.

-faisal
Re: Fighting ham
On Apr 18, 2007, at 4:26 PM, Robert Fitzpatrick wrote:

Thanks, we are rebuilding bayes and now have it in SQL with auto-learn on, is that good? It now has over 25K spam, but just 180 ham.

You *really* want to train with more ham than spam.

-faisal
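One way to keep an eye on the spam/ham balance is to read the nspam/nham counters out of `sa-learn --dump magic`. A sketch (the "non-token data: nspam" line format assumed here matches common SpamAssassin 3.x output; verify against your own dump):

```python
# Sketch: parse the nspam/nham counters from `sa-learn --dump magic`
# output so a cron job can warn when the corpus gets lopsided.
import re

def corpus_counts(dump_magic_output):
    """Return (nspam, nham) parsed from `sa-learn --dump magic` output."""
    counts = {}
    for line in dump_magic_output.splitlines():
        m = re.match(r"\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (nspam|nham)",
                     line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return counts.get("nspam", 0), counts.get("nham", 0)
```

For the numbers quoted above (25K spam, 180 ham), a check like `nham < nspam` would flag the database as badly skewed toward spam.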
sa-learn: have i seen this before?
Is there an easy way to tell if sa-learn has learned a given message before? -faisal
Re: sa-learn: have i seen this before?
On Apr 16, 2007, at 9:34 PM, Matt Kettler wrote:

Try to learn it, if it comes back with something to the effect of: learned from 0 messages, processed 1.. then it's already been learned.

this seems to be the common suggestion. it has a couple drawbacks, as i see it:

1. it's relatively cpu-intensive if i want to do it all the time (e.g. scan my spam folder to learn only the messages which haven't already been learned)

2. which way do i learn it.

to step back a bit, my final goal is to be able to figure out which messages in a folder haven't been learned, and learn only those. in the ideal situation i can also figure out (ahead of time) whether a learned message was learned as ham or spam. this may be semi-impossible. on the other hand, what can i learn from the headers? e.g. it looks like autolearn=[something] will tell me about the autolearner, but is there anything for manual learns?

where i'm going with all this: i can run a cron job to learn the contents of different mailboxes on a regular basis. what i do now is have a TrainSpam and TrainHam mailbox, and when something gets misfiled (in Spam or any ham folder) i just move it in there. every 5 minutes a cron job goes through and scans things appropriately. http://www.faisal.com/software/sa-harvest/quicktrain.html

first, i'd like to be able to do that within the mailboxes rather than using special mailboxes. second, i'd like to be able to key off junk mail flags set by the client (thunderbird, apple mail). i'm using dovecot, so it's a fairly simple matter of parsing Maildir filenames, but to do it right i need to combine that knowledge with what spamassassin thinks. i might just go write a dovecot plugin to do this in real-time, but i'm not feeling the motivation to break the mail server with a misplaced pointer.

-faisal
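The Maildir-filename parsing mentioned above is straightforward: flags live after the ":2," info separator. Standard flags are uppercase letters (S = seen, etc.); Dovecot maps IMAP keywords such as Junk to lowercase letters, with the assignment recorded per-folder in a dovecot-keywords file. A sketch (the keyword map below is a made-up example of such a mapping):

```python
# Sketch: extract flag characters from a Maildir filename's ":2," info
# section and check them against a (hypothetical) Dovecot keyword map.
def maildir_flags(filename):
    """Return the flag characters from a Maildir filename, or ''."""
    base = filename.rsplit("/", 1)[-1]  # strip any directory part
    if ":2," not in base:
        return ""
    return base.rsplit(":2,", 1)[-1]

# Hypothetical letter->keyword map, as read from this folder's
# dovecot-keywords file (the real mapping varies per folder):
KEYWORDS = {"a": "Junk", "b": "NonJunk"}

def is_flagged_junk(filename):
    """True if any flag letter maps to the Junk keyword."""
    return any(KEYWORDS.get(c) == "Junk" for c in maildir_flags(filename))
```

A cron job could walk a folder's cur directory, feed Junk-flagged files to sa-learn --spam, and feed seen-but-not-Junk files to sa-learn --ham.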
Re: SpamAssassin Coach - Outlook/Thunderbird Plugin
this is really promising, but i think it sort of points out some deficiencies in the current state of handling sa things from the client side. i'm wondering if it would make sense to create a separate learner server that deals with this stuff, with this server calling the training routines.

on the other hand, i wonder if the real solution is something like imap protocol extensions for:

- learn as spam
- learn as ham
- learn as whitelisted address
- learn as blacklisted address
- (anything else?)

...and count on server vendors to integrate with spamassassin. this would have the advantage of not having to write a server (or deal with ssl, security, etc.), and not requiring users to configure their connection to the spam server, but it also puts a dependency on server authors out there.

-faisal
Re: test my bleeding edge broken code. with your finger!
On Aug 28, 2006, at 3:52 AM, Loren Wilton wrote:

<img src="http://alaska.aif1.com/pr.asp?src=1155591075" width="1" height="1" border="0"/>
<img src="http://images.ed4.net/images/htdocs/alaska/head_left.gif" width="436" height="78">
<a href="http://alaska.aif1.com/pr.asp?src=1155591075"><img src="http://images.ed4.net/images/htdocs/alaska/060816/Mexico-Sweep-graphic.jpg" border="0" width="161" height="110"/></a></td>
<a href="http://alaska.aif1.com/pr.asp?src=">apply today</a>, then start

These things normally score about 25 points.

none of these should trip phisher's rule -- it should only trip on text that looks like a domain name. (this does leave the door open to a graphic that says paypal.com in a typeface that matches the rest of the message.)

the only apparently legitimate mail i've received that masks a url is from the aaa, and they seem to switch vendors every couple of months, so i'm not inclined to trust anything from them (if i start to do so, then when do i know what's real and what's fake?).

-faisal
Fwd: test my bleeding edge broken code. with your finger!
[reordering mail slightly]

On Aug 26, 2006, at 3:07 PM, [EMAIL PROTECTED] wrote:

I have suggested something like this a few times, and used to hear concerns about valid links not necessarily the same. These can be put into two groups: one would have links to a related server, like cgi.bigcompany.com

I do some normalization on domain names, but it's pretty limited. That said, it could be extended (e.g. right now it drops a leading www., but it could also drop cgi., etc.). This won't completely take care of the problem, but it might improve it somewhat.

For the first case I would like to suggest: if the names do not match, check if the IPs are in the same /24

I'm not sure how effective this would be - I've seen a number of cases of servers being across wildly disparate subnets. Does anyone have a sense of the real-world distribution?

The other one is totally unrelated, say a marketing company has set up a redirector to count how often each link is visited. Well, for the other one ... I would not want to read these mails even if they are not phish

Agreed -- there are attacks that rely on similar redirection mechanisms, and there's a certain level of "if you insist on acting like a scammer I'm going to treat you like one."

An additional comment about phish: I get a lot of stuff that does not even make it to SA scanning because I do not appear as a recipient. One can probably safely assume that paypal, or any bank, would not send a verification message to 100 recipients at once with a bcc list ... could serve as a meta rule to triple the score for phish

I'll have to play with that.

-faisal
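The domain normalization discussed above (drop a leading www., extend to cgi. and friends) can be sketched in a few lines. A minimal illustration; the prefix list and function names are my own, and real-world normalization (public suffix lists, etc.) is messier:

```python
# Sketch: strip a known set of host-prefix labels before comparing
# domains, so cgi.bigcompany.com and www.bigcompany.com count as a match.
STRIP_LABELS = {"www", "cgi", "web", "secure"}  # assumed prefix list

def normalize_host(host):
    """Drop leading labels like 'www.' from a hostname."""
    labels = host.lower().rstrip(".").split(".")
    # never strip down past two labels, so example.com stays intact
    while len(labels) > 2 and labels[0] in STRIP_LABELS:
        labels.pop(0)
    return ".".join(labels)

def same_site(a, b):
    """True if two hostnames match after normalization."""
    return normalize_host(a) == normalize_host(b)
```

This handles the "related server" group of false positives; the redirector group can't be fixed by name normalization at all, which is why it's argued separately above.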
test my bleeding edge broken code. with your finger!
two bits of sa related code i've written. neither of them is what i'd particularly call polished, but if you feel like firing them up, i'd love to hear your feedback:

Phisher: http://www.faisal.com/software/phisher/

This is a plugin that does nothing more complicated than check for the case of something like <a href="http://scam.ru">www.paypal.com</a>. I've run it on and off since August of last year, although most of the time was not after 3.1.1 (which is why I only claim it works on 3.1). I don't have a suggested score for it (would love feedback there). I ran it at .1 mostly to see how much it triggered and fp'd (not much, as it turns out. I know this has been a problem in the past, so I'm wondering if the normalization code helps there, or I've just been lucky). As noted, this has some rewrite bits coming when I get some time.

sa-harvest: http://www.faisal.com/software/sa-harvest/

This is a script that does several obvious things and one possibly not-so-obvious thing:

- You configure it, telling it what your spam and ham folders are, and after that it will automatically train whenever you invoke it, without having to explicitly configure folders to scan (I find this useful for cron jobs, and less typing when I'm doing the same obvious thing every couple days).

- It also scans your ham boxes and automatically rebuilds your whitelist based on the contents of presumed good folders (this will mangle your user_prefs. READ THE DOCS ON HOW THIS WORKS SO YOU DON'T LOSE OTHER SETTINGS.)

I've been using variants of this script since about a week after the first SA with training came out. I finally generalized it a little last month, and have been running it nightly via cron ever since.

Feedback would be greatly appreciated.

-faisal
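The check Phisher performs can be sketched like this (an illustration of the idea only, not the plugin's actual code; the regexes and helper names are mine, and real HTML parsing is considerably messier):

```python
# Sketch: flag an <a> tag whose visible text looks like a domain name
# but whose href actually points at a different host.
import re
from urllib.parse import urlparse

ANCHOR = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.I | re.S)
DOMAINISH = re.compile(r'^(?:https?://)?(?:www\.)?([a-z0-9.-]+\.[a-z]{2,})/?$',
                       re.I)

def _strip_www(host):
    return host[4:] if host.startswith("www.") else host

def masked_links(html):
    """Return (href_host, text_host) pairs where the link text is a
    domain that doesn't match the actual target host."""
    hits = []
    for href, text in ANCHOR.findall(html):
        m = DOMAINISH.match(text.strip())
        if not m:
            continue  # text isn't a bare domain; not this rule's concern
        text_host = m.group(1).lower()
        href_host = _strip_www((urlparse(href).hostname or "").lower())
        if text_host != href_host:
            hits.append((href_host, text_host))
    return hits
```

Because only domain-looking link text triggers the check, the image-only samples quoted in the other message pass untouched, exactly as described.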
Re: Why is there so much hype behind Image spam
On Jul 16, 2006, at 1:00 AM, John Andersen wrote:

And yet, in spite of your statistics, there is more spam than ever. Some estimates are that in excess of 95% of all email is spam.

I'm unconvinced of this -- my spam load has leveled off at 200 per day. On the order of 1 per week makes it into my inbox. The latter is due to SA plus some additional code for better white-listing (which I'm planning to release as soon as I solve a couple issues). The former is entirely out of my control (and note that my email address is all over the place).

I suspect one or both of the following are true:

1. the volume of spam is linearly related to the number of email addresses out there. the volume of ham is not: the amount of ham is related to the amount of ham *sent*, which follows an exponential distribution. an increase in the number of users does not result in a proportional increase in the amount of ham, but does result in a proportional increase in the amount of spam.

2. spam will (or may have already) hit an economic equilibrium. you could look at this as a supply and demand problem: spam demand is the number of people who are actually willing to buy things they get offers for in spam. spam supply is the number of sellers who are willing to sell things via spam. sending 200m messages still costs money (albeit very little of it), and sending 800m messages to get the same number of buys doesn't make sense for the spammer. whether or not we have SA, Razor, etc., there comes a point where it isn't worth it to spammers at large to send additional spam. spam filtering increases the average cost of a sale to the seller, so the marginal revenue of a spam run has to be higher for the mailing to be worth it.

-faisal
tsa vs., well, us
TSA wants people to turn off their spam filters: http://www.schneier.com/blog/archives/2005/08/tsa_and_spam.html i'd suggest some sort of tsa whitelist rule, but i'm guessing if that happens i'll soon start seeing mail from the tsa's department of herbal viagra. -faisal
Re: annoying changes in 3.0
I think changing the config options from release to release makes it very hard for sites larger than a couple users to upgrade, because you have to rely on every user updating their config when you upgrade. The UPGRADE file helps with this somewhat, but it's more geared for admins than users. -faisal
Re: ver 3.0 opinions
Looks like somebody didn't read the UPGRADE doc... Due to the database format change, you will want to do something like this when upgrading:

I read it and followed the directions, didn't see any problem for a couple of days, and then suddenly the spam level jumped substantially. Upon further investigation it looked like the bayes dbs had gotten corrupted, and I was seeing low tok counts like the original poster had reported. Another friend saw this happen when he first upgraded, although I can't speak for his direction-following on the upgrade.

If I see more of this I'll try and isolate it and file an appropriately detailed bug report, but I don't think it's necessarily accurate to immediately write this off as user error.

-faisal