Re: Massive Spam Attack?

2007-05-13 Thread Faisal N Jawdat
Given the level of the traffic, you might look at implementing
something like Deny Spammers at the /24 level (rather than the host
level).


https://sourceforge.net/projects/deny-spammers/
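
For example, with iptables that would be one rule per range rather than
per host (this is generic, not specific to how Deny Spammers does it,
and the range is just the first one from the list below):

# drop SMTP from an entire /24 instead of chasing individual hosts
iptables -A INPUT -p tcp --dport 25 -s 66.96.245.0/24 -j DROP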

-faisal

On May 13, 2007, at 12:15 AM, Jason Frisvold wrote:


On 5/12/07, Jason Frisvold [EMAIL PROTECTED] wrote:

I installed the botnet plugin today, but it's not going to help
anyway..  The IPs these are coming from resolve to a variety of
different hostnames, all without triggering botnet at all.


Here's a sample of the hits I'm getting ...  As you can see, it's a
bunch of different IPs in various ranges..  I've decided to just block
the ranges at this point..  I have no idea if there's anything legit
in there, but I'll take that risk...

baseball142.pamwheeled.com (66.96.245.142)
baseball15.hammersmoky.com (66.96.245.15)
baseball167.pamwheeled.com (66.96.245.167)
baseball168.pamwheeled.com (66.96.245.168)
baseball184.itlivestock.com (66.96.245.184)
baseball20.hammersmoky.com (66.96.245.20)
baseball210.itlivestock.com (66.96.245.210)
baseball237.burmesetow.com (66.96.245.237)
baseball247.burmesetow.com (66.96.245.247)
baseball31.hammersmoky.com (66.96.245.31)
baseball6.hammersmoky.com (66.96.245.6)
baseball75.platenormal.com (66.96.245.75)
crowflies110.yentropical.com (65.111.26.110)
crowflies131.yentropical.com (65.111.26.131)
crowflies15.mowcraving.com (65.111.26.15)
crowflies168.ropepin.com (65.111.26.168)
crowflies176.ropepin.com (65.111.26.176)
crowflies186.ropepin.com (65.111.26.186)
crowflies19.mowcraving.com (65.111.26.19)
crowflies33.mowcraving.com (65.111.26.33)
crowflies42.mowcraving.com (65.111.26.42)
crowflies57.beforefor.com (65.111.26.57)
crowflies63.beforefor.com (65.111.26.63)
lampshade144.acidicbee.com (66.240.249.144)
lampshade153.acidicbee.com (66.240.249.153)
lampshade161.acidicbee.com (66.240.249.161)
lampshade183.acidicbee.com (66.240.249.183)
lampshade183.acidicbee.com (66.240.249.183)
lampshade213.acidicbee.com (66.240.249.213)
lampshade231.acidicbee.com (66.240.249.231)
lampshade231.acidicbee.com (66.240.249.231)
lampshade239.acidicbee.com (66.240.249.239)
later112.itbobble.com (216.74.88.112)
later13.divesthow.com (216.74.88.13)
later15.divesthow.com (216.74.88.15)
later189.tarponway.com (216.74.88.189)
later20.divesthow.com (216.74.88.20)
later216.usefulget.com (216.74.88.216)
later217.usefulget.com (216.74.88.217)
later225.usefulget.com (216.74.88.225)
later250.usefulget.com (216.74.88.250)
later69.itbobble.com (216.74.88.69)
mail136.yenram.com (64.191.11.136)
mail237.todinto.com (64.191.11.237)
mail239.todinto.com (64.191.11.239)
mail250.todinto.com (64.191.11.250)
mail91.rangeat.com (64.191.11.91)
movie113.fencingnow.com (216.10.25.113)
movie119.fencingnow.com (216.10.25.119)
movie120.fencingnow.com (216.10.25.120)
movie126.fencingnow.com (216.10.25.126)
movie166.measleit.com (216.10.25.166)
movie184.measleit.com (216.10.25.184)
movie207.fosteris.com (216.10.25.207)
movie78.fencingnow.com (216.10.25.78)
mustang214.pugto.com (72.37.196.214)
mustang242.pugto.com (72.37.196.242)
omega172.dressyoung.com (66.197.254.172)
omega199.dressyoung.com (66.197.254.199)
omega225.dressyoung.com (66.197.254.225)
omega237.dressyoung.com (66.197.254.237)
omega86.byknife.com (66.197.254.86)
pick17.heatscanna.com (64.192.26.17)
pick182.runninghit.com (64.192.26.182)
rainy206.grimacehot.com (66.96.252.206)
rush100.standbot.com (66.96.255.100)
rush101.standbot.com (66.96.255.101)
rush103.standbot.com (66.96.255.103)
rush131.ifweight.com (66.96.255.131)
rush188.whobeak.com (66.96.255.188)
rush206.whobeak.com (66.96.255.206)
rush208.whenpile.com (66.96.255.208)
rush226.whenpile.com (66.96.255.226)
rush232.whenpile.com (66.96.255.232)
rush236.whenpile.com (66.96.255.236)
rush251.whenpile.com (66.96.255.251)
source238.wearisen.com (216.74.120.238)
source244.wearisen.com (216.74.120.244)
teaching200.wordssort.com (64.192.28.200)
teaching33.camelcoat.com (64.192.28.33)
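
(One way to boil a list like the one above down to the /24s to block,
assuming the hostname (IP) lines are saved in a file -- "hits.txt" is
just a placeholder name:)

# pull the IP out of the parentheses, keep the first three octets,
# and print each unique /24
awk -F'[()]' '{print $2}' hits.txt | cut -d. -f1-3 | sort -u | sed 's/$/.0\/24/'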

--
Jason 'XenoPhage' Frisvold
[EMAIL PROTECTED]
http://blog.godshell.com




Re: Massive Spam Attack?

2007-05-11 Thread Faisal N Jawdat

On May 11, 2007, at 10:54 PM, Jason Frisvold wrote:

It appears that each mail is sent by a unique IP, so it doesn't look
like a simple firewall rule will stop this.


Is every single message coming from a unique IP, or is it just that  
they're widely distributed?


-faisal



Re: Any drawbacks of cron-scheduled bayesian learning?

2007-04-25 Thread Faisal N Jawdat

On Apr 25, 2007, at 5:49 AM, Arik Raffael Funke wrote:
I was wondering if it has any negative effects on my Bayes database  
if I regularly learn all spam/ham messages via a cron job.


Sa-learn skips already learned messages. Am I thus right to assume  
that apart from the relatively high CPU load there are no  
drawbacks? Or should I keep a separate folder for new spam/ham?


I did this for a while and didn't find any problems.

I'm using Maildir, and I only trained on the cur folders, not the new  
folders.  In theory this would prevent me from training on something  
that had come in mis-filed (so long as I remembered to quit my mail  
client at night).
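
In its simplest form the nightly job is just a couple of sa-learn calls
(the folder names here are placeholders for wherever your spam and ham
actually live):

# point sa-learn at the cur/ directories only, so anything still
# sitting in new/ is skipped
sa-learn --spam ~/Maildir/.Spam/cur
sa-learn --ham  ~/Maildir/cur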


See here for details and a script to do this:

http://www.faisal.com/software/sa-harvest/

Note that this script will also attempt to rebuild your whitelist  
(all the code after the 'sa-learn --dump magic').  This has some  
downsides, and turns out to be less useful with modern Spamassassin,  
so I'm reworking the script to break out the whitelist code into a  
separate script.


That said, I keep a rolling 1 month corpus of spam, so it's easy to  
retrain when I need to.  I stopped doing full retrains on cron, and  
at this point I only retrain on messages that were misfiled.  See:


http://www.faisal.com/software/sa-harvest/quicktrain.xhtml

If you're doing any of this on a shared system, my one bit of advice  
is to set up the cron to use 'batch' and 'nice'.
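
Something along these lines, for example (the script path is made up):

# queue the nightly run via batch so it only starts when load is low,
# and nice it so it doesn't compete with interactive users
0 3 * * *  echo "nice -n 19 $HOME/bin/sa-harvest" | batch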


-faisal




Re: Any drawbacks of cron-scheduled bayesian learning?

2007-04-25 Thread Faisal N Jawdat

On Apr 25, 2007, at 4:30 PM, Arik Raffael Funke wrote:
I am now probably venturing off-topic on my own thread but the  
point you make is interesting: You train only misfiled messages.  
What about new but correctly filed messages? You _never_ train on  
them?
Given that bayes is a statistical method, is it really sufficient  
to only train on the mis-files?


the nightly cron job trained against the spam folder and a subset of  
the read folders likely to have spam in them (archive, recent working  
folders, etc.).  i'd periodically retrain across the entire mail  
tree.  the retraining only for specific misfiled messages handles  
both spam and ham.


retraining only on misfiles is not as accurate as training on all  
mail, but is a lot lighter weight, so i can run it every 5 minutes  
instead of every night.


The proportional spam/ham weight of keywords would in this case not  
be adjusted in the database if/when they change in your mail  
traffic, or? Are you not encountering a higher number of mis-files  
compared to your previous learning practise?


the number of misfiles i get is so low that it's hard to tell if
there's a difference.  i periodically get floods of new
false-negatives, but those typically correct after the first few are
retrained.  when retraining across the entire mail spool the problem
usually corrected itself after the first night.


-faisal



Re: Don't want hatfield.com to send mail to mccoy.com - can /etc/mail/spamassassin/local.cf help?

2007-04-24 Thread Faisal N Jawdat
Out of curiosity, is there a reason to do this in SA vs. at the MTA,  
firewall, etc?


-faisal
-used to work with a Hatfield and is friends with a McCoy

On Apr 24, 2007, at 12:33 AM, John Schmerold wrote:


SA is protecting 20 domains from evil, I want to keep 2 domains from
communicating with one another, I believe local.cf can help resolve
this for me, if I can figure out how to do:

scoreLOCAL__H_M  50.00
header   LOCAL__H_M From =~ /hatfield\.com/i .and.
header   LOCAL__H_M To =~ /mccoy\.com/i
describe LOCAL__H_M Hatfield to McCoy

scoreLOCAL__M_H  50.00
header   LOCAL__M_H From =~ /mccoy\.com/i .and.
header   LOCAL__M_H To =~ /hatfield\.com/i
describe LOCAL__M_H McCoy to Hatfield

So, this newbie has 2 questions:
1. Can this be done
2. How to do it - I suspect the answer lies in the stack of regex
information I've been staring at, but can't figure out

TIA
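
(As an aside: SA header rules can't be chained with .and., but a meta
rule over two zero-scoring sub-rules expresses the same thing.  A rough
sketch -- the rule names are only illustrative, and you'd add the
mirror-image pair for McCoy-to-Hatfield:)

header   __LOCAL_H_FROM  From =~ /hatfield\.com/i
header   __LOCAL_M_TO    To =~ /mccoy\.com/i
meta     LOCAL_H_M       (__LOCAL_H_FROM && __LOCAL_M_TO)
score    LOCAL_H_M       50.0
describe LOCAL_H_M       Hatfield to McCoy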




Re: sa-learn: have i seen this before?

2007-04-22 Thread Faisal N Jawdat

On Apr 21, 2007, at 11:49 PM, Matt Kettler wrote:
Try adding a -D to sa-learn.. if it's lock contention, you should  
see a bunch of messages about it waiting for the lock.


i did this earlier (after some mucking about with file tracing tools)  
and found that most of the wait seems to be in two places:


- loading user_prefs (which in my case has some auto-generated  
portions that could be substantially trimmed)


- one of the rules files (i'm trying to isolate what rule)

to the point of filtering, i do wonder if i can solve my "what to
scan" problem by setting my own custom imap flags.  since maildir
records that information in the filename this would allow me to  
easily call sa-learn once on all files matching that pattern.
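
a rough sketch of that, assuming dovecot's keyword-to-letter mapping
(the mapping lives in each mailbox's dovecot-keywords file, so the 'a'
below is only a placeholder, as is the folder path):

# dovecot stores custom IMAP keywords as lowercase letters in the
# ":2,..." info suffix of each Maildir filename
find ~/Maildir/.Spam/cur -name '*:2,*a*' -print0 | xargs -0 sa-learn --spam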


-faisal



Re: sa-learn: have i seen this before?

2007-04-22 Thread Faisal N Jawdat

On Apr 22, 2007, at 9:05 AM, Matt Kettler wrote:

You don't have sa-blacklist, do you?


no, but i had a whitelist with almost 5,000 entries

-faisal




Re: sa-learn: have i seen this before?

2007-04-21 Thread Faisal N Jawdat

On Apr 21, 2007, at 1:30 AM, Matt Kettler wrote:

2.  which way do i learn it.


Erm, if it's spam, learn it as spam.. if it's nonspam, learn it as  
nonspam. What's the problem here?


i have a program looking through for untrained messages and deciding  
what to train them as.  alternatively, i have a program looking  
through and training all messages in a folder, deciding how to train  
on the fly.


What you want to do would reduce efficiency by making SA take two  
passes. In the first pass, it parses all the headers of every  
message, and tells you which ones it's learned or not.


a couple issues here:

1.  the headers do not necessarily tell the truth -- if you train on  
a message after it arrives then the headers will still say the same  
as written at delivery time.  and, as you point out, parsing the  
headers is an ugly way to do it.


2.  depending on how fast the "have i trained this message before"
lookup is, this could still beat training every message.  as it is
i'm looking at 19-20 seconds to [not] retrain a previously trained
message on a fairly unloaded box.


i guess i could write a wrapper script around the sa-learn
functions to keep a separate db of what has and hasn't been trained.
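
a minimal sketch of that wrapper, assuming formail is available to pull
the message-id ($msg is the path to one message, and ~/.sa-trained is an
invented cache file):

# skip anything whose message-id we've already trained on
msgid=$(formail -x Message-ID: < "$msg" | tr -d '[:space:]')
grep -qxF "$msgid" ~/.sa-trained 2>/dev/null || {
    sa-learn --spam "$msg" && echo "$msgid" >> ~/.sa-trained
}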



Then you use some external sorter.
Then you call SA to learn the messages that weren't learned. It now has
to re-parse the headers from scratch, then parse/tokenize and learn the
body.


why call a separate sorter?  do something more like:

for my $message (@messages) {
  learn($message) unless (already_learned($message))
}

-faisal





Re: sa-learn: have i seen this before?

2007-04-21 Thread Faisal N Jawdat

On Apr 21, 2007, at 11:23 AM, Matt Kettler wrote:

Ok, but how does knowing what SA learned it as help? It doesn't.

Figure out what to train as, and train.


it helps in that i can automatically iterate over some or all of my
mail folders on a regular basis, selectively skipping the retraining *if*:


a) the message has already been trained
b) it's been trained the same way that i want it trained in the end
and
c) the cost of determining it's already been trained is substantially  
lower than the cost of just training it


right now i do this manually:  i have a "retrain as spam" folder and
a "retrain as ham" folder and i hit them each every 5 minutes.  i'd
rather get rid of the folders, which would then let me use the
client-side junk mail systems to flag messages as spam or ham, which
sa would then pick up to retrain.


I never suggested that you should parse the headers. sa-learn does this
to extract the message-id and compare that to the bayes_seen database.
sa-learn *MUST* do this much to determine if the message has already
been learned. There's NO other way.


even so, it should be possible to parse the message, extract the
message-id, and compare the results in far less than 20 seconds.


That's a separate sorter. sa-learn already does this internally,  
so *any* code on your part is a waste.


if sa-learn already does this internally then it's doing it rather  
inefficiently.  20 seconds to pull a message id and compare it  
against the db (berkeleydb, fwiw)?


-faisal



Re: sa-learn: have i seen this before?

2007-04-21 Thread Faisal N Jawdat

On Apr 21, 2007, at 1:34 PM, Matt Kettler wrote:

time sa-learn on it, and feed it the WHOLE DIRECTORY at once. Do not
iterate messages, do not specify filenames, just give sa-learn the name
of the directory.


Doing this on a directory with 6 messages takes about a second more  
than doing it for a single message, which is promising.  That said,  
it isn't noticeably faster (tenths of a second) the second time  
(timed using /usr/bin/time).
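
(Concretely, something like the following, where the directory name is
just an example:)

/usr/bin/time sa-learn --spam ~/Maildir/.TrainSpam/cur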


If it's not, and the first pass did learn messages, you've got a  
problem.


That's promising (I have a problem, but problems can be found).

The other possibility is you've got write-lock contention. You can avoid
a lot of this by using the bayes_learn_to_journal option, at the expense
of causing your training to not take effect until the next sync.


For batch scripts I'm pretty comfortable doing everything with
--no-sync, with a --sync at the end.
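
In other words, roughly (folder names are placeholders):

# defer journal syncing until all the learning is done
sa-learn --no-sync --spam ~/Maildir/.TrainSpam/cur
sa-learn --no-sync --ham  ~/Maildir/.TrainHam/cur
sa-learn --sync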


-faisal




Re: sa-learn: have i seen this before?

2007-04-21 Thread Faisal N Jawdat

On Apr 21, 2007, at 2:11 PM, Matt Kettler wrote:
Ok, I just did some testing. Something is *VERY* wrong with your  
system.. Are you running out of ram and swapping?


Hrm.  top currently reports 123mb free (out of 2g physical), with
some swapping.  sa-learn has a 62M RSS.  This is a shared system with a
bunch of activity, so I can't easily isolate all issues, but I'll  
keep digging.


Repeated the test almost (but not completely) as you described and it
still took 20 seconds.  File lock-contention sounds plausible.


-faisal





Re: Fighting ham

2007-04-18 Thread Faisal N Jawdat

On Apr 18, 2007, at 4:26 PM, Robert Fitzpatrick wrote:
Thanks, we are rebuilding bayes and now have it in SQL with auto-learn
on, is that good? It now has over 25K spam, but just 180 ham.


You *really* want to train with more ham than spam.

-faisal





sa-learn: have i seen this before?

2007-04-16 Thread Faisal N Jawdat
Is there an easy way to tell if sa-learn has learned a given message  
before?


-faisal



Re: sa-learn: have i seen this before?

2007-04-16 Thread Faisal N Jawdat

On Apr 16, 2007, at 9:34 PM, Matt Kettler wrote:

Try to learn it; if it comes back with something to the effect of
"learned from 0 messages, processed 1", then it's already been
learned.


this seems to be the common suggestion.

it has a couple drawbacks, as i see it:

1.  it's relatively cpu-intensive if i want to do it all the time  
(e.g. scan my spam folder to learn only the messages which haven't  
already been learned)


2.  which way do i learn it.

to step back a bit, my final goal is to be able to figure out which  
messages in a folder haven't been learned, and learn only those.  in  
the ideal situation i can also figure out (ahead of time) whether a
learned message was learned as ham or spam.


this may be semi-impossible.

on the other hand, what can i learn from the headers?

e.g. it looks like autolearn=[something] will tell me about the  
autolearner, but is there anything for manual learns?
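
(for reference, the autolearn flag shows up in the X-Spam-Status header,
roughly like this -- the score and test names here are made up:)

X-Spam-Status: Yes, score=17.2 required=5.0 tests=BAYES_99,RAZOR2_CHECK autolearn=spam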


where i'm going with all this:

i can run a cron job to learn the contents of different mailboxes on  
a regular basis.  what i do now is have a TrainSpam and TrainHam  
mailbox, and when something gets misfiled (in Spam or any ham folder)  
i just move it in there.  every 5 minutes a cron job goes through and  
scans things appropriately.
http://www.faisal.com/software/sa-harvest/quicktrain.html
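
(stripped down, the cron side is roughly the pair of entries below --
the actual quicktrain script linked above does more, and the paths are
guesses:)

*/5 * * * *  sa-learn --spam $HOME/Maildir/.TrainSpam/cur
*/5 * * * *  sa-learn --ham  $HOME/Maildir/.TrainHam/cur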


first, i'd like to be able to do that within the mailboxes rather  
than using special mailboxes.


second, i'd like to be able to key off junk mail flags set by the  
client (thunderbird, apple mail).  i'm using dovecot, so it's a  
fairly simple matter of parsing Maildir filenames, but to do it right  
i need to combine the knowledge with what spamassassin thinks.


i might just go write a dovecot plugin to do this in real-time, but  
i'm not feeling the motivation to break the mail server with a  
misplaced pointer.


-faisal



Re: SpamAssassin Coach - Outlook/Thunderbird Plugin

2006-09-01 Thread Faisal N Jawdat
this is really promising, but i think it sort of points out some  
deficiencies in the current state of handling sa things from the  
client side.


i'm wondering if it would make sense to create a separate learner  
server that deals with this stuff, with this server calling the  
training routines.


on the other hand i wonder if the real solution is something like  
imap protocol extensions for:


learn as spam
learn as ham
learn as whitelisted address
learn as blacklisted address
(anything else?)

...and count on server vendors to integrate with spamassassin.  this  
would have the advantage of not having to write a server (or deal  
with ssl, security, etc.), and not requiring users to configure their  
connection to the spam server, but it also puts a dependency on server
authors out there.


-faisal



Re: test my bleeding edge broken code. with your finger!

2006-08-28 Thread Faisal N Jawdat

On Aug 28, 2006, at 3:52 AM, Loren Wilton wrote:
<img src=3D"http://alaska.aif1.com/pr.asp?src=3D1155591075" width=3D1 height=3D1 border=3D0/>
<img src=3Dhttp://images.ed4.net/images/htdocs/alaska/head_left.gif width=3D436 height=3D78>
<a href=3D"http://alaska.aif1.com/pr.asp?src=3D1155591075"><img src=3Dhttp://images.ed4.net/images/htdocs/alaska/060816/Mexico-Sweep-graphic.jpg border=3D0 width=3D161 height=3D110></a></td>
<a href=3D"http://alaska.aif1.com/pr.asp?src=">apply today</a>, then start



These things normally score about 25 points.


none of these should trip phisher's rule -- it should only trip on  
text that looks like a domain name.  (this does leave the door open  
to a graphic that says paypal.com in a typeface that matches the  
rest of the message.)  the only apparently legitimate mail i've  
received that masks a url is from the aaa, and they seem to switch  
vendors every couple months, so i'm not inclined to trust anything  
from them (if i start to do so then how do i know what's real and
what's fake?).


-faisal



Fwd: test my bleeding edge broken code. with your finger!

2006-08-27 Thread Faisal N Jawdat

[reordering mail slightly]

On Aug 26, 2006, at 3:07 PM, [EMAIL PROTECTED] wrote:
I have suggested something like this a few times, and used to hear
concerns about valid links not necessarily being the same.
These can be put into two groups: one would have links to a  
related server, like cgi.bigcompany.com


I do some normalization on domain names, but it's pretty limited.   
That said, it could be extended (e.g. right now it drops a leading  
www., but it could also drop cgi., etc.).  This won't completely take  
care of the problem, but it might improve it somewhat.
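
(Conceptually the normalization is nothing more than stripping a known
leading label before comparing -- this is just an illustration, not the
plugin's actual code, and $host is a placeholder variable:)

# drop a leading www. or cgi. from a hostname held in $host
echo "$host" | sed -e 's/^www\.//' -e 's/^cgi\.//'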


For the first case I would like to suggest: if the names do not  
match, check if the IPs are in the same /24


I'm not sure how effective this would be - I've seen a number of  
cases of servers being across wildly disparate subnets.  Does anyone  
have a sense of the real-world distribution?


The other one is totally unrelated, say a marketing company has set  
up a redirector to count how often each link is visited.


Well, for the other one . I would not want to read these mails  
even if they are not phish


Agreed -- there are attacks that rely on similar redirection  
mechanisms and there's a certain level of if you insist on acting  
like a scammer I'm going to treat you like one.


An additional comment about phish: I get a lot of stuff that does  
not even make it to SA
scanning because I do not appear as a recipient. One can probably  
safely assume
that paypal, or any bank, would not send a verification message to
100 recipients at once with a bcc list -- that could serve as a meta
rule to triple the score for phish.


I'll have to play with that.

-faisal




test my bleeding edge broken code. with your finger!

2006-08-25 Thread Faisal N Jawdat
two bits of sa-related code i've written.  neither of them is what
i'd particularly call polished, but if you feel like firing them
up, i'd love to hear your feedback:


Phisher:
http://www.faisal.com/software/phisher/
This is a plugin that does nothing more complicated than check for  
the case of something like <a href="http://scam.ru">www.paypal.com</a>.
I've run it on and off since August of last year, although most
of the time was not after 3.1.1 (which is why I only claim it works  
on 3.1).  I don't have a suggested score for it (would love feedback  
there).  I ran it at .1 mostly to see how much it triggered and fp'd  
(not much, as it turns out.  I know this has been a problem in the  
past, so I'm wondering if the normalization code helps there, or I've  
just been lucky).  As noted, this has some rewrite bits coming when I  
get some time.



sa-harvest:
http://www.faisal.com/software/sa-harvest/
This is a script that does several obvious things and one possibly  
not-so-obvious thing:


- You configure it, telling it what your spam and ham folders are,  
and after that it will automatically train whenever you invoke it,  
without having to explicitly configure folders to scan (I find this  
useful for cron jobs, and less typing when I'm doing the same obvious  
thing every couple days).
- It also scans your ham boxes and automatically rebuilds your  
whitelist based on the contents of presumed-good folders (this will
mangle your user_prefs.  READ THE DOCS ON HOW THIS WORKS SO YOU DON'T  
LOSE OTHER SETTINGS.)


I've been using variants of this script since about a week after the  
first SA with training came out.  I finally generalized it a little  
last month, and have been running it nightly via cron ever since.



Feedback would be greatly appreciated.

-faisal



Re: Why is there so much hype behind Image spam

2006-07-16 Thread Faisal N Jawdat

On Jul 16, 2006, at 1:00 AM, John Andersen wrote:

And yet, in spite of your statistics, there is more spam than ever.
Some estimates are that in excess of 95% of all email is spam.


I'm unconvinced of this -- my spam load has leveled off at 200 per  
day.  On the order of 1 per week makes it into my inbox.  The latter  
is due to SA plus some additional code for better white-listing  
(which I'm planning to release as soon as I solve a couple issues).   
The former is entirely out of my control (and note that my email  
address is all over the place).


I suspect one or both of the following are true:

1. the volume of spam is linearly related to the number of email  
addresses out there.  the volume of ham is not:  the amount of ham is  
related to the amount of ham *sent* which follows an exponential  
distribution.  an increase in the number of users does not result in  
a proportional increase in the amount of ham, but does result in a
proportional increase in the amount of spam.


2. spam will (or may have already) hit an economic equilibrium.  you  
could look at this as a supply and demand problem:  spam demand is  
the amount of people who are actually willing to buy things they get  
offers for in spam.  spam supply is the number of sellers who are  
willing to sell things via spam.  sending 200m messages still costs  
money (albeit very little of it), and sending 800m messages to get  
the same number of buys doesn't make sense for the spammer.  whether  
or not we have SA, Razor, etc., there comes a point where it isn't  
worth it for spammers at large to send additional spam.  spam filtering
increases the average cost of a sale to the seller, so the marginal  
revenue of a spam run has to be higher for the mailing to be worth it.


-faisal



tsa vs., well, us

2005-08-12 Thread Faisal N Jawdat

TSA wants people to turn off their spam filters:

http://www.schneier.com/blog/archives/2005/08/tsa_and_spam.html

i'd suggest some sort of tsa whitelist rule, but i'm guessing if that  
happens i'll soon start seeing mail from the tsa's department of  
herbal viagra.


-faisal



Re: annoying changes in 3.0

2005-01-08 Thread Faisal N. Jawdat
I think changing the config options from release to release makes it 
very hard for sites larger than a couple users to upgrade, because you 
have to rely on every user updating their config when you upgrade.  The 
UPGRADE file helps with this somewhat, but it's more geared for 
admins than users.

-faisal


Re: ver 3.0 opinions

2004-10-31 Thread Faisal N. Jawdat
Looks like somebody didn't read the UPGRADE doc...
 Due to the database format change, you will want to do something like
  this when upgrading:
I read it and followed the directions and didn't see any problem for a 
couple days and then suddenly the spam level jumped substantially.  
Upon further investigation it looked like the bayes dbs had gotten 
corrupted and I was seeing low tok counts like the original poster had 
reported.  Another friend saw this happen when he first upgraded, 
although I can't speak for his direction-following on the upgrade.

If I see more of this I'll try and isolate it and file an appropriately 
detailed bug report, but I don't think it's necessarily accurate to 
immediately write this off as user error.

-faisal