Re: rules better than bayes?
Dallas L. Engelken wrote: -Original Message- From: Jim Maul [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 18, 2006 8:55 AM To: users@spamassassin.apache.org Subject: Re: rules better than bayes? Dallas L. Engelken wrote: -Original Message- From: Jim Maul [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 11, 2006 1:49 PM To: Chris Lear Cc: users@spamassassin.apache.org Subject: Re: rules better than bayes? Chris Lear wrote: * Jim Maul wrote (11/01/06 17:48): [...] i dont have any sa-stats.pl on my system, and i recall some confusion with different scripts named the same thing so im not sure. If you can provide me with a location to obtain the sa-stats.pl script you are talking about i'll try to give it a run when i get some time. Im running 2.64 through qmail-scanner if it matters. Here's a version of sa-stats that works. I remember having a hard time finding it, so hopefully this saves you some effort. I've edited this line: if (!defined $FILE) { $FILE='^spamd$' } # regex but it's overridable on the commandline anyway. Chris #!/usr/bin/perl # file: sa-stats.pl # date: 2005-07-27 # version: 0.9 # author: Dallas Engelken <[EMAIL PROTECTED]> # desc: SA 3.x log parser This appears to be for 3.x (the description above). Will this work for 2.64 which im still running? Is there a working version somewhere that will? Tell ya truth, I don't even know if it works on 2.64. It was created after 3.0 was released. If your SA logs to maillog, just run it and find out. If you see data, it does... It doesn't take long to test this perl script because it doesn't have any prereqs that wouldn't already be on a SA installed box. There is also http://www.rulesemporium.com/programs/sa-stats-1.0.txt for 3.1.x which supports per-domain and per-user stats... But that's just FYI. Dallas This doesnt work for 2.64 by the way. Its looking for result= and scantime= and various other things which arent in my spamd log. 
My log entries look like:

Jan 18 09:51:30 external spamd[2783]: connection from localhost [127.0.0.1] at port 39076
Jan 18 09:51:30 external spamd[16806]: processing message <[EMAIL PROTECTED] ro.us> for [EMAIL PROTECTED]:512.
Jan 18 09:51:31 external spamd[16806]: clean message (-4.9/5.0) for [EMAIL PROTECTED]:512 in 1.7 seconds, 3128 bytes.

Thanks anyway for the help,
Jim

Should be fairly simple to modify the regex to work with 2.64 and then adjust a couple values that don't apply. Is it impossible to upgrade your SA install?

Dallas

It's not impossible, but I'm in the process of setting up a new machine running new versions of everything, so I'm avoiding upgrading anything that isn't absolutely necessary. The current machine is only running RH9, so I'm starting fresh with a new server which will be running the newest version of SA. Hopefully I can still keep my old bayes DB successfully and run the stats off of that when the time comes. It's still a couple of weeks away, as I just can't find enough time to finish building this machine.

Thanks for everything,
-Jim
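For anyone wanting to try the regex change suggested above, here is a minimal sketch in Python (not the Perl of sa-stats.pl itself) of matching the 2.64-style summary line. It is built only from the single sample line quoted in this thread; the "identified spam" wording for the spam case is an assumption, and `RESULT_RE` and `parse_result` are hypothetical names, not part of any sa-stats release.

```python
import re

# Pattern for the spamd 2.64 summary line as pasted in this thread.
# ASSUMPTION: spam verdicts are logged as "identified spam"; only the
# "clean message" form actually appears in the thread.
RESULT_RE = re.compile(
    r"spamd\[(?P<pid>\d+)\]: "
    r"(?P<verdict>clean message|identified spam) "
    r"\((?P<score>-?\d+(?:\.\d+)?)/(?P<required>\d+(?:\.\d+)?)\) "
    r"for (?P<user>.+?):(?P<uid>\d+) "
    r"in (?P<seconds>\d+(?:\.\d+)?) seconds, (?P<bytes>\d+) bytes"
)


def parse_result(line):
    """Return a dict of fields for a summary line, or None for other lines."""
    match = RESULT_RE.search(line)
    if match is None:
        return None
    return {
        "is_spam": match.group("verdict") == "identified spam",
        "score": float(match.group("score")),
        "required": float(match.group("required")),
        "seconds": float(match.group("seconds")),
        "bytes": int(match.group("bytes")),
    }


SAMPLE = (
    "Jan 18 09:51:31 external spamd[16806]: clean message (-4.9/5.0) "
    "for [EMAIL PROTECTED]:512 in 1.7 seconds, 3128 bytes."
)

if __name__ == "__main__":
    print(parse_result(SAMPLE))
```

Lines such as "connection from" and "processing message" simply return None, so a stats loop can skip them. The 3.x-only fields (result=, scantime=) have no direct counterpart here, which matches the "adjust a couple values that don't apply" remark.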
RE: rules better than bayes?
> -----Original Message-----
> From: Jim Maul [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, January 18, 2006 8:55 AM
> To: users@spamassassin.apache.org
> Subject: Re: rules better than bayes?
>
> Dallas L. Engelken wrote:
> >> -----Original Message-----
> >> From: Jim Maul [mailto:[EMAIL PROTECTED]
> >> Sent: Wednesday, January 11, 2006 1:49 PM
> >> To: Chris Lear
> >> Cc: users@spamassassin.apache.org
> >> Subject: Re: rules better than bayes?
> >>
> >> Chris Lear wrote:
> >>> * Jim Maul wrote (11/01/06 17:48):
> >>> [...]
> >>>> i dont have any sa-stats.pl on my system, and i recall some confusion
> >>>> with different scripts named the same thing so im not sure. If you
> >>>> can provide me with a location to obtain the sa-stats.pl script you
> >>>> are talking about i'll try to give it a run when i get some time. Im
> >>>> running 2.64 through qmail-scanner if it matters.
> >>> Here's a version of sa-stats that works. I remember having a hard time
> >>> finding it, so hopefully this saves you some effort.
> >>> I've edited this line:
> >>> if (!defined $FILE) { $FILE='^spamd$' } # regex
> >>> but it's overridable on the commandline anyway.
> >>>
> >>> Chris
> >>>
> >>>
> >>> #!/usr/bin/perl
> >>>
> >>> # file: sa-stats.pl
> >>> # date: 2005-07-27
> >>> # version: 0.9
> >>> # author: Dallas Engelken <[EMAIL PROTECTED]>
> >>> # desc: SA 3.x log parser
> >> This appears to be for 3.x (the description above). Will this work for
> >> 2.64 which im still running? Is there a working version somewhere
> >> that will?
> >>
> >
> > Tell ya truth, I don't even know if it works on 2.64. It was created
> > after 3.0 was released. If your SA logs to maillog, just run it and
> > find out. If you see data, it does... It doesn't take long to test
> > this perl script because it doesn't have any prereqs that wouldn't
> > already be on a SA installed box.
> >
> > There is also http://www.rulesemporium.com/programs/sa-stats-1.0.txt
> > for 3.1.x which supports per-domain and per-user stats... But that's
> > just FYI.
> >
> > Dallas
> >
>
> This doesnt work for 2.64 by the way. Its looking for result= and
> scantime= and various other things which arent in my spamd log.
> My log entries look like:
>
> Jan 18 09:51:30 external spamd[2783]: connection from localhost [127.0.0.1] at port 39076
> Jan 18 09:51:30 external spamd[16806]: processing message <[EMAIL PROTECTED] ro.us> for [EMAIL PROTECTED]:512.
> Jan 18 09:51:31 external spamd[16806]: clean message (-4.9/5.0) for [EMAIL PROTECTED]:512 in 1.7 seconds, 3128 bytes.
>
> Thanks anyway for the help,
>
> Jim
>

Should be fairly simple to modify the regex to work with 2.64 and then adjust a couple values that don't apply. Is it impossible to upgrade your SA install?

Dallas
Re: rules better than bayes?
Dallas L. Engelken wrote: -Original Message- From: Jim Maul [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 11, 2006 1:49 PM To: Chris Lear Cc: users@spamassassin.apache.org Subject: Re: rules better than bayes? Chris Lear wrote: * Jim Maul wrote (11/01/06 17:48): [...] i dont have any sa-stats.pl on my system, and i recall some confusion with different scripts named the same thing so im not sure. If you can provide me with a location to obtain the sa-stats.pl script you are talking about i'll try to give it a run when i get some time. Im running 2.64 through qmail-scanner if it matters. Here's a version of sa-stats that works. I remember having a hard time finding it, so hopefully this saves you some effort. I've edited this line: if (!defined $FILE) { $FILE='^spamd$' } # regex but it's overridable on the commandline anyway. Chris #!/usr/bin/perl # file: sa-stats.pl # date: 2005-07-27 # version: 0.9 # author: Dallas Engelken <[EMAIL PROTECTED]> # desc: SA 3.x log parser This appears to be for 3.x (the description above). Will this work for 2.64 which im still running? Is there a working version somewhere that will? Tell ya truth, I don't even know if it works on 2.64. It was created after 3.0 was released. If your SA logs to maillog, just run it and find out. If you see data, it does... It doesn't take long to test this perl script because it doesn't have any prereqs that wouldn't already be on a SA installed box. There is also http://www.rulesemporium.com/programs/sa-stats-1.0.txt for 3.1.x which supports per-domain and per-user stats... But that's just FYI. Dallas This doesnt work for 2.64 by the way. Its looking for result= and scantime= and various other things which arent in my spamd log. My log entries look like: Jan 18 09:51:30 external spamd[2783]: connection from localhost [127.0.0.1] at port 39076 Jan 18 09:51:30 external spamd[16806]: processing message <[EMAIL PROTECTED]> for [EMAIL PROTECTED]:512. 
Jan 18 09:51:31 external spamd[16806]: clean message (-4.9/5.0) for [EMAIL PROTECTED]:512 in 1.7 seconds, 3128 bytes.

Thanks anyway for the help,
Jim
RE: rules better than bayes?
> -----Original Message-----
> From: Jim Maul [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, January 11, 2006 1:49 PM
> To: Chris Lear
> Cc: users@spamassassin.apache.org
> Subject: Re: rules better than bayes?
>
> Chris Lear wrote:
> > * Jim Maul wrote (11/01/06 17:48):
> > [...]
> >> i dont have any sa-stats.pl on my system, and i recall some confusion
> >> with different scripts named the same thing so im not sure. If you
> >> can provide me with a location to obtain the sa-stats.pl script you
> >> are talking about i'll try to give it a run when i get some time. Im
> >> running 2.64 through qmail-scanner if it matters.
> >
> > Here's a version of sa-stats that works. I remember having a hard time
> > finding it, so hopefully this saves you some effort.
> > I've edited this line:
> > if (!defined $FILE) { $FILE='^spamd$' } # regex
> > but it's overridable on the commandline anyway.
> >
> > Chris
> >
> >
> > #!/usr/bin/perl
> >
> > # file: sa-stats.pl
> > # date: 2005-07-27
> > # version: 0.9
> > # author: Dallas Engelken <[EMAIL PROTECTED]>
> > # desc: SA 3.x log parser
> >
>
> This appears to be for 3.x (the description above). Will this work for
> 2.64 which im still running? Is there a working version somewhere
> that will?
>

Tell ya truth, I don't even know if it works on 2.64. It was created after 3.0 was released. If your SA logs to maillog, just run it and find out. If you see data, it does... It doesn't take long to test this perl script because it doesn't have any prereqs that wouldn't already be on a SA installed box.

There is also http://www.rulesemporium.com/programs/sa-stats-1.0.txt for 3.1.x which supports per-domain and per-user stats... But that's just FYI.

Dallas
Re: rules better than bayes?
Chris Lear wrote:
> * Jim Maul wrote (11/01/06 17:48):
> [...]
>> i dont have any sa-stats.pl on my system, and i recall some confusion
>> with different scripts named the same thing so im not sure. If you
>> can provide me with a location to obtain the sa-stats.pl script you
>> are talking about i'll try to give it a run when i get some time. Im
>> running 2.64 through qmail-scanner if it matters.
>
> Here's a version of sa-stats that works. I remember having a hard time
> finding it, so hopefully this saves you some effort.
> I've edited this line:
> if (!defined $FILE) { $FILE='^spamd$' } # regex
> but it's overridable on the commandline anyway.
>
> Chris
>
>
> #!/usr/bin/perl
>
> # file: sa-stats.pl
> # date: 2005-07-27
> # version: 0.9
> # author: Dallas Engelken <[EMAIL PROTECTED]>
> # desc: SA 3.x log parser

This appears to be for 3.x (the description above). Will this work for 2.64, which I'm still running? Is there a working version somewhere that will?

Thanks,
-Jim
Re: rules better than bayes?
jdow wrote: From: "Jim Maul" <[EMAIL PROTECTED]> Chris Santerre wrote: > -Original Message- > From: jo3 [mailto:[EMAIL PROTECTED] > Sent: Monday, January 09, 2006 2:28 PM > To: users@spamassassin.apache.org > Subject: rules better than bayes? > > > Hi, > > This is an observation, please take it in the spirit in which it is > intended, it is not meant to be flame bait. > > After using spamassassin for six solid months, it seems to me > that the > bayes process (sa-learn [--spam | --ham]) has only very > limited success > in learning about new spam. Regardless of how many spams and > hams are > submitted, the effectiveness never goes above the default > level which, > in our case here, is somewhere around 2 out of 3 spams correctly > identified. By the same token, after adding the "third party" rule, > airmax.cf, the effectiveness went up to 99 out of 100 spams correctly > identified. I have long said that IMHO, I do not think bayes is worth it. Left unattended, it isn't as good. A simple rule can take out a lot of spam. Some may say rule writing is more complicated then training bayes. Maybe. Not so much the rule writing, but the figuring out what to look for and testing for FPs. I do not run Bayes for our company. Obviously I'm partial to URIBL.com and SARE rules ;) I get about 98% of spam caught, and little FPs. This is going to sound like tooting our own horn, but so be it. Before SARE, Bayes was cool. After SARE, I see no need. I always feel i have to point out the flip side to this just to offer another opinion. While i certainly dont have a NEED for bayes at our facility, i do run it, complete with autolearn. We have very low volume (5k msgs/day) but it works so well i rarely ever have to think about it. For us, 96% of the time bayes alone is enough to say whether a message is ham/spam. Add all the other tests on top of this (uribl, razor, a few sare, and theres easily a 20 point difference between ham and spam. 
Jim, can you back that up with a run of the SARE version of sa_stats.pl? I'd love to see your record with that setup for the highest and lowest ranking BAYES scores. {^_^}

I don't have any sa-stats.pl on my system, and I recall some confusion with different scripts named the same thing, so I'm not sure. If you can provide me with a location to obtain the sa-stats.pl script you are talking about, I'll try to give it a run when I get some time. I'm running 2.64 through qmail-scanner if it matters.

-Jim
Re: rules better than bayes?
From: "Matt Kettler" <[EMAIL PROTECTED]>

> At 10:50 AM 1/10/2006, Chris Santerre wrote:
>
>> I have long said that IMHO, I do not think bayes is worth it. Left
>> unattended, it isn't as good. A simple rule can take out a lot of spam.
>> Some may say rule writing is more complicated then training bayes. Maybe.
>> Not so much the rule writing, but the figuring out what to look for and
>> testing for FPs.
>
> Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of
> hits. (And has 98.91% of URIBL's total hits) I find it completely
> indispensable.

It's number 1 here on scoring spam, 83.22 for 0.05 of ham with "can't remember the last ham scoring on 99 that hit the spam folder." 99 has a score of 5 here because it does, all alone, tag spam that no other rule hits. XBL is the best BL here at the moment, 55.50% for 0.04% of hits on ham.

> I rarely train manually, except at initial setup where I feed it a good
> base learning. (the autolearner can sometimes go awry if you don't train
> some mail manually before letting it go.)

I manually learn, particularly on spam not marked as spam that has a low BAYES score and some "meat in it." (I don't bother with content free spam. Those very quickly score higher due to BL hits that pop up like magic.)

{^_^}
Re: rules better than bayes?
From: "Jim Maul" <[EMAIL PROTECTED]> Chris Santerre wrote: > -Original Message- > From: jo3 [mailto:[EMAIL PROTECTED] > Sent: Monday, January 09, 2006 2:28 PM > To: users@spamassassin.apache.org > Subject: rules better than bayes? > > > Hi, > > This is an observation, please take it in the spirit in which it is > intended, it is not meant to be flame bait. > > After using spamassassin for six solid months, it seems to me > that the > bayes process (sa-learn [--spam | --ham]) has only very > limited success > in learning about new spam. Regardless of how many spams and > hams are > submitted, the effectiveness never goes above the default > level which, > in our case here, is somewhere around 2 out of 3 spams correctly > identified. By the same token, after adding the "third party" rule, > airmax.cf, the effectiveness went up to 99 out of 100 spams correctly > identified. I have long said that IMHO, I do not think bayes is worth it. Left unattended, it isn't as good. A simple rule can take out a lot of spam. Some may say rule writing is more complicated then training bayes. Maybe. Not so much the rule writing, but the figuring out what to look for and testing for FPs. I do not run Bayes for our company. Obviously I'm partial to URIBL.com and SARE rules ;) I get about 98% of spam caught, and little FPs. This is going to sound like tooting our own horn, but so be it. Before SARE, Bayes was cool. After SARE, I see no need. I always feel i have to point out the flip side to this just to offer another opinion. While i certainly dont have a NEED for bayes at our facility, i do run it, complete with autolearn. We have very low volume (5k msgs/day) but it works so well i rarely ever have to think about it. For us, 96% of the time bayes alone is enough to say whether a message is ham/spam. Add all the other tests on top of this (uribl, razor, a few sare, and theres easily a 20 point difference between ham and spam. 
Jim, can you back that up with a run of the SARE version of sa_stats.pl? I'd love to see your record with that setup for the highest and lowest ranking BAYES scores. {^_^}
Re: rules better than bayes?
From: "Chris Santerre" <[EMAIL PROTECTED]>

>> -----Original Message-----
>> From: jo3 [mailto:[EMAIL PROTECTED]
>>
>> Hi,
>>
>> This is an observation, please take it in the spirit in which it is
>> intended, it is not meant to be flame bait.
>>
>> After using spamassassin for six solid months, it seems to me that the
>> bayes process (sa-learn [--spam | --ham]) has only very limited success
>> in learning about new spam. Regardless of how many spams and hams are
>> submitted, the effectiveness never goes above the default level which,
>> in our case here, is somewhere around 2 out of 3 spams correctly
>> identified. By the same token, after adding the "third party" rule,
>> airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
>> identified.
>
> I have long said that IMHO, I do not think bayes is worth it. Left
> unattended, it isn't as good. A simple rule can take out a lot of spam.
> Some may say rule writing is more complicated then training bayes. Maybe.
> Not so much the rule writing, but the figuring out what to look for and
> testing for FPs.
>
> I do not run Bayes for our company. Obviously I'm partial to URIBL.com and
> SARE rules ;) I get about 98% of spam caught, and little FPs.
>
> This is going to sound like tooting our own horn, but so be it. Before
> SARE, Bayes was cool. After SARE, I see no need.

Autolearning Bayes is not really very good based on what people here seem to say.

I do note that I raised my BAYES_99 score to 5. If BAYES_99 hits, the odds that the message is spam are so high that it's silly to give BAYES_99 a low score, theoretical nonsense notwithstanding. If you apply the wrong statistical theory with the wrong conceptual criteria, the math or theory may be good but the results are trash.

For an existing spam database the rule setup that exists is probably quite good. If 99 hits then other rules probably hit as well. This leads to artificially lowering the 99 score. Then when a new technique comes along that Bayes can recognize but nothing else does, the message floats on through.

At least on this system 99 misses once in 2000 to 1 times. Most of those times other very light whitelisting rules let the messages come through. Probably the right score for more general use would be 4.95 or something such that if any other rule hits it's dinged as spam. It depends on your spam tolerance compared to your tolerance for sorting spam by score and looking at the few that are marginal.

Anyway, making that ONE change made the already good results I was getting with SARE and BAYES combined quite a bit better. Missed spam went down almost a factor of 10 and tagged ham went up by about 1 in 10,000 or less. (I can't remember the last time I got a ham marked as spam on the sole basis of BAYES_99 with a score of 5 that I had to fetch out of the spam folder.)

I take this as a proof of concept that penalizing a rule for being too good is ridiculous on its face, statistical theories notwithstanding. I maintain this is a positive indication that either the criteria, the chosen statistical approach, or both are wrong.

It might be entertaining to set up "stock" BAYES on your system, Chris, with all BAYES scores being very very low, 0.01 or something. Then run the SARE version of sa_stats.pl to see what the "goodness" of each BAYES level really is. From that you can guesstimate some scores that might improve your system. I'd be really interested to see what the autolearn BAYES really can perform like when it's used in your sort of environment. I know for my environment it's silly to use it due to the automated mis-learning on marginal messages. (Either it learns wrong or not at all on the most critical portions of the email load, the marginal messages.)

{^_^} Joanne steps down off her soapbox yet again.
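The 5 versus 4.95 point above is just threshold arithmetic, and a small Python sketch may make it concrete. It assumes the 5.0 required score seen in the "(-4.9/5.0)" log sample elsewhere in this thread, and assumes a message is tagged once its summed score reaches that value; `is_tagged` and the 0.1 and -0.5 companion scores are made up for illustration.

```python
# ASSUMPTION: required score of 5.0, as in the "(-4.9/5.0)" log sample.
REQUIRED_SCORE = 5.0


def is_tagged(rule_scores, required=REQUIRED_SCORE):
    """Return True when the summed rule scores reach the spam threshold."""
    return sum(rule_scores) >= required


if __name__ == "__main__":
    # BAYES_99 scored at 5: tags a message with no help from other rules.
    print(is_tagged([5.0]))
    # BAYES_99 scored at 4.95: not enough alone ...
    print(is_tagged([4.95]))
    # ... but any other small positive hit (0.1 is hypothetical) tips it over.
    print(is_tagged([4.95, 0.1]))
    # A light whitelisting rule (-0.5 is hypothetical) lets a BAYES_99 hit through.
    print(is_tagged([5.0, -0.5]))
```

The last case is the "very light whitelisting rules let the messages come through" situation: even a score of 5 for BAYES_99 only decides the outcome when nothing negative fires alongside it.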
Re: rules better than bayes?
Chris Santerre wrote:
>
> I have long said that IMHO, I do not think bayes is worth it. Left
> unattended, it isn't as good. A simple rule can take out a lot of spam. Some
> may say rule writing is more complicated then training bayes. Maybe. Not so
> much the rule writing, but the figuring out what to look for and testing for
> FPs.
>
> I do not run Bayes for our company. Obviously I'm partial to URIBL.com and
> SARE rules ;) I get about 98% of spam caught, and little FPs.
>
> This is going to sound like tooting our own horn, but so be it. Before SARE,
> Bayes was cool. After SARE, I see no need.

I think SARE and bayes are complementary:
- sare will detect new spam once ninjas have found the corresponding rules.
- bayes will detect new spam if it resembles previous spam.

That said, I don't use SA/Bayes (I use dspam on a per-user basis, while SA is site-wide).
Re: rules better than bayes? Hamtrap learning.
Andrew Donkin wrote:
> Matt Kettler <[EMAIL PROTECTED]> writes:
>
>
>>if [ -f /var/spool/mail/spamtrap ]; then
>>  echo learning spam mailbox - spamtrap
>>  mv /var/spool/mail/spamtrap .
>>  /usr/bin/sa-learn --spam --mbox spamtrap
>>  rm spam/spamtrap.alearn5.gz
>>  mv spam/spamtrap.alearn4.gz spam/spamtrap.alearn5.gz
>>  mv spam/spamtrap.alearn3.gz spam/spamtrap.alearn4.gz
>>  mv spam/spamtrap.alearn2.gz spam/spamtrap.alearn3.gz
>>  gzip spam/spamtrap.alearn1
>>  mv spam/spamtrap.alearn1.gz spam/spamtrap.alearn2.gz
>>
>>  mv spamtrap spam/spamtrap.alearn1
>>fi
>
>
> I'll put my Captain Pedantic hat on and point out that if your MTA is
> writing to /var/spool/mail/spamtrap at the time that you learn it,
> which is quite possible if /var/spool/training/ is on the same
> filesystem as /var/spool/mail/, sa-learn may end up chewing on a
> half-finished message.

Actually, they're on separate filesystems. But you're right, I forgot that mv can "move" a file within a filesystem and another process can still write to it with an old file descriptor.
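The descriptor behaviour described above is easy to see for yourself. Below is a minimal Python sketch, assuming a POSIX system and a single temporary directory (so the rename stays within one filesystem); the file names and `demo_rename_race` are invented for the demonstration and have nothing to do with a real mail spool.

```python
import os
import tempfile


def demo_rename_race():
    """Show that a writer keeps appending to a file after it is renamed."""
    with tempfile.TemporaryDirectory() as tmp:
        spool = os.path.join(tmp, "spamtrap")        # hypothetical spool file
        moved = os.path.join(tmp, "spamtrap.moved")  # where "mv" puts it

        writer = open(spool, "w")  # stands in for the MTA mid-delivery
        writer.write("first half of a message\n")
        writer.flush()

        os.rename(spool, moved)  # same filesystem: only the name changes

        with open(moved) as reader:  # what a learner would read right now
            snapshot = reader.read()

        writer.write("second half of a message\n")  # old descriptor still works
        writer.close()

        with open(moved) as reader:
            final = reader.read()

        return snapshot, final, os.path.exists(spool)


if __name__ == "__main__":
    snapshot, final, still_there = demo_rename_race()
    print(repr(snapshot))
    print(repr(final))
    print(still_there)
```

The snapshot taken right after the rename holds only half the message, which is the "half-finished message" case. Locking the mailbox the same way the MTA does before moving it is the usual way around this, though which locking scheme applies depends on the MTA and is not covered in this thread.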
Re: rules better than bayes? Certainly better than mine.
Andrew Donkin wrote:
> Jim Maul <[EMAIL PROTECTED]> writes:
>
>> NOTE: to operate in this fashion i believe it is imperative that you
>> change the autolearn thresholds. The defaults are dangerous! (atleast
>> in 2.64 which i still run). I have mine set as such:
>>
>> bayes_auto_learn_threshold_nonspam -0.1
>> bayes_auto_learn_threshold_spam 10.0
>
> Matt agreed. Aaron was going to change to something similar.
>
> Before reading this thread, I did the opposite. I changed my nonspam
> threshold from -0.2 to the default 0.1 because I thought (mistakenly
> perhaps) that the Bayes database's spam:ham ratio was far too high.
> Incoming mail is about 3:1, but the Bayes database was more like 20:1.
> See:
>
> 3 bayes db version
> 1491805 nspam
> 75795 nham
> 1081029 ntokens
> 1136779207 oldest atime
> 1136925099 newest atime
> 1136925026 last journal sync atime
> 1136838312 last expiry atime
> 43200 last expire atime delta
> 25087 last expire reduction count

I started autolearning with the defaults and then quickly changed my thresholds as mentioned before. Our server here doesn't see a lot of spam (hell, it doesn't even see a lot of mail in total), so our ratios are obviously going to be different. Mine shows:

2 0 non-token data: bayes db version
26378 0 non-token data: nspam
54313 0 non-token data: nham
147479 0 non-token data: ntokens
1134172970 0 non-token data: oldest atime
1136925620 0 non-token data: newest atime
1136925554 0 non-token data: last journal sync atime
1136232703 0 non-token data: last expiry atime
2060396 0 non-token data: last expire atime delta
34608 0 non-token data: last expire reduction count

> In particular, a message from James Keating of this list received this
> report from Bayes:
>
> X-Spam-Bayes-ham: 0.011-8--5h-0s--19d--SpamAssassin,
>   0.026-3--2h-0s--19d--autolearn,
>   0.029-203--156h-39s--19d--5.0,
>   0.031-7--5h-1s--19d--spamassassin,
>   0.050-4162--3796h-1707s--0d--i'm
> X-Spam-Bayes-spam: 1.000-149--0h-6920s--1d--HX-Accept-Language:en-us,
>   1.000-27--0h-1229s--18d--H*UA:Thunderbird,
>   1.000-24--0h-1083s--18d--H*u:Thunderbird,
>   1.000-16--0h-718s--0d--H*RU:sk:cpe-24-,
>   1.000-13--0h-594s--11d--H*r:sk:cpe-24-
>
> ...implying that "User-agent: Thunderbird" was in a thousand spams but
> no hams. And that "Accept-Language:en-us" was in 6900 spams and no hams!
>
> So, I'm thinking that my Bayes is hosed again. Will a hamtrap help me
> here?

I'm not sure. I've never seen this report before, and I certainly don't have the same message to compare what it scored on my system here. Have you noticed bayes misclassifying messages because of this, or are you speaking theoretically? A huge ratio alone does not imply a problem; it's the results that matter.

> I'm CCing you, Jim, because my last two posts to the list vanished
> without a trace.

Not a problem. Just not sure how much help I am in this situation...

-Jim
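To put numbers on the two databases compared above, here is a rough Python sketch that pulls nspam and nham out of the dump lines exactly as they were pasted in this thread (count first, label last) and computes the spam:ham ratio. The pasted lines appear to have lost some leading columns of the real dump output, so this parser is an assumption that fits the thread, not the tool's full format; `spam_ham_ratio` is a hypothetical helper.

```python
def spam_ham_ratio(dump_lines):
    """Return nspam / nham from dump lines laid out as pasted in the thread."""
    counts = {}
    for line in dump_lines:
        fields = line.split()
        if not fields:
            continue
        label = fields[-1]  # the label is the last word in both pastes
        if label in ("nspam", "nham"):
            counts[label] = int(fields[0])  # the count is the first field
    return counts["nspam"] / counts["nham"]


# The two pairs of lines quoted in the message above.
ANDREW = ["1491805 nspam", "75795 nham"]
JIM = ["26378 0 non-token data: nspam", "54313 0 non-token data: nham"]

if __name__ == "__main__":
    print(round(spam_ham_ratio(ANDREW), 1))  # roughly 20:1, as described
    print(round(spam_ham_ratio(JIM), 2))     # about one spam per two hams
```

As noted in the reply, the ratio by itself says little; it is only worth acting on if messages are actually being misclassified.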
Re: rules better than bayes?
Good evening, Justin, all,

On Tue, 10 Jan 2006, Justin Mason wrote:

> Matt Kettler writes:
>> At 10:50 AM 1/10/2006, Chris Santerre wrote:
>>
>>> I have long said that IMHO, I do not think bayes is worth it. Left
>>> unattended, it isn't as good. A simple rule can take out a lot of spam.
>>> Some may say rule writing is more complicated then training bayes. Maybe.
>>> Not so much the rule writing, but the figuring out what to look for and
>>> testing for FPs.
>>
>> Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of
>> hits. (And has 98.91% of URIBL's total hits) I find it completely
>> indispensable.
>
> The thing is, Bayes is a tool for personalization -- and as such,
> its effectiveness varies widely depending on what *you* do with it.
>
> For what it's worth, I've *never* trained my current Bayes DB, and have
> been running with it for about 6 months I think. I get BAYES_00 on most
> ham, and BAYES_99 on most spam.
>
> But the 4 letters that matter with Bayes are: YMMV

Isn't that an OTCBB Ticker symbol? I heard they're about to go through the _roof_!!

/me ducks...

Cheers,
- Bill

---
"We don't want an election without a paper trail...all three owners of the companies who make these machines are donors to the Bush administration. Is this not corruption?"
-- Gore Vidal
(Courtesy of http://www.laweekly.com/ink/03/52/features-cooper.php)
--
William Stearns ([EMAIL PROTECTED]). Mason, Buildkernel, freedups, p0f, rsync-backup, ssh-keyinstall, dns-check, more at: http://www.stearns.org
--
Re: rules better than bayes?
Matt Kettler writes:
> At 10:50 AM 1/10/2006, Chris Santerre wrote:
>
> >I have long said that IMHO, I do not think bayes is worth it. Left
> >unattended, it isn't as good. A simple rule can take out a lot of spam.
> >Some may say rule writing is more complicated then training bayes. Maybe.
> >Not so much the rule writing, but the figuring out what to look for and
> >testing for FPs.
>
> Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of
> hits. (And has 98.91% of URIBL's total hits) I find it completely
> indispensable.

The thing is, Bayes is a tool for personalization -- and as such, its effectiveness varies widely depending on what *you* do with it.

For what it's worth, I've *never* trained my current Bayes DB, and have been running with it for about 6 months I think. I get BAYES_00 on most ham, and BAYES_99 on most spam.

But the 4 letters that matter with Bayes are: YMMV ;)

--j.
Re: rules better than bayes?
Bayes wouldn't be much good if not for the rules, which create a basic compass as to what is spam and what is not. The rules are in large part what makes Bayes work.
RE: rules better than bayes?
> Im not matt, but running a very similar setup which works
> very well so i thought i would comment also. Im running a
> single sitewide database.
> All mail is processed under my spamd user.

OK, that's basically what I'm doing too.

>
> I rarely train manually as well.
> NOTE: to operate in this fashion i believe it is imperative that you
> change the autolearn thresholds. The defaults are dangerous! (atleast
> in 2.64 which i still run). I have mine set as such:
>
> bayes_auto_learn_threshold_nonspam -0.1
> bayes_auto_learn_threshold_spam 10.0
>

OK, Matt said something similar about the thresholds. Mine are default so that may be part of the issue.

Thanks for the feedback!
-Aaron
RE: rules better than bayes?
> Erm, that really shouldn't affect the bayes autolearner.. perhaps you are
> thinking of the AWL? I don't run the AWL for this very reason.
>

Oh yeah. I was thinking of the AWL. NM.

> The problem is this requires some customization. This can't be a default
> setup of SA as the "catch phrases" vary from place to place, and if there
> was a default set of them spammers would be sure to always include them,
> making them pointless. You'd effectively have the same thing as the
> current default, by avoiding spam rules and existing bayes tokens they
> can get a message learned.
>

That all makes sense. I'll give it a shot.

Thanks!
-Aaron
Re: rules better than bayes?
Aaron Grewell wrote:
> The trouble I had with the autolearner was that some spammers would send
> innocuous mail through to raise their scores until Bayes decided they
> were ok, then start spamming. That was a couple of versions back, does
> that sort of thing no longer work?

Are you sure this is Bayes-related? Bayes looks at the entire message, not just the sender. All I'd expect this tactic to do would be to make future innocuous mail look more innocuous -- it shouldn't have any significant impact on spammy mail from the same source since the content will be different.

--
Kelson Vibber
SpeedGate Communications,
Re: rules better than bayes?
Aaron Grewell wrote:
> Hi Matt, I'm interested in how your setup compares to mine. I also find
> Bayes very useful, but I haven't gotten it to work as well as what
> you've described.
>
>> Interesting.. For me, BAYES_99 is right between SURBL and URIBL in
>> terms of hits. (And has 98.91% of URIBL's total hits) I find it
>> completely indispensable.
>
> Are you using a single site-wide database, or is this a per-user setup?

I'm not Matt, but I'm running a very similar setup which works very well, so I thought I would comment also. I'm running a single sitewide database. All mail is processed under my spamd user.

>> I rarely train manually, except at initial setup where I feed it a
>> good base learning. (the autolearner can sometimes go awry if you
>> don't train some mail manually before letting it go.)
>
> The trouble I had with the autolearner was that some spammers would send
> innocuous mail through to raise their scores until Bayes decided they
> were ok, then start spamming. That was a couple of versions back, does
> that sort of thing no longer work?

I rarely train manually as well. The only ones I train (and it's only because there is nothing else to train) are spam which are correctly identified as such but have autolearn=no because they did not meet the autolearn criteria. These almost always have BAYES_99 and a score of 20 or so, but most likely did not have enough header points to be autolearned. I didn't even start by training my database manually. I started from scratch and let the autolearner do its thing. I have never had to correct what it did because it was always, always right. The poison that spammers like to include in messages doesn't appear to have any effect on the overall outcome of the bayes score. I don't really know why this is, it just works.

NOTE: to operate in this fashion I believe it is imperative that you change the autolearn thresholds. The defaults are dangerous! (At least in 2.64, which I still run.) I have mine set as such:

bayes_auto_learn_threshold_nonspam -0.1
bayes_auto_learn_threshold_spam 10.0

To this date (been running over 2 years) I have yet to see the autolearner misclassify. Most bayes hits are at the far extremes (bayes_99 and bayes_0) with only a few in the 80-90 range.

>> On a day to day basis I mostly feed automatically with a cronjob that
>> collects mail via spamtraps and hamtraps. I have that coupled with
>> autolearning that's set a bit differently than the defaults. (IMNSHO,
>> having a ham learning threshold that's positive is suicide, but I also
>> have a large number of small negative-score rules so I can keep my
>> threshold at -0.01 and actually autolearn some ham).
>
> I'd love to make my Bayesian database more effective, is there a doc
> somewhere that describes how you tuned it to your environment?

I doubt there is anything that specific, and if there was, it most likely wouldn't help you in your situation. There are general tuning notes on the SA website and such, but you really just have to try and see what works and what doesn't in your setup. What works well for one person may not work at all for someone else.

-Jim
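To make the two thresholds concrete, here is a minimal shell sketch of how a score falls into learn-as-ham, learn-as-spam, or neither. The function name autolearn_decision is made up for illustration, the boundary handling (strictly below the nonspam threshold, at or above the spam threshold) is an assumption, and it deliberately ignores SpamAssassin's other autolearn conditions, such as the header-points requirement mentioned above. Treat it as a picture of the idea, not as SA's actual logic.

```sh
#!/bin/sh
# Sketch only: classify a score against the two autolearn thresholds.
# autolearn_decision is a hypothetical helper, not a SpamAssassin command.
# Assumption: ham if score < nonspam threshold, spam if score >= spam
# threshold, otherwise nothing is learned.
autolearn_decision() {
    score=$1       # message score used for the decision
    ham_max=$2     # bayes_auto_learn_threshold_nonspam
    spam_min=$3    # bayes_auto_learn_threshold_spam

    # awk does the floating-point comparison; "+ 0" forces numeric context.
    awk -v s="$score" -v h="$ham_max" -v p="$spam_min" 'BEGIN {
        if (s + 0 < h + 0)       print "ham"
        else if (s + 0 >= p + 0) print "spam"
        else                     print "no"
    }'
}
```

With the thresholds from this message, `autolearn_decision 0.3 -0.1 10.0` lands in the middle band, so a message that merely avoids every rule is not learned as ham.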
Re: rules better than bayes?
Aaron Grewell wrote:
> Hi Matt, I'm interested in how your setup compares to mine. I also find
> Bayes very useful, but I haven't gotten it to work as well as what
> you've described.
>
>> Interesting.. For me, BAYES_99 is right between SURBL and URIBL in
>> terms of hits. (And has 98.91% of URIBL's total hits) I find it
>> completely indispensable.
>
> Are you using a single site-wide database, or is this a per-user setup?

Single site-wide.. I use mailscanner which does not support per-user, but I'm not really looking for it.

>> I rarely train manually, except at initial setup where I feed it a
>> good base learning. (the autolearner can sometimes go awry if you
>> don't train some mail manually before letting it go.)
>
> The trouble I had with the autolearner was that some spammers would send
> innocuous mail through to raise their scores until Bayes decided they
> were ok, then start spamming. That was a couple of versions back, does
> that sort of thing no longer work?

Erm, that really shouldn't affect the bayes autolearner.. perhaps you are thinking of the AWL? I don't run the AWL for this very reason.

>> On a day to day basis I mostly feed automatically with a cronjob that
>> collects mail via spamtraps and hamtraps. I have that coupled with
>> autolearning that's set a bit differently than the defaults. (IMNSHO,
>> having a ham learning threshold that's positive is suicide, but I also
>> have a large number of small negative-score rules so I can keep my
>> threshold at -0.01 and actually autolearn some ham).
>
> I'd love to make my Bayesian database more effective, is there a doc
> somewhere that describes how you tuned it to your environment?

Not really.. but it's not hard.

Spamtraps and hamtraps:
---
1) create a secret "hamtrap" email account. Subscribe this account to newsletters and news feeds that your users typically subscribe to. Do not post this address around, and don't use "hamtrap" as the account name, it's too obvious.

2) create a "spamtrap" account, or several of them. Carefully seed this out in the body of some Usenet and mailing list postings.

3) create a cron-job that auto-feeds the above mail to sa-learn. Simple example fragment of the script I use (it keeps a rotating archive of the past 5 learning sessions):

    #!/bin/sh
    cd /var/spool/training/
    if [ -f /var/spool/mail/spamtrap ]; then
        echo learning spam mailbox - spamtrap
        mv /var/spool/mail/spamtrap .
        /usr/bin/sa-learn --spam --mbox spamtrap
        rm spam/spamtrap.alearn5.gz
        mv spam/spamtrap.alearn4.gz spam/spamtrap.alearn5.gz
        mv spam/spamtrap.alearn3.gz spam/spamtrap.alearn4.gz
        mv spam/spamtrap.alearn2.gz spam/spamtrap.alearn3.gz
        gzip spam/spamtrap.alearn1
        mv spam/spamtrap.alearn1.gz spam/spamtrap.alearn2.gz
        mv spamtrap spam/spamtrap.alearn1
    fi

4) Carefully monitor the data being fed for a while (two weeks or so) to make sure there's no pollution. After it's established you can monitor it less often.

Autolearn adjustment:
1) add bayes_auto_learn_threshold_nonspam -0.01 to your local.cf

2) create a "bayes_hamlearning.cf" file. Create several simple body text rules with "catch phrases" from your normal nonspam. Assign these rules very small negative scores (-0.01 to -0.1). This is generally easier in a corporate environment, but it can be done in academic too.

    body LOCAL_THESIS /\bThesis\b/i
    score LOCAL_THESIS -0.01

You have to keep the scores small, as you don't want to use these to whitelist spam mail. You merely want to make mail that would otherwise score 0 earn a small negative score if it's got some of these phrases in it.

It's not perfect, but it's better than blindly learning everything under 0.5. I feel learning as ham should be earned, not a default for not hitting any rules at all.

The problem is this requires some customization. This can't be a default setup of SA as the "catch phrases" vary from place to place, and if there was a default set of them spammers would be sure to always include them, making them pointless. You'd effectively have the same thing as the current default, by avoiding spam rules and existing bayes tokens they can get a message learned.
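The archive rotation in the script above can also be written as a loop, which makes the number of kept sessions a parameter and stays quiet on the first few runs when the older archives don't exist yet. This is only a sketch under assumptions: rotate_archives and its four arguments are names invented here, and the file naming (NAME.alearn1 uncompressed, NAME.alearn2.gz and up compressed) is copied from the fragment above.

```sh
#!/bin/sh
# Sketch: rotate learning archives NAME.alearn1, NAME.alearn2.gz .. NAME.alearnN.gz.
# rotate_archives is a hypothetical helper, not part of SpamAssassin or of
# the original script.
rotate_archives() {
    dir=$1      # archive directory, e.g. spam
    name=$2     # mailbox name, e.g. spamtrap
    keep=$3     # number of sessions to keep, e.g. 5
    new=$4      # freshly learned mbox to file as .alearn1

    # Drop the oldest archive; -f keeps this quiet while it does not exist.
    rm -f "$dir/$name.alearn$keep.gz"

    # Compress the previous session, if any, so every older slot is a .gz.
    if [ -f "$dir/$name.alearn1" ]; then
        gzip -f "$dir/$name.alearn1"
    fi

    # Shift N-1 -> N, ..., 1 -> 2, highest first so nothing is overwritten.
    i=$((keep - 1))
    while [ "$i" -ge 1 ]; do
        if [ -f "$dir/$name.alearn$i.gz" ]; then
            mv "$dir/$name.alearn$i.gz" "$dir/$name.alearn$((i + 1)).gz"
        fi
        i=$((i - 1))
    done

    # File the new mailbox, uncompressed, as session 1.
    mv "$new" "$dir/$name.alearn1"
}
```

In the original fragment this would replace the rm/mv/gzip run, called as `rotate_archives spam spamtrap 5 spamtrap` after the sa-learn line.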
RE: rules better than bayes?
Hi Matt, I'm interested in how your setup compares to mine. I also find Bayes very useful, but I haven't gotten it to work as well as what you've described.

> Interesting.. For me, BAYES_99 is right between SURBL and URIBL in
> terms of hits. (And has 98.91% of URIBL's total hits) I find it
> completely indispensable.

Are you using a single site-wide database, or is this a per-user setup?

> I rarely train manually, except at initial setup where I feed it a
> good base learning. (the autolearner can sometimes go awry if you
> don't train some mail manually before letting it go.)

The trouble I had with the autolearner was that some spammers would send innocuous mail through to raise their scores until Bayes decided they were ok, then start spamming. That was a couple of versions back, does that sort of thing no longer work?

> On a day to day basis I mostly feed automatically with a cronjob that
> collects mail via spamtraps and hamtraps. I have that coupled with
> autolearning that's set a bit differently than the defaults. (IMNSHO,
> having a ham learning threshold that's positive is suicide, but I also
> have a large number of small negative-score rules so I can keep my
> threshold at -0.01 and actually autolearn some ham).

I'd love to make my Bayesian database more effective, is there a doc somewhere that describes how you tuned it to your environment?
RE: rules better than bayes?
At 10:50 AM 1/10/2006, Chris Santerre wrote:
> I have long said that IMHO, I do not think bayes is worth it. Left
> unattended, it isn't as good. A simple rule can take out a lot of spam.
> Some may say rule writing is more complicated than training bayes.
> Maybe. Not so much the rule writing, but the figuring out what to look
> for and testing for FPs.

Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of hits. (And has 98.91% of URIBL's total hits) I find it completely indispensable.

I rarely train manually, except at initial setup where I feed it a good base learning. (the autolearner can sometimes go awry if you don't train some mail manually before letting it go.)

On a day to day basis I mostly feed automatically with a cronjob that collects mail via spamtraps and hamtraps. I have that coupled with autolearning that's set a bit differently than the defaults. (IMNSHO, having a ham learning threshold that's positive is suicide, but I also have a large number of small negative-score rules so I can keep my threshold at -0.01 and actually autolearn some ham).

This setup is near zero maintenance, and highly effective. I can't see why it wouldn't be "worth it". It's almost as good as turning on URIBLs and not much more work. It's certainly much less work than rule writing. The last time I bothered to tinker with my bayes was before Christmas.
RE: rules better than bayes?
> I always feel i have to point out the flip side to this just to offer
> another opinion.

And I love ya for it ;) (In the kind of brotherly love one man can feel for another)

> While i certainly dont have a NEED for bayes at our facility, i do run
> it, complete with autolearn. We have very low volume (5k msgs/day) but
> it works so well i rarely ever have to think about it. For us, 96% of
> the time bayes alone is enough to say whether a message is ham/spam.
> Add all the other tests on top of this (uribl, razor, a few sare, and
> theres easily a 20 point difference between ham and spam.
>
> -Jim

LOL, yeah. The average spam score from last year has gone up quite a lot! SARE noticeably has fewer things to look at as far as new tactics to cover with rules. Making a spam go from a score of 20 to 21 just doesn't seem a big deal :)

--Chris
Re: rules better than bayes?
Chris Santerre wrote: > -Original Message- > From: jo3 [mailto:[EMAIL PROTECTED] > Sent: Monday, January 09, 2006 2:28 PM > To: users@spamassassin.apache.org > Subject: rules better than bayes? > > > Hi, > > This is an observation, please take it in the spirit in which it is > intended, it is not meant to be flame bait. > > After using spamassassin for six solid months, it seems to me > that the > bayes process (sa-learn [--spam | --ham]) has only very > limited success > in learning about new spam. Regardless of how many spams and > hams are > submitted, the effectiveness never goes above the default > level which, > in our case here, is somewhere around 2 out of 3 spams correctly > identified. By the same token, after adding the "third party" rule, > airmax.cf, the effectiveness went up to 99 out of 100 spams correctly > identified. I have long said that IMHO, I do not think bayes is worth it. Left unattended, it isn't as good. A simple rule can take out a lot of spam. Some may say rule writing is more complicated then training bayes. Maybe. Not so much the rule writing, but the figuring out what to look for and testing for FPs. I do not run Bayes for our company. Obviously I'm partial to URIBL.com and SARE rules ;) I get about 98% of spam caught, and little FPs. This is going to sound like tooting our own horn, but so be it. Before SARE, Bayes was cool. After SARE, I see no need. I always feel i have to point out the flip side to this just to offer another opinion. While i certainly dont have a NEED for bayes at our facility, i do run it, complete with autolearn. We have very low volume (5k msgs/day) but it works so well i rarely ever have to think about it. For us, 96% of the time bayes alone is enough to say whether a message is ham/spam. Add all the other tests on top of this (uribl, razor, a few sare, and theres easily a 20 point difference between ham and spam. -Jim
RE: rules better than bayes?
> -----Original Message-----
> From: jo3 [mailto:[EMAIL PROTECTED]]
> Sent: Monday, January 09, 2006 2:28 PM
> To: users@spamassassin.apache.org
> Subject: rules better than bayes?
>
> Hi,
>
> This is an observation, please take it in the spirit in which it is
> intended, it is not meant to be flame bait.
>
> After using spamassassin for six solid months, it seems to me that the
> bayes process (sa-learn [--spam | --ham]) has only very limited success
> in learning about new spam. Regardless of how many spams and hams are
> submitted, the effectiveness never goes above the default level which,
> in our case here, is somewhere around 2 out of 3 spams correctly
> identified. By the same token, after adding the "third party" rule,
> airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
> identified.

I have long said that IMHO, I do not think bayes is worth it. Left unattended, it isn't as good. A simple rule can take out a lot of spam. Some may say rule writing is more complicated than training bayes. Maybe. Not so much the rule writing, but the figuring out what to look for and testing for FPs.

I do not run Bayes for our company. Obviously I'm partial to URIBL.com and SARE rules ;) I get about 98% of spam caught, and few FPs.

This is going to sound like tooting our own horn, but so be it. Before SARE, Bayes was cool. After SARE, I see no need.

Chris Santerre
SysAdmin and SARE/URIBL ninja
http://www.uribl.com
http://www.rulesemporium.com
Re: rules better than bayes?
Robert Bartlett writes:
> Ok I confused myself. Im sorry for being an idiot. I get it now.
> Everytime an email comes in it tries to access it as the user, since
> bayes is being feed to just the root account it doesn't see anything
> for the users in bayes. With the override I force it to use the root
> account for all emails coming in. Boy am I stupid.
>
> Thanks
> Robert

Try out this to find the right value for bayes_sql_override_username.

SELECT id, username, spam_count, ham_count, token_count FROM bayes_vars;

- dhawal

> -----Original Message-----
> From: Robert Bartlett [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 09, 2006 1:52 PM
> To: users@spamassassin.apache.org
> Subject: RE: rules better than bayes?
>
> Sorry for the confusion, I do use a site wide bayes database, I thought
> the information I sent below was the site wide information the system
> uses to access the bayes database.
>
> Thanks
> Robert
>
> -----Original Message-----
> From: Matt Kettler [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 09, 2006 1:47 PM
> To: Robert Bartlett
> Cc: users@spamassassin.apache.org
> Subject: Re: rules better than bayes?
>
> Robert Bartlett wrote:
> > This is what I have in my local.cf file:
> >
> > bayes_store_module Mail::SpamAssassin::BayesStore::SQL
> > bayes_sql_dsn DBI:mysql:**:localhost:3306
> > bayes_sql_username
> > bayes_sql_password
> >
> > Obviously I hid the data that I didn't want to show with *. When I run
> > sa-learn it trains into the mysql database just fine, I assume SA
> > connects to it just fine because of that.
>
> That's all the database login information. That doesn't mean you have a
> single sitewide bayes database.
>
> Again, I suggest looking at the bayes_sql_override_username option.
RE: rules better than bayes?
> -----Original Message-----
> From: Matt Kettler [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 09, 2006 2:05 PM
> To: Matthew Yette
> Cc: users@spamassassin.apache.org
> Subject: Re: rules better than bayes?
>
> [snip]
>
> I also strongly recommend enabling SA's URIBL support, and adding on a
> .cf file to get uribl.com's list added in (default SA only uses
> surbl.org lists)
>
> grep URIBL_BLACK /var/log/maillog |wc -l
> 2214

yes, it gets lonely at the top sometimes... ;)

BTW, we are looking for additional mirrors if anyone has rbldnsd and a few kb/s to spare... See www.uribl.com frontpage news for contact.

Actually for me, bayes and razor are constantly the two best hitters.. uribl black comes in a close 3rd

dallase
RE: rules better than bayes?
Ok I confused myself. Im sorry for being an idiot. I get it now. Everytime an email comes in it tries to access it as the user, since bayes is being feed to just the root account it doesn't see anything for the users in bayes. With the override I force it to use the root account for all emails coming in. Boy am I stupid. Thanks Robert -Original Message- From: Robert Bartlett [mailto:[EMAIL PROTECTED] Sent: Monday, January 09, 2006 1:52 PM To: users@spamassassin.apache.org Subject: RE: rules better than bayes? Sorry for the confusion, I do use a site wide bayes database, I thought the information I sent below was the site wide information the system uses to access the bayes database. Thanks Robert -Original Message- From: Matt Kettler [mailto:[EMAIL PROTECTED] Sent: Monday, January 09, 2006 1:47 PM To: Robert Bartlett Cc: users@spamassassin.apache.org Subject: Re: rules better than bayes? Robert Bartlett wrote: > This is what I have in my local.cf file: > > bayes_store_module Mail::SpamAssassin::BayesStore::SQL > bayes_sql_dsnDBI:mysql:**:localhost:3306 > bayes_sql_username > bayes_sql_password > > Obviously I hid the data that I didn't want to show with *. When I run > sa-learn it trains into the mysql database just fine, I assume SA > connects to it just fine because of that. That's all the database login information. That doesn't mean you have a single sitewide bayes database. Again, I suggest looking at the bayes_sql_override_username option.
RE: rules better than bayes?
Sorry for the confusion, I do use a site wide bayes database, I thought the information I sent below was the site wide information the system uses to access the bayes database. Thanks Robert -Original Message- From: Matt Kettler [mailto:[EMAIL PROTECTED] Sent: Monday, January 09, 2006 1:47 PM To: Robert Bartlett Cc: users@spamassassin.apache.org Subject: Re: rules better than bayes? Robert Bartlett wrote: > This is what I have in my local.cf file: > > bayes_store_module Mail::SpamAssassin::BayesStore::SQL > bayes_sql_dsnDBI:mysql:**:localhost:3306 > bayes_sql_username > bayes_sql_password > > Obviously I hid the data that I didn't want to show with *. When I run > sa-learn it trains into the mysql database just fine, I assume SA > connects to it just fine because of that. That's all the database login information. That doesn't mean you have a single sitewide bayes database. Again, I suggest looking at the bayes_sql_override_username option.
Re: rules better than bayes?
Robert Bartlett wrote:
> This is what I have in my local.cf file:
>
> bayes_store_module Mail::SpamAssassin::BayesStore::SQL
> bayes_sql_dsn DBI:mysql:**:localhost:3306
> bayes_sql_username
> bayes_sql_password
>
> Obviously I hid the data that I didn't want to show with *. When I run
> sa-learn it trains into the mysql database just fine, I assume SA
> connects to it just fine because of that.

That's all the database login information. That doesn't mean you have a single sitewide bayes database.

Again, I suggest looking at the bayes_sql_override_username option.
RE: rules better than bayes?
This is what I have in my local.cf file:

bayes_store_module Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn DBI:mysql:**:localhost:3306
bayes_sql_username
bayes_sql_password

Obviously I hid the data that I didn't want to show with *. When I run sa-learn it trains into the mysql database just fine, I assume SA connects to it just fine because of that.

Robert

-----Original Message-----
From: Matt Kettler [mailto:[EMAIL PROTECTED]
Sent: Monday, January 09, 2006 1:32 PM
To: Robert Bartlett
Cc: users@spamassassin.apache.org
Subject: Re: rules better than bayes?

Robert Bartlett wrote:
> Interesting, I did that just to see how mine were doing and the BAYES
> one returned 0? Does that mean bayes is not being used? I have been
> feeding emails to bayes and in debug mode it shows bayes being used. I
> am using bayes in a mysql. Just weird that its showing 0.

That sounds a lot like you're training bayes into mysql, but when mail comes in and gets scanned, it's either not using SQL, or it's not using the same table.

Usually this is a problem with username, where your training is occurring as "root" but your scanning is occurring as "nobody".

You might want to try using the bayes_sql_override_username option, to force a single site-wide bayes database, instead of having one per userid executing SA. (note: that's per userid EXECUTING SA.. not per email recipient.)
Re: rules better than bayes?
Robert Bartlett wrote:
> Interesting, I did that just to see how mine were doing and the BAYES
> one returned 0? Does that mean bayes is not being used? I have been
> feeding emails to bayes and in debug mode it shows bayes being used. I
> am using bayes in a mysql. Just weird that its showing 0.

That sounds a lot like you're training bayes into mysql, but when mail comes in and gets scanned, it's either not using SQL, or it's not using the same table.

Usually this is a problem with username, where your training is occurring as "root" but your scanning is occurring as "nobody".

You might want to try using the bayes_sql_override_username option, to force a single site-wide bayes database, instead of having one per userid executing SA. (note: that's per userid EXECUTING SA.. not per email recipient.)
RE: rules better than bayes?
wrote:
> I have since taken bayes out as I get WAY better results without it.

If it doesn't work for you, don't use it. The rules and network tests work pretty well. Especially if you add some SARE rules into the mix. However...

> The reason this happens to me is that I get to many spam mailings
> that poison the db and I end up with allot of spam that shows up as a
> Bayes_00.

That sounds like you have a poorly trained db. Did you do manual training or leave it up to the automatic training?

There is really no such thing as bayes poison. There are only words that appear frequently in spam and words that don't appear frequently in spam. If the spammers drop a bunch of random garbage into their spam, that's just more stuff for bayes to analyze. Most likely, it will be stuff that you wouldn't normally see in your ham mails anyway.

> I use all the Network tests but I get allot of spam that
> has not been added yet.

Network tests are good for spam runs that have been around for awhile. For newer spams, bayes and some of the more generic rules are where you will get most of your hits.

--
Bowie
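A toy illustration of the "no such thing as poison" point: count how often each token shows up in spam versus ham, and the random filler simply ends up looking spammy because it never appears in ham. This is a rough sketch under stated assumptions, not SpamAssassin's method. token_spam_share is an invented name, the input is assumed to be one token per line, and a plain ratio is much cruder than what SA's Bayes code does with token probabilities.

```sh
#!/bin/sh
# Sketch: per-token share of occurrences that came from spam.
# token_spam_share is a hypothetical helper; a plain ratio is NOT how
# SpamAssassin computes or combines Bayes probabilities.
token_spam_share() {
    spam_words=$1   # file with one token per line, taken from spam
    ham_words=$2    # file with one token per line, taken from ham
    awk '
        NF == 0 { next }                        # skip blank lines
        FILENAME == ARGV[1] { spam[$1]++; seen[$1] = 1; next }
                            { ham[$1]++;  seen[$1] = 1 }
        END {
            # 1.00 means "only ever seen in spam", 0.00 "only in ham".
            for (t in seen)
                printf "%s %.2f\n", t, spam[t] / (spam[t] + ham[t])
        }
    ' "$spam_words" "$ham_words" | sort
}
```

A made-up filler word that occurs only in the spam list comes out at 1.00, the same as an obvious spam word, which is the sense in which random garbage is "just more stuff for bayes to analyze".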
RE: rules better than bayes?
Interesting, I did that just to see how mine were doing and the BAYES one returned 0? Does that mean bayes is not being used? I have been feeding emails to bayes and in debug mode it shows bayes being used. I am using bayes in a mysql. Just weird that its showing 0. Robert -Original Message- From: Matt Kettler [mailto:[EMAIL PROTECTED] Sent: Monday, January 09, 2006 1:05 PM To: Matthew Yette Cc: users@spamassassin.apache.org Subject: Re: rules better than bayes? Matthew Yette wrote: > > Do you recommend running airmax as a supplementary ruleset with 3.1.0? I personally have no recommendations on it.. I've never run it. I personally like SARE's specific, evilnumbers, random and adult rulesets. Here's some quick grep's for hit-rates on some SARE rules I use (no declarations about FPs vs real spam hits, but none of these sets have caused me any problems so far) 70_sare_evilnum0.cf & 70_sare_evilnum1.cf: grep SARE_EN_ /var/log/maillog |wc -l 301 70_sare_specific.cf: grep SARE_SPEC_ /var/log/maillog |wc -l 60 70_sare_genlsubj0.cf: grep SARE_SUB /var/log/maillog |wc -l 44 70_sare_adult.cf: grep SARE_ADLT /var/log/maillog |wc -l 31 70_sare_uri0.cf: grep SARE_URI_ /var/log/maillog |wc -l 10 70_sare_random.cf: grep SARE_RAND_ /var/log/maillog |wc -l 1 I also strongly recommend enabling SA's URIBL support, and adding on a .cf file to get uribl.com's list added in (default SA only uses surbl.org lists) grep URIBL_BLACK /var/log/maillog |wc -l 2214 grep _SURBL /var/log/maillog |wc -l 2144 And of course I get great results from bayes: grep BAYES_99 /var/log/maillog |wc -l 2190 Ditto DCC and Razor2: grep RAZOR2_CHECK /var/log/maillog |wc -l 2114 grep DCC_CHECK /var/log/maillog |wc -l 1833
Re: rules better than bayes?
Matthew Yette wrote:
> Do you recommend running airmax as a supplementary ruleset with 3.1.0?

I personally have no recommendations on it.. I've never run it.

I personally like SARE's specific, evilnumbers, random and adult rulesets.

Here's some quick grep's for hit-rates on some SARE rules I use (no declarations about FPs vs real spam hits, but none of these sets have caused me any problems so far)

70_sare_evilnum0.cf & 70_sare_evilnum1.cf:
grep SARE_EN_ /var/log/maillog |wc -l
301

70_sare_specific.cf:
grep SARE_SPEC_ /var/log/maillog |wc -l
60

70_sare_genlsubj0.cf:
grep SARE_SUB /var/log/maillog |wc -l
44

70_sare_adult.cf:
grep SARE_ADLT /var/log/maillog |wc -l
31

70_sare_uri0.cf:
grep SARE_URI_ /var/log/maillog |wc -l
10

70_sare_random.cf:
grep SARE_RAND_ /var/log/maillog |wc -l
1

I also strongly recommend enabling SA's URIBL support, and adding on a .cf file to get uribl.com's list added in (default SA only uses surbl.org lists)

grep URIBL_BLACK /var/log/maillog |wc -l
2214
grep _SURBL /var/log/maillog |wc -l
2144

And of course I get great results from bayes:

grep BAYES_99 /var/log/maillog |wc -l
2190

Ditto DCC and Razor2:

grep RAZOR2_CHECK /var/log/maillog |wc -l
2114
grep DCC_CHECK /var/log/maillog |wc -l
1833
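The one-grep-per-rule approach above can be wrapped in a small loop that prints all the counts in one go, highest first. A minimal sketch, assuming your log lines contain the rule names the same way this maillog does; count_hits is a name invented here, and like the greps above it counts matching log lines, which is not necessarily the same as distinct messages.

```sh
#!/bin/sh
# Sketch: print "PATTERN count" for each rule-name pattern, highest first.
# count_hits is a hypothetical helper; pass whatever log your setup writes
# (the thread uses /var/log/maillog).
count_hits() {
    log=$1
    shift
    for rule in "$@"; do
        # grep -c counts matching lines, like `grep ... | wc -l`.
        # -F treats the pattern as a fixed string; -e protects patterns
        # that start with a dash.
        n=$(grep -c -F -e "$rule" "$log")
        printf '%s %s\n' "$rule" "$n"
    done | sort -k2,2nr
}
```

For the rules in this message that would be something like `count_hits /var/log/maillog URIBL_BLACK _SURBL BAYES_99 RAZOR2_CHECK DCC_CHECK`.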
Re: rules better than bayes?
Do you recommend running airmax as a supplementary ruleset with 3.1.0? This is just my humble opinion, but I don't know if that's a ruleset I would use in production for a multi-user server. A few of the rules use the "f-word" in the rule description line, so it would go out in a verbose report. The rules seem pretty random and unfocused, and scored based on gut instinct rather than rigorous testing.
Re: rules better than bayes?
Matthew Yette wrote: Correction, airmax.cf is not one single rule, it's one single FILE containing 211 rules. That's a significant difference, given that the stock spamassassin 3.1.0 has about 723 rules. Airmax has increased the number of rules in your system by 29.1% Do you recommend running airmax as a supplementary ruleset with 3.1.0? There's an additional downside to airmax. It has excerpts from *lots* of SARE rules. If a SARE rule gets updated, will it be updated in airmax.cf? YMMV, M -- Overflow on /dev/null; please empty the bit bucket. 14:50:01 up 1 day, 10:21, 5 users, load average: 0.04, 0.14, 0.11 Linux Registered User #241685 http://counter.li.org
Re: rules better than bayes?
Matt Kettler wrote:
> Realistically, I don't know why your hit rates are so low. They
> shouldn't be so bad that you're only detecting 2 or 3 out of every
> hundred.
>
> You could have some configuration problems, but I can't tell as you've
> not told us anything about your system, just the problems you have.
>
> Can you answer a few questions that might help us diagnose some of your
> problems:
>
> What version of SA are you running?
>
> Can you post an X-Spam-Status header for one of the false negatives?
>
> Is any of your spam hitting ALL_TRUSTED?
>
> What BAYES rules are these messages hitting before and after training?
>
> Do you use any network checks (URIBLs, RBLs, DCC, Razor, Pyzor, SPF)?

Also, a common error is to run SA as one user but train it as another.
Re: rules better than bayes?
I have since taken bayes out as I get WAY better results without it. The reason this happens to me is that I get to many spam mailings that poison the db and I end up with allot of spam that shows up as a Bayes_00. I use all the Network tests but I get allot of spam that has not been added yet. - Original Message - From: "jo3" <[EMAIL PROTECTED]> To: Sent: Monday, January 09, 2006 12:27 PM Subject: rules better than bayes? | Hi, | | This is an observation, please take it in the spirit in which it is | intended, it is not meant to be flame bait. | | After using spamassassin for six solid months, it seems to me that the | bayes process (sa-learn [--spam | --ham]) has only very limited success | in learning about new spam. Regardless of how many spams and hams are | submitted, the effectiveness never goes above the default level which, | in our case here, is somewhere around 2 out of 3 spams correctly | identified. By the same token, after adding the "third party" rule, | airmax.cf, the effectiveness went up to 99 out of 100 spams correctly | identified. | | So far, we have not had a single ham misidentified as spam with over one | million messages examined. | | Throughout the documentation, there seems to be a bias toward the bayes | filter rather than the rule system. Does anyone on the list have some | thoughts which would help to explain my observation as to why a single | rule would appear so successful while a million spams and hams would | have so little effect? | | Thank you, | Jo3 | |
Re: rules better than bayes?
On 1/9/06 2:43 PM, "Matt Kettler" <[EMAIL PROTECTED]> wrote: > jo3 wrote: >> Hi, >> >> This is an observation, please take it in the spirit in which it is >> intended, it is not meant to be flame bait. >> >> After using spamassassin for six solid months, it seems to me that the >> bayes process (sa-learn [--spam | --ham]) has only very limited success >> in learning about new spam. Regardless of how many spams and hams are >> submitted, the effectiveness never goes above the default level which, >> in our case here, is somewhere around 2 out of 3 spams correctly >> identified. By the same token, after adding the "third party" rule, >> airmax.cf, the effectiveness went up to 99 out of 100 spams correctly >> identified. > > > Realistically, I don't know why your hit rates are so low. They shouldn't be > so > bad that you're only detecting 2 or 3 out of every hundred. > > You could have some configuration problems, but I can't tell as you've not > told > us anything about your system, just the problems you have. > > Can you answer a few questions that might help us diagnose some of your > problems: > > What version of SA are you running? > > Can you post an X-Spam-Status header for one of the false negatives? > > Is any of your spam hitting ALL_TRUSTED? > > What BAYES rules are these messages hitting before and after training? > > Do you use any network checks (URIBLs, RBLs, DCC, Razor, Pyzor, SPF)? > > >> >> So far, we have not had a single ham misidentified as spam with over one >> million messages examined. >> >> Throughout the documentation, there seems to be a bias toward the bayes >> filter rather than the rule system. Does anyone on the list have some >> thoughts which would help to explain my observation as to why a single >> rule would appear so successful while a million spams and hams would >> have so little effect? >> > > Correction, airmax.cf is not one single rule, it's one single FILE containing > 211 rules. 
That's a significant difference, given that the stock spamassassin > 3.1.0 has about 723 rules. > > Airmax has increased the number of rules in your system by 29.1% > > > > > Do you recommend running airmax as a supplementary ruleset with 3.1.0? -- Matthew Yette Senior Engineer (NOC/Operations) M.A. Polce Consulting 315-838-1644
Re: rules better than bayes?
jo3 wrote:
> Hi,
>
> This is an observation, please take it in the spirit in which it is
> intended, it is not meant to be flame bait.
>
> After using spamassassin for six solid months, it seems to me that the
> bayes process (sa-learn [--spam | --ham]) has only very limited success
> in learning about new spam. Regardless of how many spams and hams are
> submitted, the effectiveness never goes above the default level which,
> in our case here, is somewhere around 2 out of 3 spams correctly
> identified. By the same token, after adding the "third party" rule,
> airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
> identified.

Realistically, I don't know why your hit rates are so low. They shouldn't be so bad that you're only detecting 2 or 3 out of every hundred.

You could have some configuration problems, but I can't tell as you've not told us anything about your system, just the problems you have.

Can you answer a few questions that might help us diagnose some of your problems:

What version of SA are you running?

Can you post an X-Spam-Status header for one of the false negatives?

Is any of your spam hitting ALL_TRUSTED?

What BAYES rules are these messages hitting before and after training?

Do you use any network checks (URIBLs, RBLs, DCC, Razor, Pyzor, SPF)?

> So far, we have not had a single ham misidentified as spam with over one
> million messages examined.
>
> Throughout the documentation, there seems to be a bias toward the bayes
> filter rather than the rule system. Does anyone on the list have some
> thoughts which would help to explain my observation as to why a single
> rule would appear so successful while a million spams and hams would
> have so little effect?

Correction, airmax.cf is not one single rule, it's one single FILE containing 211 rules. That's a significant difference, given that the stock spamassassin 3.1.0 has about 723 rules.

Airmax has increased the number of rules in your system by 29.1%