Re: rules better than bayes?

2006-01-18 Thread Jim Maul

Dallas L. Engelken wrote:

-Original Message-
From: Jim Maul [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 18, 2006 8:55 AM

To: users@spamassassin.apache.org
Subject: Re: rules better than bayes?

Dallas L. Engelken wrote:

-Original Message-
From: Jim Maul [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 11, 2006 1:49 PM
To: Chris Lear
Cc: users@spamassassin.apache.org
Subject: Re: rules better than bayes?

Chris Lear wrote:

* Jim Maul wrote (11/01/06 17:48):
[...]

i dont have any sa-stats.pl on my system, and i recall some confusion
with different scripts named the same thing so im not sure.  If you
can provide me with a location to obtain the sa-stats.pl script you
are talking about i'll try to give it a run when i get some time.  Im
running 2.64 through qmail-scanner if it matters.

Here's a version of sa-stats that works. I remember having a hard time
finding it, so hopefully this saves you some effort.
I've edited this line:
if (!defined $FILE) { $FILE='^spamd$' }  # regex
but it's overridable on the commandline anyway.

Chris


#!/usr/bin/perl

# file: sa-stats.pl
# date: 2005-07-27
# version: 0.9
# author: Dallas Engelken <[EMAIL PROTECTED]>
# desc: SA 3.x log parser

This appears to be for 3.x (the description above).  Will this work for
2.64 which im still running?  Is there a working version somewhere that
will?


Tell ya truth, I don't even know if it works on 2.64.   It was created
after 3.0 was released.  If your SA logs to maillog, just run it and
find out.  If you see data, it does... It doesn't take long to test this
perl script because it doesn't have any prereqs that wouldn't already be
on a SA installed box.

There is also http://www.rulesemporium.com/programs/sa-stats-1.0.txt for
3.1.x which supports per-domain and per-user stats... But that's just
FYI.

Dallas


This doesnt work for 2.64 by the way.  Its looking for result= and
scantime= and various other things which arent in my spamd log.  My log
entries look like:

Jan 18 09:51:30 external spamd[2783]: connection from localhost [127.0.0.1] at port 39076
Jan 18 09:51:30 external spamd[16806]: processing message <[EMAIL PROTECTED]> for [EMAIL PROTECTED]:512.
Jan 18 09:51:31 external spamd[16806]: clean message (-4.9/5.0) for [EMAIL PROTECTED]:512 in 1.7 seconds, 3128 bytes.

Thanks anyway for the help,

Jim



Should be fairly simple to modify the regex to work with 2.64 and then
adjust a couple values that don't apply. 


Is it impossible to upgrade your SA install?

Dallas



It's not impossible, but I'm in the process of setting up a new machine
running new versions of everything, so I'm avoiding upgrading anything
that isn't absolutely necessary.  The current machine is only running RH9,
so I'm starting fresh with a new server which will be running the newest
version of SA.  Hopefully I can still keep my old bayes DB successfully and
run the stats off of that when the time comes.  It's still a couple of weeks
away as I just can't find enough time to finish building this machine.


Thanks for everything

-Jim


RE: rules better than bayes?

2006-01-18 Thread Dallas L. Engelken
> -Original Message-
> From: Jim Maul [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, January 18, 2006 8:55 AM
> To: users@spamassassin.apache.org
> Subject: Re: rules better than bayes?
> 
> Dallas L. Engelken wrote:
> >> -Original Message-
> >> From: Jim Maul [mailto:[EMAIL PROTECTED]
> >> Sent: Wednesday, January 11, 2006 1:49 PM
> >> To: Chris Lear
> >> Cc: users@spamassassin.apache.org
> >> Subject: Re: rules better than bayes?
> >>
> >> Chris Lear wrote:
> >>> * Jim Maul wrote (11/01/06 17:48):
> >>> [...]
> >>>> i dont have any sa-stats.pl on my system, and i recall
> >> some confusion
> >>>> with different scripts named the same thing so im not
> >> sure.  If you
> >>>> can provide me with a location to obtain the sa-stats.pl
> >> script you
> >>>> are talking about i'll try to give it a run when i get
> >> some time.  Im
> >>>> running 2.64 through qmail-scanner if it matters.
> >>> Here's a version of sa-stats that works. I remember having
> >> a hard time
> >>> finding it, so hopefully this saves you some effort.
> >>> I've edited this line:
> >>> if (!defined $FILE) { $FILE='^spamd$' }  # regex but it's
> >> overridable
> >>> on the commandline anyway.
> >>>
> >>> Chris
> >>>
> >>>
> >>> #!/usr/bin/perl
> >>>
> >>> # file: sa-stats.pl
> >>> # date: 2005-07-27
> >>> # version: 0.9
> >>> # author: Dallas Engelken <[EMAIL PROTECTED]> # desc: SA 3.x
> >> log parser
> >> This appears to be for 3.x (the description above).  Will 
> this work 
> >> for
> >> 2.64 which im still running?  Is there a working version somewhere 
> >> that will?
> >>
> > 
> > Tell ya truth, I don't even know if it works on 2.64.   It 
> was created
> > after 3.0 was released.  If your SA logs to maillog, just 
> run it and 
> > find out.  If you see data, it does... It doesn't take long to test 
> > this perl script because it doesn't have any prereqs that wouldn't 
> > already be on a SA installed box.
> > 
> > There is also 
> http://www.rulesemporium.com/programs/sa-stats-1.0.txt 
> > for 3.1.x which supports per-domain and per-user stats... 
> But that's 
> > just FYI.
> > 
> > Dallas
> > 
> > 
> 
> This doesnt work for 2.64 by the way.  Its looking for 
> result= and scantime= and various other things which arent in 
> my spamd log.  My log entries look like:
> 
> Jan 18 09:51:30 external spamd[2783]: connection from localhost [127.0.0.1] at port 39076
> Jan 18 09:51:30 external spamd[16806]: processing message <[EMAIL PROTECTED]> for [EMAIL PROTECTED]:512.
> Jan 18 09:51:31 external spamd[16806]: clean message (-4.9/5.0) for [EMAIL PROTECTED]:512 in 1.7 seconds, 3128 bytes.
> 
> Thanks anyway for the help,
> 
> Jim
> 

Should be fairly simple to modify the regex to work with 2.64 and then
adjust a couple values that don't apply. 
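
As a rough illustration of the regex change suggested here, the sketch below parses the 2.64-style result line quoted above. It is written in Python rather than as a patch to sa-stats.pl, and it is not taken from that script. The "identified spam" alternative is an assumption (only "clean message" appears in this thread), and the function and field names are made up for the example.

```python
import re

# Pattern for the 2.64-style spamd result line quoted in this thread.
# Assumption: spam verdicts use the same layout with "identified spam"
# in place of "clean message"; only the clean form is shown in the thread.
RESULT_RE = re.compile(
    r"spamd\[(?P<pid>\d+)\]: (?P<verdict>clean message|identified spam) "
    r"\((?P<score>-?\d+(?:\.\d+)?)/(?P<required>-?\d+(?:\.\d+)?)\) "
    r"for (?P<user>.+?):(?P<uid>\d+) "
    r"in (?P<scantime>\d+(?:\.\d+)?) seconds, (?P<size>\d+) bytes"
)


def parse_result_line(line):
    """Return the fields of a spamd result line, or None if the line does not match."""
    m = RESULT_RE.search(line)
    if m is None:
        return None
    return {
        "is_spam": m.group("verdict") == "identified spam",
        "score": float(m.group("score")),
        "required": float(m.group("required")),
        "user": m.group("user"),
        "scantime": float(m.group("scantime")),
        "size": int(m.group("size")),
    }
```

Lines such as "connection from" and "processing message" simply return None, so a stats loop can skip them. Values that 2.64 does not log, such as the list of rules hit, would have to be dropped from any report built on this.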

Is it impossible to upgrade your SA install?

Dallas




Re: rules better than bayes?

2006-01-18 Thread Jim Maul

Dallas L. Engelken wrote:

-Original Message-
From: Jim Maul [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 11, 2006 1:49 PM

To: Chris Lear
Cc: users@spamassassin.apache.org
Subject: Re: rules better than bayes?

Chris Lear wrote:

* Jim Maul wrote (11/01/06 17:48):
[...]
i dont have any sa-stats.pl on my system, and i recall some confusion
with different scripts named the same thing so im not sure.  If you
can provide me with a location to obtain the sa-stats.pl script you
are talking about i'll try to give it a run when i get some time.  Im
running 2.64 through qmail-scanner if it matters.

Here's a version of sa-stats that works. I remember having a hard time
finding it, so hopefully this saves you some effort.
I've edited this line:
if (!defined $FILE) { $FILE='^spamd$' }  # regex
but it's overridable on the commandline anyway.

Chris


#!/usr/bin/perl

# file: sa-stats.pl
# date: 2005-07-27
# version: 0.9
# author: Dallas Engelken <[EMAIL PROTECTED]>
# desc: SA 3.x log parser

This appears to be for 3.x (the description above).  Will this work for
2.64 which im still running?  Is there a working version somewhere that
will?




Tell ya truth, I don't even know if it works on 2.64.   It was created
after 3.0 was released.  If your SA logs to maillog, just run it and
find out.  If you see data, it does... It doesn't take long to test this
perl script because it doesn't have any prereqs that wouldn't already be
on a SA installed box.

There is also http://www.rulesemporium.com/programs/sa-stats-1.0.txt for
3.1.x which supports per-domain and per-user stats... But that's just
FYI.

Dallas




This doesnt work for 2.64 by the way.  Its looking for result= and 
scantime= and various other things which arent in my spamd log.  My log 
entries look like:


Jan 18 09:51:30 external spamd[2783]: connection from localhost 
[127.0.0.1] at port 39076
Jan 18 09:51:30 external spamd[16806]: processing message 
<[EMAIL PROTECTED]> for 
[EMAIL PROTECTED]:512.
Jan 18 09:51:31 external spamd[16806]: clean message (-4.9/5.0) for 
[EMAIL PROTECTED]:512 in 1.7 seconds, 3128 bytes.


Thanks anyway for the help,

Jim


RE: rules better than bayes?

2006-01-11 Thread Dallas L. Engelken
> -Original Message-
> From: Jim Maul [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, January 11, 2006 1:49 PM
> To: Chris Lear
> Cc: users@spamassassin.apache.org
> Subject: Re: rules better than bayes?
> 
> Chris Lear wrote:
> > * Jim Maul wrote (11/01/06 17:48):
> > [...]
> >> i dont have any sa-stats.pl on my system, and i recall 
> some confusion 
> >> with different scripts named the same thing so im not 
> sure.  If you 
> >> can provide me with a location to obtain the sa-stats.pl 
> script you 
> >> are talking about i'll try to give it a run when i get 
> some time.  Im 
> >> running 2.64 through qmail-scanner if it matters.
> > 
> > Here's a version of sa-stats that works. I remember having 
> a hard time 
> > finding it, so hopefully this saves you some effort.
> > I've edited this line:
> > if (!defined $FILE) { $FILE='^spamd$' }  # regex but it's 
> overridable 
> > on the commandline anyway.
> > 
> > Chris
> > 
> > 
> > #!/usr/bin/perl
> > 
> > # file: sa-stats.pl
> > # date: 2005-07-27
> > # version: 0.9
> > # author: Dallas Engelken <[EMAIL PROTECTED]> # desc: SA 3.x 
> log parser
> > 
> 
> This appears to be for 3.x (the description above).  Will 
> this work for
> 2.64 which im still running?  Is there a working version 
> somewhere that will?
> 

Tell ya truth, I don't even know if it works on 2.64.   It was created
after 3.0 was released.  If your SA logs to maillog, just run it and
find out.  If you see data, it does... It doesn't take long to test this
perl script because it doesn't have any prereqs that wouldn't already be
on a SA installed box.

There is also http://www.rulesemporium.com/programs/sa-stats-1.0.txt for
3.1.x which supports per-domain and per-user stats... But that's just
FYI.

Dallas


Re: rules better than bayes?

2006-01-11 Thread Jim Maul

Chris Lear wrote:

* Jim Maul wrote (11/01/06 17:48):
[...]
i dont have any sa-stats.pl on my system, and i recall some confusion 
with different scripts named the same thing so im not sure.  If you can 
provide me with a location to obtain the sa-stats.pl script you are 
talking about i'll try to give it a run when i get some time.  Im 
running 2.64 through qmail-scanner if it matters.


Here's a version of sa-stats that works. I remember having a hard time
finding it, so hopefully this saves you some effort.
I've edited this line:
if (!defined $FILE) { $FILE='^spamd$' }  # regex
but it's overridable on the commandline anyway.

Chris


#!/usr/bin/perl

# file: sa-stats.pl
# date: 2005-07-27
# version: 0.9
# author: Dallas Engelken <[EMAIL PROTECTED]>
# desc: SA 3.x log parser



This appears to be for 3.x (the description above).  Will this work for 
2.64 which im still running?  Is there a working version somewhere that 
will?


Thanks,

-Jim


Re: rules better than bayes?

2006-01-11 Thread Jim Maul

jdow wrote:

From: "Jim Maul" <[EMAIL PROTECTED]>


Chris Santerre wrote:


 > -Original Message-
 > From: jo3 [mailto:[EMAIL PROTECTED]
 > Sent: Monday, January 09, 2006 2:28 PM
 > To: users@spamassassin.apache.org
 > Subject: rules better than bayes?
 >
 >
 > Hi,
 >
 > This is an observation, please take it in the spirit in which it is
 > intended, it is not meant to be flame bait.
 >
 > After using spamassassin for six solid months, it seems to me
 > that the
 > bayes process (sa-learn [--spam | --ham]) has only very
 > limited success
 > in learning about new spam.  Regardless of how many spams and
 > hams are
 > submitted, the effectiveness never goes above the default
 > level which,
 > in our case here, is somewhere around 2 out of 3 spams correctly
 > identified.  By the same token, after adding the "third party" rule,
 > airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
 > identified.

I have long said that IMHO, I do not think bayes is worth it. Left 
unattended, it isn't as good. A simple rule can take out a lot of 
spam. Some may say rule writing is more complicated than training 
bayes. Maybe. Not so much the rule writing, but the figuring out what 
to look for and testing for FPs.


I do not run Bayes for our company. Obviously I'm partial to 
URIBL.com and SARE rules ;)  I get about 98% of spam caught, and 
little FPs.


This is going to sound like tooting our own horn, but so be it. 
Before SARE, Bayes was cool. After SARE, I see no need.





I always feel i have to point out the flip side to this just to offer 
another opinion.  While i certainly dont have a NEED for bayes at our 
facility, i do run it, complete with autolearn.  We have very low 
volume (5k msgs/day) but it works so well i rarely ever have to think 
about it.   For us, 96% of the time bayes alone is enough to say 
whether a message is ham/spam.  Add all the other tests on top of this 
(uribl, razor, a few sare) and theres easily a 20 point difference 
between ham and spam.


Jim, can you back that up with a run of the SARE version of sa_stats.pl?
I'd love to see your record with that setup for the highest and lowest
ranking BAYES scores.

{^_^}





i dont have any sa-stats.pl on my system, and i recall some confusion 
with different scripts named the same thing so im not sure.  If you can 
provide me with a location to obtain the sa-stats.pl script you are 
talking about i'll try to give it a run when i get some time.  Im 
running 2.64 through qmail-scanner if it matters.


-Jim


Re: rules better than bayes?

2006-01-10 Thread jdow

From: "Matt Kettler" <[EMAIL PROTECTED]>


At 10:50 AM 1/10/2006, Chris Santerre wrote:


I have long said that IMHO, I do not think bayes is worth it. Left 
unattended, it isn't as good. A simple rule can take out a lot of spam. 
Some may say rule writing is more complicated than training bayes. Maybe. 
Not so much the rule writing, but the figuring out what to look for and 
testing for FPs.



Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of 
hits. (And has 98.91% of URIBL's total hits) I find it completely 
indispensable.


It's number 1 here on scoring spam: 83.22% of spam for 0.05% of ham, with
"can't remember the last ham scoring on 99 that hit the spam folder."
BAYES_99 has a score of 5 here because it does, all alone, tag spam that
no other rule hits. XBL is the best BL here at the moment, hitting 55.50%
of spam for 0.04% of ham.

I rarely train manually, except at initial setup where I feed it a good 
base learning. (the autolearner can sometimes go awry if you don't train 
some mail manually before letting it go.)


I manually learn, particularly on spam not marked as spam that has a
low BAYES score and some "meat in it." (I don't bother with content
free spam. Those very quickly score higher due to BL hits that pop up
like magic.)

{^_^}



Re: rules better than bayes?

2006-01-10 Thread jdow

From: "Jim Maul" <[EMAIL PROTECTED]>


Chris Santerre wrote:


 > -Original Message-
 > From: jo3 [mailto:[EMAIL PROTECTED]
 > Sent: Monday, January 09, 2006 2:28 PM
 > To: users@spamassassin.apache.org
 > Subject: rules better than bayes?
 >
 >
 > Hi,
 >
 > This is an observation, please take it in the spirit in which it is
 > intended, it is not meant to be flame bait.
 >
 > After using spamassassin for six solid months, it seems to me
 > that the
 > bayes process (sa-learn [--spam | --ham]) has only very
 > limited success
 > in learning about new spam.  Regardless of how many spams and
 > hams are
 > submitted, the effectiveness never goes above the default
 > level which,
 > in our case here, is somewhere around 2 out of 3 spams correctly
 > identified.  By the same token, after adding the "third party" rule,
 > airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
 > identified.

I have long said that IMHO, I do not think bayes is worth it. Left unattended, 
it isn't as good. A simple rule can take out a lot of spam. Some may say rule 
writing is more complicated than training bayes. Maybe. Not so much the rule 
writing, but the figuring out what to look for and testing for FPs.


I do not run Bayes for our company. Obviously I'm partial to URIBL.com and SARE 
rules ;)  I get about 98% of spam caught, and little FPs.


This is going to sound like tooting our own horn, but so be it. Before SARE, 
Bayes was cool. After SARE, I see no need.





I always feel i have to point out the flip side to this just to offer 
another opinion.  While i certainly dont have a NEED for bayes at our 
facility, i do run it, complete with autolearn.  We have very low volume 
(5k msgs/day) but it works so well i rarely ever have to think about it. 
  For us, 96% of the time bayes alone is enough to say whether a 
message is ham/spam.  Add all the other tests on top of this (uribl, 
razor, a few sare) and theres easily a 20 point difference between ham 
and spam.


Jim, can you back that up with a run of the SARE version of sa_stats.pl?
I'd love to see your record with that setup for the highest and lowest
ranking BAYES scores.

{^_^}



Re: rules better than bayes?

2006-01-10 Thread jdow

From: "Chris Santerre" <[EMAIL PROTECTED]>

-Original Message-
From: jo3 [mailto:[EMAIL PROTECTED]

Hi,

This is an observation, please take it in the spirit in which it is 
intended, it is not meant to be flame bait.


After using spamassassin for six solid months, it seems to me that the
bayes process (sa-learn [--spam | --ham]) has only very limited success
in learning about new spam.  Regardless of how many spams and hams are
submitted, the effectiveness never goes above the default level which,
in our case here, is somewhere around 2 out of 3 spams correctly
identified.  By the same token, after adding the "third party" rule,
airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
identified.


I have long said that IMHO, I do not think bayes is worth it. Left
unattended, it isn't as good. A simple rule can take out a lot of spam. Some
may say rule writing is more complicated than training bayes. Maybe. Not so
much the rule writing, but the figuring out what to look for and testing for
FPs. 


I do not run Bayes for our company. Obviously I'm partial to URIBL.com and
SARE rules ;)  I get about 98% of spam caught, and little FPs. 


This is going to sound like tooting our own horn, but so be it. Before SARE,
Bayes was cool. After SARE, I see no need. 


Autolearning Bayes is not really very good based on what people here
seem to say. I do note that I raised my BAYES_99 score to 5. If BAYES_99
hits the odds that the message is spam are so high that it's silly to
give BAYES_99 a low score, theoretical nonsense notwithstanding.

If you apply the wrong statistical theory with the wrong conceptual
criteria the math or theory may be good but the results are trash. For
an existing spam database the rule setup that exists is probably quite
good. If 99 hits then other rules probably hit as well. This leads to
artificially lowering the 99 score. Then when a new technique comes along
that Bayes can recognize but nothing else does, the message floats on
through. At least on this system 99 misses once in 2000 to 1
times. Most of those times other very light whitelisting rules let the
messages come through. Probably the right score for more general use
would be 4.95 or something such that if any other rule hits it's dinged
as spam. It depends on your spam tolerance compared to your tolerance
for sorting spam by score and looking at the few that are marginal.
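
The arithmetic behind the 5 versus 4.95 suggestion can be sketched in a few lines of Python. This is only an illustration: the 5.0 threshold is taken from the "/5.0" in the log lines elsewhere in this thread, the rule names other than BAYES_99 are hypothetical, and a message is assumed to be tagged when its summed score reaches the threshold.

```python
REQUIRED = 5.0  # assumed spam threshold, matching the "/5.0" seen in the thread's log lines


def is_tagged(rule_scores, required=REQUIRED):
    """A message is tagged as spam when its summed rule scores reach the threshold."""
    return sum(rule_scores.values()) >= required


# BAYES_99 at 5.0 tags a message with no help from any other rule.
print(is_tagged({"BAYES_99": 5.0}))
# At 4.95 it falls just short when it is the only hit ...
print(is_tagged({"BAYES_99": 4.95}))
# ... but a second hit (a hypothetical 0.1-point rule) tips it over.
print(is_tagged({"BAYES_99": 4.95, "SOME_OTHER_RULE": 0.1}))
# A light whitelisting rule (hypothetical, -0.5) pulls the 5.0 case back under.
print(is_tagged({"BAYES_99": 5.0, "LIGHT_WHITELIST": -0.5}))
```

The four calls print True, False, True and False. The last case corresponds to the "very light whitelisting rules" mentioned above that let the occasional BAYES_99 message through.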

Anyway, making that ONE change made the already good results I was getting
with SARE and BAYES combined quite a bit better. Missed spam went down
almost a factor of 10 and tagged ham went up by about 1 in 10,000 or
less. (I can't remember the last time I got a ham marked as spam on
the sole basis of BAYES_99 with a score of 5 that I had to fetch out of
the spam folder.) I take this as a proof of concept that penalizing a
rule for being too good is ridiculous on its face, statistical theories
notwithstanding. I maintain this is a positive indication that either
the criteria, the chosen statistical approach, or both are wrong.

It might be entertaining to setup "stock" BAYES on your system, Chris,
with all BAYES scores being very very low, 0.01 or something. Then run
the SARE version of sa_stats.pl to see what the "goodness" of each
BAYES level really is. From that you can guesstimate some scores that
might improve your system. I'd be really interested to see what the
autolearn BAYES really can perform like when it's used in your sort
of environment. I know for my environment it's silly to use it due to
the automated mis-learning on marginal messages. (Either it learns
wrong or not at all on the most critical portions of the email load,
the marginal messages.)
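
For readers who have not seen this kind of report, the "goodness" of a rule here is the share of spam and the share of ham that hit it. The Python sketch below shows that tally under simple assumptions. It is not the sa-stats.pl algorithm, and the input format (a verdict plus a list of rule names per message) is invented for the example.

```python
from collections import Counter


def rule_hit_rates(messages):
    """Tally how often each rule hits spam and ham.

    messages: iterable of (is_spam, rules_hit) pairs.
    Returns {rule: (percent_of_spam_hit, percent_of_ham_hit)}.
    """
    totals = Counter()
    hits = {True: Counter(), False: Counter()}
    for is_spam, rules in messages:
        totals[is_spam] += 1
        hits[is_spam].update(set(rules))  # count each rule once per message

    rates = {}
    for rule in set(hits[True]) | set(hits[False]):
        pct_spam = 100.0 * hits[True][rule] / totals[True] if totals[True] else 0.0
        pct_ham = 100.0 * hits[False][rule] / totals[False] if totals[False] else 0.0
        rates[rule] = (pct_spam, pct_ham)
    return rates
```

With every BAYES rule scored near zero, the verdicts come from the other rules alone, so the percentages for each BAYES level show what Bayes would add on its own.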

{^_^}   Joanne steps down off her soapbox yet again.



Re: rules better than bayes?

2006-01-10 Thread mouss
Chris Santerre a écrit :

> 
> I have long said that IMHO, I do not think bayes is worth it. Left
> unattended, it isn't as good. A simple rule can take out a lot of spam. Some
> may say rule writing is more complicated than training bayes. Maybe. Not so
> much the rule writing, but the figuring out what to look for and testing for
> FPs. 
> 
> I do not run Bayes for our company. Obviously I'm partial to URIBL.com and
> SARE rules ;)  I get about 98% of spam caught, and little FPs. 
> 
> This is going to sound like tooting our own horn, but so be it. Before SARE,
> Bayes was cool. After SARE, I see no need. 

I think SARE and bayes are complementary:

- sare will detect new spam once ninjas have found the corresponding rules.

- bayes will detect new spam if it resembles previous spam.

That said, I don't use SA/Bayes (I use dspam on a per-user basis, while
SA is site-wide).


Re: rules better than bayes? Hamtrap learning.

2006-01-10 Thread Matt Kettler
Andrew Donkin wrote:
> Matt Kettler <[EMAIL PROTECTED]> writes:
> 
> 
>>if [ -f /var/spool/mail/spamtrap ]; then
>> echo learning spam mailbox - spamtrap
>> mv /var/spool/mail/spamtrap .
>> /usr/bin/sa-learn --spam --mbox spamtrap
>> rm spam/spamtrap.alearn5.gz
>> mv spam/spamtrap.alearn4.gz spam/spamtrap.alearn5.gz
>> mv spam/spamtrap.alearn3.gz spam/spamtrap.alearn4.gz
>> mv spam/spamtrap.alearn2.gz spam/spamtrap.alearn3.gz
>> gzip spam/spamtrap.alearn1
>> mv spam/spamtrap.alearn1.gz spam/spamtrap.alearn2.gz
>>
>> mv spamtrap spam/spamtrap.alearn1
>>fi
> 
> 
> I'll put my Captain Pedantic hat on and point out that if your MTA is
> writing to /var/spool/mail/spamtrap at the time that you learn it,
> which is quite possible if /var/spool/training/ is on the same
> filesystem as /var/spool/mail/, sa-learn may end up chewing on a
> half-finished message.

Actually, they're on separate filesystems.

But you're right, I forgot that mv can "move" a file within a filesystem and
another process can still write to it with an old file descriptor.
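
One way around the race described here is to take a lock on the mailbox, copy it, and empty it in place instead of moving it. The Python sketch below assumes the MTA honours fcntl/lockf-style locks on the mailbox. If it uses dot-locks or flock() instead, this gives no protection. The function name and paths are made up for the example.

```python
import fcntl
import os
import shutil


def drain_mailbox(mbox_path, dest_path):
    """Copy mbox_path to dest_path and empty it in place, under an exclusive lock.

    Returns the number of bytes copied.
    """
    with open(mbox_path, "r+b") as src:
        # Blocks until any writer holding a lockf/fcntl lock has finished.
        fcntl.lockf(src, fcntl.LOCK_EX)
        try:
            with open(dest_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            # Truncate rather than rename: the inode stays the same, so a
            # writer that already holds a descriptor still appends to the
            # live mailbox, not to the copy being learned.
            src.truncate(0)
        finally:
            fcntl.lockf(src, fcntl.LOCK_UN)
    return os.path.getsize(dest_path)
```

The copy at dest_path can then be handed to sa-learn --spam --mbox as in the script above, since nothing else writes to it.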


Re: rules better than bayes? Certainly better than mine.

2006-01-10 Thread Jim Maul

Andrew Donkin wrote:

Jim Maul <[EMAIL PROTECTED]> writes:


NOTE: to operate in this fashion i believe it is imperative that you
change the autolearn thresholds.  The defaults are dangerous! (at least
in 2.64 which i still run).  I have mine set as such:

bayes_auto_learn_threshold_nonspam -0.1
bayes_auto_learn_threshold_spam 10.0


Matt agreed.  Aaron was going to change to something similar.

Before reading this thread, I did the opposite.  I changed my nonspam
threshold from -0.2 to the default 0.1 because I thought
(mistakenly perhaps) that the Bayes database's spam:ham ratio was far
too high.  Incoming mail is about 3:1, but the Bayes database was more
like 20:1.  See:

         3  bayes db version
   1491805  nspam
     75795  nham
   1081029  ntokens
1136779207  oldest atime
1136925099  newest atime
1136925026  last journal sync atime
1136838312  last expiry atime
     43200  last expire atime delta
     25087  last expire reduction count
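
The "more like 20:1" figure can be checked directly from the dump. The Python sketch below pulls nspam and nham out of text shaped like the dumps quoted in this thread. It assumes the count is the first number on each line, which holds for both layouts shown here but not necessarily for other sa-learn output formats.

```python
import re


def spam_ham_ratio(dump_text):
    """Return nspam / nham from a dump laid out like the ones in this thread.

    Assumption: the message count is the first number on the nspam/nham lines.
    """
    counts = {}
    for line in dump_text.splitlines():
        m = re.match(r"\s*(\d+)\b.*\b(nspam|nham)\s*$", line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return counts["nspam"] / counts["nham"]
```

For the dump above this gives roughly 19.7 spam per ham learned, against an incoming mix of about 3:1.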



I started autolearning with the defaults and then quickly changed my 
thresholds as mentioned before.  Our server here doesnt see a lot of 
spam (hell it doesnt even see a lot of mail total) so our ratios are 
obviously going to be different.  Mine shows:


         2  0  non-token data: bayes db version
     26378  0  non-token data: nspam
     54313  0  non-token data: nham
    147479  0  non-token data: ntokens
1134172970  0  non-token data: oldest atime
1136925620  0  non-token data: newest atime
1136925554  0  non-token data: last journal sync atime
1136232703  0  non-token data: last expiry atime
   2060396  0  non-token data: last expire atime delta
     34608  0  non-token data: last expire reduction count





In particular, a message from James Keating of this list received this
report from Bayes:

X-Spam-Bayes-ham: 0.011-8--5h-0s--19d--SpamAssassin, 
	0.026-3--2h-0s--19d--autolearn, 0.029-203--156h-39s--19d--5.0, 
	0.031-7--5h-1s--19d--spamassassin, 0.050-4162--3796h-1707s--0d--i'm
X-Spam-Bayes-spam: 1.000-149--0h-6920s--1d--HX-Accept-Language:en-us, 
	1.000-27--0h-1229s--18d--H*UA:Thunderbird, 
	1.000-24--0h-1083s--18d--H*u:Thunderbird, 
	1.000-16--0h-718s--0d--H*RU:sk:cpe-24-, 
	1.000-13--0h-594s--11d--H*r:sk:cpe-24-


...implying that "User-agent: Thunderbird" was in a thousand spams but
no hams.  And that "Accept-Language:en-us" was in 6900 spams and no
hams.  !
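
To look for more tokens like these, the report headers can be split into fields. The Python sketch below assumes each entry has the shape probability-N--HAMh-SPAMs--AGEd--token, as in the two headers quoted above, and that entries are separated by a comma plus whitespace. The meaning of the second number is not stated in this thread, so the sketch matches it and ignores it.

```python
import re

# One entry, e.g. "1.000-27--0h-1229s--18d--H*UA:Thunderbird".
# The token comes last because tokens themselves may contain hyphens.
ENTRY_RE = re.compile(
    r"^(?P<prob>\d\.\d+)-(?P<unlabeled>\d+)--(?P<ham>\d+)h-(?P<spam>\d+)s"
    r"--(?P<age_days>\d+)d--(?P<token>.+)$"
)


def parse_bayes_report(value):
    """Split an X-Spam-Bayes-ham/-spam header value into a list of dicts."""
    entries = []
    for raw in re.split(r",\s+", value.strip()):
        m = ENTRY_RE.match(raw)
        if m is None:
            continue  # skip anything that does not look like an entry
        entries.append({
            "prob": float(m.group("prob")),
            "ham": int(m.group("ham")),
            "spam": int(m.group("spam")),
            "age_days": int(m.group("age_days")),
            "token": m.group("token"),
        })
    return entries


def spam_only_tokens(entries, min_spam=500):
    """Tokens seen in at least min_spam spams and in no ham at all."""
    return [e["token"] for e in entries if e["ham"] == 0 and e["spam"] >= min_spam]
```

A generic header token such as a common User-Agent turning up in this list is the symptom described above: the database has learned it from spam only.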

So, I'm thinking that my Bayes is hosed again.  Will a hamtrap help me
here?



Im not sure, i've never seen this report before and i certainly dont 
have the same message to compare what it scored on my system here.  Have 
you noticed bayes misclassifying messages because of this, or are you 
speaking theoretically?  A huge ratio alone does not imply a problem, 
its the results that matter.



I'm CCing you, Jim, because my last two posts to the list vanished
without a trace.



Not a problem.  Just not sure how much help i am in this situation...

-Jim


Re: rules better than bayes?

2006-01-10 Thread William Stearns

Good evening, Justin, all,

On Tue, 10 Jan 2006, Justin Mason wrote:



Matt Kettler writes:

At 10:50 AM 1/10/2006, Chris Santerre wrote:


I have long said that IMHO, I do not think bayes is worth it. Left
unattended, it isn't as good. A simple rule can take out a lot of spam.
Some may say rule writing is more complicated than training bayes. Maybe.
Not so much the rule writing, but the figuring out what to look for and
testing for FPs.


Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of
hits. (And has 98.91% of URIBL's total hits) I find it completely
indispensable.


The thing is, Bayes is a tool for personalization -- and as such, its
effectiveness varies widely depending on what *you* do with it.

For what it's worth, I've *never* trained my current Bayes DB, and have
been running with it for about 6 months I think.  I get BAYES_00 on most
ham, and BAYES_99 on most spam.

But the 4 letters that matter with Bayes are:

   YMMV


	Isn't that an OTCBB Ticker symbol?  I heard they're about to go 
through the _roof_!!

/me ducks...
Cheers,
- Bill

---
"We don't want an election without a paper trail...all three
owners of the companies who make these machines are donors to the Bush
administration.  Is this not corruption?"
-- Gore Vidal
(Courtesy of http://www.laweekly.com/ink/03/52/features-cooper.php)
--
William Stearns ([EMAIL PROTECTED]).  Mason, Buildkernel, freedups, p0f,
rsync-backup, ssh-keyinstall, dns-check, more at:   http://www.stearns.org
--


Re: rules better than bayes?

2006-01-10 Thread Justin Mason


Matt Kettler writes:
> At 10:50 AM 1/10/2006, Chris Santerre wrote:
> 
> >I have long said that IMHO, I do not think bayes is worth it. Left 
> >unattended, it isn't as good. A simple rule can take out a lot of spam. 
> >Some may say rule writing is more complicated than training bayes. Maybe. 
> >Not so much the rule writing, but the figuring out what to look for and 
> >testing for FPs.
> 
> Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of 
> hits. (And has 98.91% of URIBL's total hits) I find it completely 
> indispensable.

The thing is, Bayes is a tool for personalization -- and as such, its
effectiveness varies widely depending on what *you* do with it.

For what it's worth, I've *never* trained my current Bayes DB, and have
been running with it for about 6 months I think.  I get BAYES_00 on most
ham, and BAYES_99 on most spam.

But the 4 letters that matter with Bayes are:

YMMV

;)

- --j.



Re: rules better than bayes?

2006-01-10 Thread Marc Perkel
Bayes wouldn't be much good if not for the rules, which create a basic
compass as to what is spam and not spam. The rules are in large part what
makes bayes work.


RE: rules better than bayes?

2006-01-10 Thread Aaron Grewell
 

> Im not matt, but running a very similar setup which works 
> very well so i thought i would comment also.  Im running a 
> single sitewide database. 
> All mail is processed under my spamd user.

OK, that's basically what I'm doing too.

> 
> I rarely train manually as well.  

> NOTE: to operate in this fashion i believe it is imperative that you 
> change the autolearn thresholds.  The defaults are dangerous! (at least
> in 2.64 which i still run).  I have mine set as such:
> 
> bayes_auto_learn_threshold_nonspam -0.1
> bayes_auto_learn_threshold_spam 10.0
> 

OK, Matt said something similar about the thresholds.  Mine are default
so that may be part of the issue.  Thanks for the feedback!

-Aaron


RE: rules better than bayes?

2006-01-10 Thread Aaron Grewell
 
> Erm, that really shouldn't affect the bayes autolearner.. 
> perhaps you are
> thinking of the AWL? I don't run the AWL for this very reason.
>

Oh yeah.  I was thinking of the AWL.  NM.
 

> The problem is this requires some customization. This can't 
> be a default setup
> of SA as the "catch phrases" vary from place to place, and if 
> there was a
> default set of them spammers would be sure to always include 
> them, making them
> pointless. You'd effectively have the same thing as the 
> current default, by
> avoiding spam rules and existing bayes tokens they can get a 
> message learned.
> 

That all makes sense.  I'll give it a shot.  Thanks!

-Aaron


Re: rules better than bayes?

2006-01-10 Thread Kelson Vibber

Aaron Grewell wrote:

The trouble I had with the autolearner was that some spammers would send
innocuous mail through to raise their scores until Bayes decided they
were ok, then start spamming.  That was a couple of versions back, does
that sort of thing no longer work?


Are you sure this is Bayes-related?  Bayes looks at the entire message, 
not just the sender.  All I'd expect this tactic to do would be to make 
future innocuous mail look more innocuous -- it shouldn't have any 
significant impact on spammy mail from the same source since the content 
will be different.


--
Kelson Vibber
SpeedGate Communications, 


Re: rules better than bayes?

2006-01-10 Thread Jim Maul

Aaron Grewell wrote:

Hi Matt, I'm interested in how your setup compares to mine.  I also find
Bayes very useful, but I haven't gotten it to work as well as what
you've described.

Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of
hits. (And has 98.91% of URIBL's total hits) I find it completely
indispensable.




Are you using a single site-wide database, or is this a per-user setup?



Im not matt, but running a very similar setup which works very well so i 
thought i would comment also.  Im running a single sitewide database. 
All mail is processed under my spamd user.



>> I rarely train manually, except at initial setup where I feed it a good
>> base learning. (the autolearner can sometimes go awry if you don't train
>> some mail manually before letting it go.)
>
> The trouble I had with the autolearner was that some spammers would send
> innocuous mail through to raise their scores until Bayes decided they
> were ok, then start spamming.  That was a couple of versions back, does
> that sort of thing no longer work?

I rarely train manually as well.  The only ones I train (and it's only
because there is nothing else to train) are spam which are correctly
identified as such but have autolearn=no because they did not meet the
autolearn criteria.  These almost always have BAYES_99 and a score of 20
or so but most likely did not have enough header points to be autolearned.

I didn't even start by training my database manually.  I started from
scratch and let the autolearner do its thing.  I have never had to
correct what it did because it was always right.  The poison that
spammers like to include in messages doesn't appear to have any effect on
the overall outcome of the bayes score.  I don't really know why this is,
it just works.


NOTE: to operate in this fashion I believe it is imperative that you
change the autolearn thresholds.  The defaults are dangerous! (at least
in 2.64 which I still run).  I have mine set as such:


bayes_auto_learn_threshold_nonspam -0.1
bayes_auto_learn_threshold_spam 10.0

To this date (I've been running it for over 2 years) I have yet to see the
autolearner misclassify.  Most bayes hits are at the far extremes (BAYES_99
and BAYES_00) with only a few in the 80-90 range.
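
One way to keep an eye on what the autolearner is doing is to tally the
autolearn= values in a saved mailbox. The following is only a sketch: the
function name autolearn_tally is made up for illustration, and it assumes the
X-Spam-Status headers carry an autolearn= field like the autolearn=no mentioned
above. It looks at every line of the file, so body text that happens to
contain "autolearn=" would be counted too.

```sh
#!/bin/sh
# Hypothetical helper (not part of SpamAssassin): tally the autolearn
# outcomes found in a mailbox file, printing "outcome count" per line.
autolearn_tally() {
    # Extract the value after the last "autolearn=" on each matching line,
    # then count how often each value occurs.
    sed -n 's/.*autolearn=\([a-z]*\).*/\1/p' "$1" | sort | uniq -c |
        awk '{ print $2, $1 }'
}

# Usage (the path is an assumption):
#   autolearn_tally /var/spool/mail/archive
```

A steady stream of "no" on obvious spam would suggest the thresholds, rather
than bayes itself, are what is holding learning back.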



>> On a day to day basis I mostly feed automatically with a cronjob that
>> collects mail via spamtraps and hamtraps. I have that coupled with
>> autolearning that's set a bit differently than the defaults. (IMNSHO,
>> having a ham learning threshold that's positive is suicide, but I also have
>> a large number of small negative-score rules so I can keep my threshold at
>> -0.01 and actually autolearn some ham).
>
> I'd love to make my Bayesian database more effective, is there a doc
> somewhere that describes how you tuned it to your environment?

I doubt there is anything that specific and if there was, it most likely
wouldn't help you in your situation.  There are general tuning notes on
the SA website and such but you really just have to try and see what
works and what doesn't in your setup.  What works well for one person may
not work at all for someone else.


-Jim


Re: rules better than bayes?

2006-01-10 Thread Matt Kettler
Aaron Grewell wrote:
> Hi Matt, I'm interested in how your setup compares to mine.  I also find
> Bayes very useful, but I haven't gotten it to work as well as what
> you've described.
> 
> 
>>Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of
>>hits. (And has 98.91% of URIBL's total hits) I find it completely
>>indispensable.
>>
> 
> 
> Are you using a single site-wide database, or is this a per-user setup?

Single site-wide.. I use mailscanner which does not support per-user, but I'm
not really looking for it.
> 
> 
>>I rarely train manually, except at initial setup where I feed it a good
>>base learning. (the autolearner can sometimes go awry if you don't train
>>some mail manually before letting it go.)
>>
> 
> 
> The trouble I had with the autolearner was that some spammers would send
> innocuous mail through to raise their scores until Bayes decided they
> were ok, then start spamming.  That was a couple of versions back, does
> that sort of thing no longer work?


Erm, that really shouldn't affect the bayes autolearner.. perhaps you are
thinking of the AWL? I don't run the AWL for this very reason.

> 
>>On a day to day basis I mostly feed automatically with a cronjob that
>>collects mail via spamtraps and hamtraps. I have that coupled with
>>autolearning that's set a bit differently than the defaults. (IMNSHO,
>>having a ham learning threshold that's positive is suicide, but I also have
>>a large number of small negative-score rules so I can keep my threshold at
>>-0.01 and actually autolearn some ham).
>>
> 
> 
> I'd love to make my Bayesian database more effective, is there a doc
> somewhere that describes how you tuned it to your environment?

Not really.. but it's not hard.

Spamtraps and hamtraps:
---
1) create a secret "hamtrap" email account. Subscribe this account to
newsletters and news feeds that your users typically subscribe to. Do not post
this address around, and don't use "hamtrap" as the account name, it's too 
obvious.

2) create a "spamtrap" account, or several of them. Carefully seed this out in
the body of some Usenet and mailing list postings.

3) create a cron-job that auto-feeds the above mail to sa-learn.

Simple example fragment of the script I use (it keeps a rotating archive of the
past 5 learning sessions):

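#!/bin/sh
# Learn the spamtrap mailbox as spam and keep the last 5 sessions archived.
cd /var/spool/training/ || exit 1

if [ -f /var/spool/mail/spamtrap ]; then
 echo learning spam mailbox - spamtrap
 # Move the mailbox aside first so new deliveries start a fresh file.
 mv /var/spool/mail/spamtrap .
 /usr/bin/sa-learn --spam --mbox spamtrap
 # Rotate the archive: drop the oldest session, shift the rest up by one.
 rm -f spam/spamtrap.alearn5.gz
 mv spam/spamtrap.alearn4.gz spam/spamtrap.alearn5.gz
 mv spam/spamtrap.alearn3.gz spam/spamtrap.alearn4.gz
 mv spam/spamtrap.alearn2.gz spam/spamtrap.alearn3.gz
 gzip spam/spamtrap.alearn1
 mv spam/spamtrap.alearn1.gz spam/spamtrap.alearn2.gz

 # The mailbox just learned becomes the newest (uncompressed) session.
 mv spamtrap spam/spamtrap.alearn1
fi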
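
The same rotation can also be written as a loop, so that the number of kept
sessions becomes a parameter. This is a sketch only, not the script quoted in
this message: the function name rotate_archive and its arguments are made up
for illustration, and it assumes the NAME.alearnN.gz naming used in the
fragment.

```sh
#!/bin/sh
# Hypothetical helper: rotate learning archives named BASE.alearn1 (newest,
# uncompressed) through BASE.alearnKEEP.gz (oldest) inside DIR, then store
# the freshly learned mailbox NEW as the newest session.
rotate_archive() {
    dir=$1; base=$2; keep=$3; new=$4

    # Drop the oldest archive, if there is one.
    rm -f "$dir/$base.alearn$keep.gz"

    # Compress the previous session, which is stored uncompressed.
    if [ -f "$dir/$base.alearn1" ]; then
        gzip "$dir/$base.alearn1"
    fi

    # Shift N -> N+1, starting from the oldest so nothing is overwritten.
    n=$((keep - 1))
    while [ "$n" -ge 1 ]; do
        if [ -f "$dir/$base.alearn$n.gz" ]; then
            mv "$dir/$base.alearn$n.gz" "$dir/$base.alearn$((n + 1)).gz"
        fi
        n=$((n - 1))
    done

    mv "$new" "$dir/$base.alearn1"
}

# Usage, after sa-learn has processed ./spamtrap:
#   rotate_archive spam spamtrap 5 spamtrap
```

Unlike the fixed list of mv commands, the loop skips archives that do not
exist yet, so the first few runs produce no "No such file" noise.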

4) Carefully monitor the data being fed for a while (two weeks or so) to make
sure there's no pollution. After it's established you can monitor it less often.


Autolearn adjustment:
---
1) add  bayes_auto_learn_threshold_nonspam -0.01 to your local.cf

2) create a "bayes_hamlearning.cf" file. Create several simple body text rules
with "catch phrases" from your normal nonspam. Assign these rules very small
negative scores (-0.01 to -0.1). This is generally easier in a corporate
environment, but it can be done in academic too.

body LOCAL_THESIS   /\bThesis\b/i
score LOCAL_THESIS  -0.01

You have to keep the scores small, as you don't want to use these to whitelist
spam mail. You merely want to make mail that would otherwise score 0 earn a
small negative score if it's got some of these phrases in it. It's not perfect,
but it's better than blindly learning everything under the default threshold
of 0.1. I feel learning as ham should be earned, not a default for not hitting
any rules at all.
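
For illustration, a file of such rules could be generated from a list of
phrases with a few lines of shell. This is a sketch under assumptions: the
function name make_ham_rules, the LOCAL_HAM_ rule prefix and the phrases.txt
file name are made up here, and the phrases are assumed to be plain words with
no regex metacharacters in them.

```sh
#!/bin/sh
# Hypothetical generator: read one catch phrase per line on stdin and print
# a body rule plus a small negative score for each one.
# Optional first argument: the score to assign (default -0.01).
make_ham_rules() {
    score=${1:--0.01}
    n=0
    while IFS= read -r phrase; do
        # Skip empty lines.
        [ -n "$phrase" ] || continue
        n=$((n + 1))
        printf 'body LOCAL_HAM_%d   /\\b%s\\b/i\n' "$n" "$phrase"
        printf 'score LOCAL_HAM_%d  %s\n' "$n" "$score"
    done
}

# Usage:
#   make_ham_rules < phrases.txt > bayes_hamlearning.cf
#   spamassassin --lint    # check that the resulting config parses
```

The phrase list itself still has to come from your own ham; generating the
file only saves typing.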

The problem is this requires some customization. This can't be a default setup
of SA as the "catch phrases" vary from place to place, and if there was a
default set of them spammers would be sure to always include them, making them
pointless. You'd effectively have the same thing as the current default, by
avoiding spam rules and existing bayes tokens they can get a message learned.










RE: rules better than bayes?

2006-01-10 Thread Aaron Grewell
Hi Matt, I'm interested in how your setup compares to mine.  I also find
Bayes very useful, but I haven't gotten it to work as well as what
you've described.

> 
> Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of
> hits. (And has 98.91% of URIBL's total hits) I find it completely
> indispensable.
> 

Are you using a single site-wide database, or is this a per-user setup?

> I rarely train manually, except at initial setup where I feed it a good
> base learning. (the autolearner can sometimes go awry if you don't train
> some mail manually before letting it go.)
> 

The trouble I had with the autolearner was that some spammers would send
innocuous mail through to raise their scores until Bayes decided they
were ok, then start spamming.  That was a couple of versions back, does
that sort of thing no longer work?

> On a day to day basis I mostly feed automatically with a cronjob that
> collects mail via spamtraps and hamtraps. I have that coupled with
> autolearning that's set a bit differently than the defaults. (IMNSHO,
> having a ham learning threshold that's positive is suicide, but I also have
> a large number of small negative-score rules so I can keep my threshold at
> -0.01 and actually autolearn some ham).
> 

I'd love to make my Bayesian database more effective, is there a doc
somewhere that describes how you tuned it to your environment?


RE: rules better than bayes?

2006-01-10 Thread Matt Kettler

At 10:50 AM 1/10/2006, Chris Santerre wrote:
> I have long said that IMHO, I do not think bayes is worth it. Left
> unattended, it isn't as good. A simple rule can take out a lot of spam.
> Some may say rule writing is more complicated than training bayes. Maybe.
> Not so much the rule writing, but the figuring out what to look for and
> testing for FPs.



Interesting.. For me, BAYES_99 is right between SURBL and URIBL in terms of 
hits. (And has 98.91% of URIBL's total hits) I find it completely 
indispensable.


I rarely train manually, except at initial setup where I feed it a good 
base learning. (the autolearner can sometimes go awry if you don't train 
some mail manually before letting it go.)


On a day to day basis I mostly feed automatically with a cronjob that 
collects mail via spamtraps and hamtraps. I have that coupled with 
autolearning that's set a bit differently than the defaults. (IMNSHO, 
having a ham learning threshold that's positive is suicide, but I also have 
a large number of small negative-score rules so I can keep my threshold at 
-0.01 and actually autolearn some ham).


This setup is near zero maintenance, and highly effective. I can't see why 
it wouldn't be "worth it". It's almost as good as turning on URIBLs and not 
much more work. It's certainly much less work than rule writing. The last 
time I bothered to tinker with my bayes was before Christmas. 



RE: rules better than bayes?

2006-01-10 Thread Chris Santerre

> I always feel i have to point out the flip side to this just to offer 
> another opinion. 


And I love ya for it ;)  (In the kind of brotherly love one man can feel for another)


> While I certainly don't have a NEED for bayes at our
> facility, I do run it, complete with autolearn.  We have very low volume
> (5k msgs/day) but it works so well I rarely ever have to think about it.
> For us, 96% of the time bayes alone is enough to say whether a
> message is ham/spam.  Add all the other tests on top of this (uribl,
> razor, a few sare) and there's easily a 20 point difference between ham
> and spam.
> 
> -Jim


LOL, yeah. The average spam score from last year has gone up quite a lot! SARE noticeably has fewer things to look at as far as new tactics to cover with rules. Making a spam go from a score of 20 to 21 just doesn't seem a big deal :) 

--Chris 





Re: rules better than bayes?

2006-01-10 Thread Jim Maul

Chris Santerre wrote:
> > -Original Message-
> > From: jo3 [mailto:[EMAIL PROTECTED]
> > Sent: Monday, January 09, 2006 2:28 PM
> > To: users@spamassassin.apache.org
> > Subject: rules better than bayes?
> >
> > Hi,
> >
> > This is an observation, please take it in the spirit in which it is
> > intended, it is not meant to be flame bait.
> >
> > After using spamassassin for six solid months, it seems to me that the
> > bayes process (sa-learn [--spam | --ham]) has only very limited success
> > in learning about new spam.  Regardless of how many spams and hams are
> > submitted, the effectiveness never goes above the default level which,
> > in our case here, is somewhere around 2 out of 3 spams correctly
> > identified.  By the same token, after adding the "third party" rule,
> > airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
> > identified.
>
> I have long said that IMHO, I do not think bayes is worth it. Left unattended,
> it isn't as good. A simple rule can take out a lot of spam. Some may say rule
> writing is more complicated than training bayes. Maybe. Not so much the rule
> writing, but the figuring out what to look for and testing for FPs.
>
> I do not run Bayes for our company. Obviously I'm partial to URIBL.com and SARE
> rules ;)  I get about 98% of spam caught, and few FPs.
>
> This is going to sound like tooting our own horn, but so be it. Before SARE,
> Bayes was cool. After SARE, I see no need.





I always feel I have to point out the flip side to this just to offer
another opinion.  While I certainly don't have a NEED for bayes at our
facility, I do run it, complete with autolearn.  We have very low volume
(5k msgs/day) but it works so well I rarely ever have to think about it.
For us, 96% of the time bayes alone is enough to say whether a
message is ham/spam.  Add all the other tests on top of this (uribl,
razor, a few sare) and there's easily a 20 point difference between ham
and spam.


-Jim


RE: rules better than bayes?

2006-01-10 Thread Chris Santerre

> -Original Message-
> From: jo3 [mailto:[EMAIL PROTECTED]]
> Sent: Monday, January 09, 2006 2:28 PM
> To: users@spamassassin.apache.org
> Subject: rules better than bayes?
> 
> 
> Hi,
> 
> This is an observation, please take it in the spirit in which it is 
> intended, it is not meant to be flame bait.
> 
> After using spamassassin for six solid months, it seems to me that the
> bayes process (sa-learn [--spam | --ham]) has only very limited success
> in learning about new spam.  Regardless of how many spams and hams are
> submitted, the effectiveness never goes above the default level which,
> in our case here, is somewhere around 2 out of 3 spams correctly
> identified.  By the same token, after adding the "third party" rule,
> airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
> identified.


I have long said that IMHO, I do not think bayes is worth it. Left unattended, it isn't as good. A simple rule can take out a lot of spam. Some may say rule writing is more complicated than training bayes. Maybe. Not so much the rule writing, but the figuring out what to look for and testing for FPs. 

I do not run Bayes for our company. Obviously I'm partial to URIBL.com and SARE rules ;)  I get about 98% of spam caught, and few FPs. 

This is going to sound like tooting our own horn, but so be it. Before SARE, Bayes was cool. After SARE, I see no need. 

Chris Santerre
SysAdmin and SARE/URIBL ninja
http://www.uribl.com
http://www.rulesemporium.com





Re: rules better than bayes?

2006-01-09 Thread Dhawal Doshy
Robert Bartlett writes:
> Ok I confused myself. I'm sorry for being an idiot. I get it now. Every time
> an email comes in it tries to access it as the user, since bayes is being
> fed to just the root account it doesn't see anything for the users in
> bayes. With the override I force it to use the root account for all emails
> coming in. Boy am I stupid.
>
> Thanks
> Robert


Try this query to find the right value for bayes_sql_override_username. 

SELECT id, username, spam_count, ham_count, token_count FROM bayes_vars; 

- dhawal 









RE: rules better than bayes?

2006-01-09 Thread Dallas L. Engelken
> -Original Message-
> From: Matt Kettler [mailto:[EMAIL PROTECTED] 
> Sent: Monday, January 09, 2006 2:05 PM
> To: Matthew Yette
> Cc: users@spamassassin.apache.org
> Subject: Re: rules better than bayes?
> 
> [snip] 
> 
> I also strongly recommend enabling SA's URIBL support, and 
> adding on a .cf file to get uribl.com's list added in 
> (default SA only uses surbl.org lists)
> 
>   grep URIBL_BLACK /var/log/maillog |wc -l
>    2214
> 

yes, it gets lonely at the top sometimes... ;)  BTW, we are looking
for additional mirrors if anyone has rbldnsd and a few kb/s to spare...
See www.uribl.com frontpage news for contact.

Actually for me, bayes and razor are constantly the two best hitters..
uribl black comes in a close 3rd

dallase


RE: rules better than bayes?

2006-01-09 Thread Robert Bartlett
Ok I confused myself. I'm sorry for being an idiot. I get it now. Every time
an email comes in it tries to access it as the user, since bayes is being
fed to just the root account it doesn't see anything for the users in
bayes. With the override I force it to use the root account for all emails
coming in. Boy am I stupid.

Thanks
Robert







RE: rules better than bayes?

2006-01-09 Thread Robert Bartlett
Sorry for the confusion, I do use a site wide bayes database, I thought the
information I sent below was the site wide information the system uses to
access the bayes database.

Thanks
Robert 




Re: rules better than bayes?

2006-01-09 Thread Matt Kettler
Robert Bartlett wrote:
>  This is what I have in my local.cf file:
> 
> bayes_store_module   Mail::SpamAssassin::BayesStore::SQL
> bayes_sql_dsn        DBI:mysql:**:localhost:3306
> bayes_sql_username   
> bayes_sql_password   
> 
> Obviously I hid the data that I didn't want to show with *. When I run
> sa-learn it trains into the mysql database just fine, I assume SA connects
> to it just fine because of that.


That's all the database login information. That doesn't mean you have a single
sitewide bayes database.

Again, I suggest looking at the  bayes_sql_override_username option.


RE: rules better than bayes?

2006-01-09 Thread Robert Bartlett
 This is what I have in my local.cf file:

bayes_store_module   Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn        DBI:mysql:**:localhost:3306
bayes_sql_username   
bayes_sql_password   

Obviously I hid the data that I didn't want to show with *. When I run
sa-learn it trains into the mysql database just fine, I assume SA connects
to it just fine because of that.

Robert







Re: rules better than bayes?

2006-01-09 Thread Matt Kettler
Robert Bartlett wrote:
> Interesting, I did that just to see how mine were doing and the BAYES one
> returned 0? Does that mean bayes is not being used? I have been feeding
> emails to bayes and in debug mode it shows bayes being used. I am using
> bayes in a mysql. Just weird that its showing 0.
> 

That sounds a lot like you're training bayes into mysql, but when mail comes in
and gets scanned, it's either not using SQL, or it's not using the same table.

Usually this is a problem with username, where your training is occurring as
"root" but your scanning is occurring as "nobody".

You might want to try using the bayes_sql_override_username option, to force a
single site-wide bayes database, instead of having one per userid executing SA.
(note: that's per userid EXECUTING SA.. not per email recipient.)
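
A quick way to see whether a config file already forces a single Bayes user is
to look for the option directly. The helper below is a sketch: the function
name bayes_override_user is made up for illustration, and the path in the usage
note is an assumption, since local.cf lives in different places on different
systems.

```sh
#!/bin/sh
# Hypothetical check: print the bayes_sql_override_username value set in a
# SpamAssassin config file. Returns non-zero if the option is absent or
# only appears commented out.
bayes_override_user() {
    # Remember the value from the last uncommented matching line.
    awk '$1 == "bayes_sql_override_username" { user = $2 }
         END { if (user == "") exit 1; print user }' "$1"
}

# Usage (path is an assumption):
#   bayes_override_user /etc/mail/spamassassin/local.cf ||
#       echo "no override set: bayes data is kept per userid running SA"
```

If the option is missing, training as one userid and scanning as another will
read and write different bayes data, which matches the symptom described above.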





RE: rules better than bayes?

2006-01-09 Thread Bowie Bailey
qqqq wrote:
> I have since taken bayes out as I get WAY better results without it. 

If it doesn't work for you, don't use it.  The rules and network tests
work pretty well.  Especially if you add some SARE rules into the mix.

However...

> The reason this happens to me is that I get too many spam mailings
> that poison the db and I end up with a lot of spam that shows up as a
> Bayes_00.

That sounds like you have a poorly trained db.  Did you do manual
training or leave it up to the automatic training?

There is really no such thing as bayes poison.  There are only words
that appear frequently in spam and words that don't appear frequently in
spam.  If the spammers drop a bunch of random garbage into their spam,
that's just more stuff for bayes to analyze.  Most likely, it will be
stuff that you wouldn't normally see in your ham mails anyway.

> I use all the Network tests but I get a lot of spam that
> has not been added yet.

Network tests are good for spam runs that have been around for a while.
For newer spams, bayes and some of the more generic rules are where you
will get most of your hits.

-- 
Bowie


RE: rules better than bayes?

2006-01-09 Thread Robert Bartlett
Interesting, I did that just to see how mine were doing and the BAYES one
returned 0? Does that mean bayes is not being used? I have been feeding
emails to bayes and in debug mode it shows bayes being used. I am using
bayes in a mysql. Just weird that its showing 0.

Robert




Re: rules better than bayes?

2006-01-09 Thread Matt Kettler
Matthew Yette wrote:
> 
> Do you recommend running airmax as a supplementary ruleset with 3.1.0?

I personally have no recommendations on it.. I've never run it.

I personally like SARE's specific, evilnumbers, random and adult rulesets.


Here are some quick greps for hit-rates on some SARE rules I use (no declarations
about FPs vs real spam hits, but none of these sets have caused me any problems
so far)

70_sare_evilnum0.cf & 70_sare_evilnum1.cf:
  grep SARE_EN_ /var/log/maillog |wc -l
301
70_sare_specific.cf:
  grep SARE_SPEC_ /var/log/maillog |wc -l
 60
70_sare_genlsubj0.cf:
  grep SARE_SUB /var/log/maillog |wc -l
 44
70_sare_adult.cf:
  grep SARE_ADLT /var/log/maillog |wc -l
 31
70_sare_uri0.cf:
  grep SARE_URI_ /var/log/maillog |wc -l
 10
70_sare_random.cf:
  grep SARE_RAND_ /var/log/maillog |wc -l
  1


I also strongly recommend enabling SA's URIBL support, and adding on a .cf file
to get uribl.com's list added in (default SA only uses surbl.org lists)

  grep URIBL_BLACK /var/log/maillog |wc -l
   2214

  grep _SURBL /var/log/maillog |wc -l
   2144

And of course I get great results from bayes:
  grep BAYES_99 /var/log/maillog |wc -l
   2190

Ditto DCC and Razor2:
 grep RAZOR2_CHECK /var/log/maillog |wc -l
   2114
 grep DCC_CHECK /var/log/maillog |wc -l
   1833
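
The individual greps above can be rolled into one small helper. This is a
sketch, with a made-up function name (count_hits); like the greps, it counts
matching log lines rather than messages, so a message logged on several lines
would be counted more than once.

```sh
#!/bin/sh
# Hypothetical helper: print "PATTERN count" for each rule-name pattern,
# counting matching lines like `grep PATTERN file | wc -l` does.
count_hits() {
    log=$1
    shift
    for rule in "$@"; do
        # grep -c prints the number of matching lines (0 if none).
        printf '%s %s\n' "$rule" "$(grep -c -- "$rule" "$log")"
    done
}

# Usage, sorted with the biggest hitters first:
#   count_hits /var/log/maillog URIBL_BLACK _SURBL BAYES_99 \
#       RAZOR2_CHECK DCC_CHECK | sort -k2,2nr
```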


Re: rules better than bayes?

2006-01-09 Thread Mike Jackson

> Do you recommend running airmax as a supplementary ruleset with 3.1.0?


This is just my humble opinion, but I don't know if that's a ruleset I would 
use in production for a multi-user server. A few of the rules use the 
"f-word" in the rule description line, so it would go out in a verbose 
report. The rules seem pretty random and unfocused, and scored based on gut 
instinct rather than rigorous testing. 



Re: rules better than bayes?

2006-01-09 Thread M. Lewis

Matthew Yette wrote:
>> Correction, airmax.cf is not one single rule, it's one single FILE containing
>> 211 rules. That's a significant difference, given that the stock spamassassin
>> 3.1.0 has about 723 rules.
>>
>> Airmax has increased the number of rules in your system by 29.1%
>
> Do you recommend running airmax as a supplementary ruleset with 3.1.0?



There's an additional downside to airmax. It has excerpts from *lots* of 
SARE rules. If a SARE rule gets updated, will it be updated in airmax.cf?


YMMV,
M

--

 Overflow on /dev/null; please empty the bit bucket.
  14:50:01 up 1 day, 10:21,  5 users,  load average: 0.04, 0.14, 0.11

 Linux Registered User #241685  http://counter.li.org


Re: rules better than bayes?

2006-01-09 Thread mouss
Matt Kettler wrote:
> 
> 
> Realistically, I don't know why your hit rates are so low. They shouldn't be
> so bad that you're only detecting 2 out of every 3.
> 
> You could have some configuration problems, but I can't tell as you've not
> told us anything about your system, just the problems you have.
> 
> Can you answer a few questions that might help us diagnose some of your
> problems:
> 
> What version of SA are you running?
> 
> Can you post an X-Spam-Status header for one of the false negatives?
> 
> Is any of your spam hitting ALL_TRUSTED?
> 
> What BAYES rules are these messages hitting before and after training?
> 
> Do you use any network checks (URIBLs, RBLs, DCC, Razor, Pyzor, SPF)?
> 

also, a common error is to run SA as a user, but train it as another one.


Re: rules better than bayes?

2006-01-09 Thread qqqq
I have since taken bayes out as I get WAY better results without it.  The
reason this happens to me is that I get too many spam mailings that poison
the db and I end up with a lot of spam that shows up as a Bayes_00.  I use
all the Network tests but I get a lot of spam that has not been added yet.



- Original Message - 
From: "jo3" <[EMAIL PROTECTED]>
To: 
Sent: Monday, January 09, 2006 12:27 PM
Subject: rules better than bayes?


| Hi,
|
| This is an observation, please take it in the spirit in which it is
| intended, it is not meant to be flame bait.
|
| After using spamassassin for six solid months, it seems to me that the
| bayes process (sa-learn [--spam | --ham]) has only very limited success
| in learning about new spam.  Regardless of how many spams and hams are
| submitted, the effectiveness never goes above the default level which,
| in our case here, is somewhere around 2 out of 3 spams correctly
| identified.  By the same token, after adding the "third party" rule,
| airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
| identified.
|
| So far, we have not had a single ham misidentified as spam with over one
| million messages examined.
|
| Throughout the documentation, there seems to be a bias toward the bayes
| filter rather than the rule system.  Does anyone on the list have some
| thoughts which would help to explain my observation as to why a single
| rule would appear so successful while a million spams and hams would
| have so little effect?
|
| Thank you,
| Jo3
|
|



Re: rules better than bayes?

2006-01-09 Thread Matthew Yette



On 1/9/06 2:43 PM, "Matt Kettler" <[EMAIL PROTECTED]> wrote:

> jo3 wrote:
>> Hi,
>> 
>> This is an observation, please take it in the spirit in which it is
>> intended, it is not meant to be flame bait.
>> 
>> After using spamassassin for six solid months, it seems to me that the
>> bayes process (sa-learn [--spam | --ham]) has only very limited success
>> in learning about new spam.  Regardless of how many spams and hams are
>> submitted, the effectiveness never goes above the default level which,
>> in our case here, is somewhere around 2 out of 3 spams correctly
>> identified.  By the same token, after adding the "third party" rule,
>> airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
>> identified.
> 
> 
> Realistically, I don't know why your hit rates are so low. They shouldn't be
> so bad that you're only detecting 2 out of every 3.
> 
> You could have some configuration problems, but I can't tell as you've not
> told us anything about your system, just the problems you have.
> 
> Can you answer a few questions that might help us diagnose some of your
> problems:
> 
> What version of SA are you running?
> 
> Can you post an X-Spam-Status header for one of the false negatives?
> 
> Is any of your spam hitting ALL_TRUSTED?
> 
> What BAYES rules are these messages hitting before and after training?
> 
> Do you use any network checks (URIBLs, RBLs, DCC, Razor, Pyzor, SPF)?
> 
> 
>> 
>> So far, we have not had a single ham misidentified as spam with over one
>> million messages examined.
>> 
>> Throughout the documentation, there seems to be a bias toward the bayes
>> filter rather than the rule system.  Does anyone on the list have some
>> thoughts which would help to explain my observation as to why a single
>> rule would appear so successful while a million spams and hams would
>> have so little effect?
>> 
> 
> Correction: airmax.cf is not one single rule, it's one single FILE containing
> 211 rules. That's a significant difference, given that the stock spamassassin
> 3.1.0 has about 723 rules.
> 
> Airmax has increased the number of rules in your system by roughly 29%.
> 
Do you recommend running airmax as a supplementary ruleset with 3.1.0?
-- 
Matthew Yette
Senior Engineer (NOC/Operations)
M.A. Polce Consulting
315-838-1644



Re: rules better than bayes?

2006-01-09 Thread Matt Kettler
jo3 wrote:
> Hi,
> 
> This is an observation, please take it in the spirit in which it is
> intended, it is not meant to be flame bait.
> 
> After using spamassassin for six solid months, it seems to me that the
> bayes process (sa-learn [--spam | --ham]) has only very limited success
> in learning about new spam.  Regardless of how many spams and hams are
> submitted, the effectiveness never goes above the default level which,
> in our case here, is somewhere around 2 out of 3 spams correctly
> identified.  By the same token, after adding the "third party" rule,
> airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
> identified.


Realistically, I don't know why your hit rates are so low. They shouldn't be so
bad that you're only detecting 2 out of every 3.

You could have some configuration problems, but I can't tell as you've not told
us anything about your system, just the problems you have.

Can you answer a few questions that might help us diagnose some of your 
problems:

What version of SA are you running?

Can you post an X-Spam-Status header for one of the false negatives?

Is any of your spam hitting ALL_TRUSTED?

What BAYES rules are these messages hitting before and after training?

Do you use any network checks (URIBLs, RBLs, DCC, Razor, Pyzor, SPF)?
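
For anyone gathering answers to the questions above, a minimal sketch of pulling the score and rule names out of an X-Spam-Status header might look like the following. It assumes the usual "verdict, score=... required=... tests=A,B,C key=value" layout, with 2.x releases writing hits= instead of score=. The function name parse_spam_status and the sample header are made up for illustration and are not part of SpamAssassin.

```python
import re


def parse_spam_status(header_value):
    """Split an X-Spam-Status value into verdict, score, threshold and rules.

    Assumes the common layout "Yes|No, score=N required=N tests=A,B,C ...".
    SA 2.x is assumed to write "hits=" where 3.x writes "score=".
    """
    # Folded headers continue on indented lines; collapse all whitespace runs.
    flat = re.sub(r"\s+", " ", header_value.strip())
    verdict = flat.split(",", 1)[0].strip()

    score = re.search(r"\b(?:score|hits)=(-?[\d.]+)", flat)
    required = re.search(r"\brequired=(-?[\d.]+)", flat)
    # The rule list runs from "tests=" up to the next " key=" token (or the end).
    tests = re.search(r"\btests=(.*?)(?=\s+\w+=|$)", flat)

    rules = []
    if tests:
        rules = [name for name in tests.group(1).replace(" ", "").split(",")
                 if name and name != "none"]

    return {
        "verdict": verdict,
        "score": float(score.group(1)) if score else None,
        "required": float(required.group(1)) if required else None,
        "rules": rules,
        # BAYES_* hits show which probability bucket the message landed in.
        "bayes": [name for name in rules if name.startswith("BAYES_")],
    }


# Hypothetical false-negative header, folded the way MTAs usually fold it.
sample = ("No, score=2.1 required=5.0 tests=BAYES_50,HTML_MESSAGE,\n"
          "\tALL_TRUSTED autolearn=no version=3.1.0")
info = parse_spam_status(sample)
print(info["verdict"], info["score"], info["rules"])
```

Running this over a handful of false negatives shows at a glance whether ALL_TRUSTED is firing and which BAYES rule each message hit.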


> 
> So far, we have not had a single ham misidentified as spam with over one
> million messages examined.
> 
> Throughout the documentation, there seems to be a bias toward the bayes
> filter rather than the rule system.  Does anyone on the list have some
> thoughts which would help to explain my observation as to why a single
> rule would appear so successful while a million spams and hams would
> have so little effect?
> 

Correction: airmax.cf is not one single rule, it's one single FILE containing
211 rules. That's a significant difference, given that the stock spamassassin
3.1.0 has about 723 rules.

Airmax has increased the number of rules in your system by roughly 29%.
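
The arithmetic behind that figure can be checked with a short sketch. The two counts are the ones quoted above (723 is itself approximate); the helper name percent_increase is just for illustration.

```python
def percent_increase(base_count, added_count):
    """Return how much added_count grows base_count, as a percentage."""
    return 100.0 * added_count / base_count


stock_rules = 723    # approximate stock rule count for SpamAssassin 3.1.0
airmax_rules = 211   # rules contained in airmax.cf

increase = percent_increase(stock_rules, airmax_rules)
total_rules = stock_rules + airmax_rules
print(f"{increase:.1f}% more rules, {total_rules} in total")
```

With these counts the increase comes out at about 29%, for 934 rules in total. So the comparison is between Bayes training and a couple of hundred extra rules, not one.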