Re: Those "Re: good obfupills" spams (bayes scores)

2006-04-29 Thread jdow

From: "Bart Schaefer" <[EMAIL PROTECTED]>

On 4/29/06, Matt Kettler <[EMAIL PROTECTED]> wrote:

 In SA 3.1.0 they did force-fix the scores of the bayes rules,
particularly the high-end. The perceptron assigned BAYES_99 a score of
1.89 in the 3.1.0 mass-check run. The devs jacked it up to 3.50.

That does make me wonder if:
1) When BAYES_9x FPs, it FPs in conjunction with lots of other rules
due to the ham corpus being polluted with spam.


My recollection is that there was speculation that the BAYES_9x rules
were scored "too low" not because they FP'd in conjunction with other
rules, but because against the corpus they TRUE P'd in conjunction
with lots of other rules, and that it therefore wasn't necessary for
the perceptron to assign a high score to BAYES_9x in order to push the
total over the 5.0 threshold.

The trouble with that is that users expect training on their personal
spam flow to have a more significant effect on the scoring.  I want to
train bayes to compensate for the LACK of other rules matching, not
just to give a final nudge when a bunch of others already hit.

I filed a bugzilla some while ago suggesting that the bayes percentage
ought to be used to select a rule set, not to adjust the score as a
component of a rule set.

<< jdow >> There is one other gotcha. I bet vastly different scores
are warranted for Bayes when run with per-user training and rules
as compared to global training and rules.

{^_^}


Re: Those "Re: good obfupills" spams (bayes scores)

2006-04-29 Thread jdow

From: "Matt Kettler" <[EMAIL PROTECTED]>


Bart Schaefer wrote:

On 4/29/06, Matt Kettler <[EMAIL PROTECTED]> wrote:

Besides.. If you want to make a mathematics-based argument against me,
start by explaining how the perceptron is mathematically flawed. It
assigned the original score based on real-world data.


Did it?  I thought the BAYES_* scores have been fixed values for a
while now, to force the perceptron to adapt the other scores to fit.


Actually, you're right.. I'm shocked and floored, but you're right.

In SA 3.1.0 they did force-fix the scores of the bayes rules,
particularly the high-end. The perceptron assigned BAYES_99 a score of
1.89 in the 3.1.0 mass-check run. The devs jacked it up to 3.50.

That does make me wonder if:
   1) When BAYES_9x FPs, it FPs in conjunction with lots of other rules
due to the ham corpus being polluted with spam. This forces the
perceptron to attempt to compensate. (Pollution is always a problem
since nobody is perfect, but it occurs to differing degrees.)
  -or-
   2) The perceptron is out of whack. (I highly doubt this, because the
perceptron generated the 3.0.x scores and they were fine.)
  -or-
   3) The real-world FPs of BAYES_99 really do tend to also be cascades
with other rules in the 3.1.x ruleset, and the perceptron is correctly
capping the score. This could differ from 3.0.x due to changes in rules,
or changes in ham patterns over time.
  -or-
   4) One of the corpus submitters has a poorly trained bayes db.
(Possible, but I doubt it.)

Looking at statistics-set3 for 3.0.x and 3.1.x, there was a slight
increase in ham hits for BAYES_99 and a slight decrease in spam hits.
3.0.x:
OVERALL%  SPAM%    HAM%    S/O    RANK  SCORE  NAME
43.515    89.3888  0.0335  1.000  0.83  1.89   BAYES_99
3.1.x:
OVERALL%  SPAM%    HAM%    S/O    RANK  SCORE  NAME
60.712    86.7351  0.0396  1.000  0.90  3.50   BAYES_99

Also worth considering: set3 of 3.0.x was much closer to a 50/50 mix of
spam/nonspam (48.7/51.3) than 3.1.0's (nearly 70/30).


What happens here comes from the basic reality that Bayes and the other
rules are not orthogonal sets. So many other rules hit 95 and 99 that
the perceptron artificially reduced the goodness rating for these rules.

It needs some serious skewing to catch situations where 95 or 99 hit and
very few other rules hit. Those are the times the accuracy of Bayes is
needed the most. I've found, here, that 5.0 is a suitable score. I
suspect if I were more realistic 4.9 would be closer. But I still
remember learning of the score bias and being floored by it when I noticed
99 on some spams that leaked through with ONLY the 99 hit. I am speaking
of dozens of spams hit that way.

So far, over several years, I've found a few special cases that warrant
negative rules. That seems to be pulling the 99 rule's false-alarm
rate down to "I can't see it." (I have, however, been tempted to generate
a BAYES_99p5 rule and a BAYES_99p9 rule to fine-tune the scores up around
4.9 and 5.0.)
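
A minimal sketch of what those could look like, reusing the check_bayes
eval that the stock BAYES_* rules are built on (rule names, ranges, and
scores are illustrative guesses, not tested values):

  # these overlap the stock BAYES_99 range (0.99-1.00), so their scores
  # stack on top of its 3.50, bringing the totals to roughly 4.9 and 5.0
  body     BAYES_99P5  eval:check_bayes('0.995', '0.999')
  describe BAYES_99P5  Bayes spam probability is 99.5 to 99.9%
  score    BAYES_99P5  1.4

  body     BAYES_99P9  eval:check_bayes('0.999', '1.00')
  describe BAYES_99P9  Bayes spam probability is 99.9 to 100%
  score    BAYES_99P9  1.5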

{^_^}


Re: Those "Re: good obfupills" spams

2006-04-29 Thread jdow

From: "Matt Kettler" <[EMAIL PROTECTED]>

List Mail User wrote:

Matt Kettler replied:

John Tice wrote:


Greetings,
This is my first post after having lurked some. So, I'm getting these
same "RE: good" spams but they're hitting eight rules and typically
scoring between 30 and 40. I'm really unsophisticated compared to you
guys, which raises the question: what am I doing wrong? All I use is a
tweaked user_prefs wherein I have gradually raised the scores on
standard rules found in spam that slips through over a period of time.
These particular spams are over the top on bayesian (1.0), have
multiple database hits, forged rcvd_helo and so forth. Bayesian alone
flags them for me. I'm trying to understand the reason you would not
want to have these types of rules set this high. I must be way
over-optimized: what am I not getting?


BAYES_99, by definition, has a 1% false positive rate.



If we were to presume a uniform distribution between an estimate of
99% and 100%, then the FP rate would be .5%, not 1%.

You're right Paul, my bad..

But again, I don't care if it's 0.01%. The question here is: "is jacking
up the score of BAYES_99 to be greater than required_hits a good idea?"
The answer is no, because BAYES_99 is NOT a 100% accurate test; by
definition it has a non-zero FP rate.


I run AT 5.0. When I see my first false alarm solely from BAYES_99
I will reduce it slightly. I know what theory says. I also know that
BAYES_99 alone captures far more spam than it has ever falsely
imprisoned ham.
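
In user_prefs terms, the whole disagreement is over a single line; the
value shown is the local setting just described, not a recommendation:

  score BAYES_99 5.0   # stock 3.1.0 ships 3.5; at 5.0 this rule alone reaches required_hits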


 And for large sites
(i.e. tens of thousands of messages a day or more), this may be what occurs.
But what I see, and what I assume many other small sites see, is a very
non-uniform distribution.  Over the last 30 hours, the average estimate (i.e.
the value reported in the "bayes=xxx" clause) for spam hitting the BAYES_99
rule is .41898013269, with about two thirds of them reporting bayes=1 and
a lowest value of bayes=0.998721756590216.


Yes, that's to be expected with Chi-Squared combining.

While SA is quite robust, largely because of the design feature that
no single reason/cause/rule should by itself mark a message as spam, I have
to guess that the FP rate that the majority of users see for BAYES_99 is far
below 1%.  From the estimators reported above, I would expect to have seen
a .003% FP rate for the last day plus a little, if only I had received
100,000 or so messages, enough to be able to see it :).


True, but it's still not nearly zero. Even in the corpus testing, which
is run by "the best of the best" in SA administration and maintenance,
BAYES_99 matched 0.0396% of ham, or 21 out of 53,091 hams. (Based on
set-3 of SA 3.1.0)


And it is scored LESS than BAYES_95 by default. That's a clear signal
that the theory behind the scoring system is a little skewed and needs
some rethinking.


Given we are dealing with a user who doesn't even understand why you might
not want this set "high enough", I would expect the level of
sophistication in bayes maintenance to be low as well.

Besides.. If you want to make a mathematics-based argument against me,
start by explaining how the perceptron is mathematically flawed. It
assigned the original score based on real-world data, not our vast
oversimplifications. You should have good reason to question its design
before second-guessing its scoring based on speculation such as this.


When it can give BAYES_99 a score LOWER than BAYES_95 it clearly has
a conceptual problem. (It also indicates that automatic Bayes filter
training has its own conceptual flaws.)


I don't change the scoring from the defaults, but if people were to
want to, maybe they could change the rules (or add a rule) for BAYES_99_99
which would take only scores higher than bayes=. and which (again with
a uniform distribution) would have an expected FP rate of .005% - then
re-score that closer to (but still less than) the spam threshold,

I'd agree.. However, the OP has already made BAYES_99 > required_hits.
Bad idea. Period.


5.0 is, admittedly, marginal. 6 or 7 is not a good idea. Not enough rules
exist that will pull it back down. (Thinking on that, I suspect there are
some SARE rules that should lower the score slightly when they are not
hit.)

{^_^} 



Re: Tracking Compound Meta's

2006-04-29 Thread David B Funk
On Fri, 28 Apr 2006, Dan wrote:

> > It looks like it might have some interesting purposes. But for the
> > most part, I can't think of what you would use it for. I can't
> > think of a single example where SARE could have used this before.
>
> Actually, the way I expect to use it is more like:
>
>   __test [A1 - A3]
>   __test [B1 - B3]
>   __test [C1 - C3]
>   __test [D1 - D3]
>
>   meta __META_A (__testA1 || __testA2 || __testA3)
[snip..]
> Still pretty new to SA, I'm in the middle of building my system and
> was hoping to find preexisting features I could simply build my
> configuration around.  If micro weighting (.001) doesn't work, I'll
> make a feature request after deciding the best way to do what I'm
> after.  Thinking about it today, my ideal would be:
>
> 1) An option to turn off scoring for specific tests WITHOUT turning
> off its event reporting.  Perhaps a different prefix, like  ++test
> instead of  __test.
>
> AND
>
> 2) A logging system that records EVERY test involved for EVERY
> message scanned, that also allows me to locate the correct entry
> (with a text editor) when all I have is the Subject: or From: of a
> given message.

What about using the SA 'test rule' mechanism?
(i.e. use "T_testA1" rather than "__testA1".)
Effectively, that gets the micro-weighting done automagically and in a
standardized way.

Read the SA conf documentation for details.
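
For example (the rule name and pattern are made up; a rule whose name
starts with "T_" automatically gets the nominal 0.01 score, yet still
shows up in the tests= list of the X-Spam-Status header when it hits):

  body     T_OBFUPILL_PHRASE  /\bobfupills?\b/i
  describe T_OBFUPILL_PHRASE  test rule for an obfuscated-pills phrase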

-- 
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_adminIowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{


Re: Those "Re: good obfupills" spams (bayes scores)

2006-04-29 Thread Bart Schaefer

On 4/29/06, Matt Kettler <[EMAIL PROTECTED]> wrote:

 In SA 3.1.0 they did force-fix the scores of the bayes rules,
particularly the high-end. The perceptron assigned BAYES_99 a score of
1.89 in the 3.1.0 mass-check run. The devs jacked it up to 3.50.

That does make me wonder if:
1) When BAYES_9x FPs, it FPs in conjunction with lots of other rules
due to the ham corpus being polluted with spam.


My recollection is that there was speculation that the BAYES_9x rules
were scored "too low" not because they FP'd in conjunction with other
rules, but because against the corpus they TRUE P'd in conjunction
with lots of other rules, and that it therefore wasn't necessary for
the perceptron to assign a high score to BAYES_9x in order to push the
total over the 5.0 threshold.

The trouble with that is that users expect training on their personal
spam flow to have a more significant effect on the scoring.  I want to
train bayes to compensate for the LACK of other rules matching, not
just to give a final nudge when a bunch of others already hit.

I filed a bugzilla some while ago suggesting that the bayes percentage
ought to be used to select a rule set, not to adjust the score as a
component of a rule set.
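
In the meantime, SA's meta-rule arithmetic can fake a crude version of
this. A sketch, in which the helper rules are merely stand-ins for
whichever "other rules" matter locally and the score is a guess:

  # extra points only when bayes is very confident but little else hit
  meta  BAYES_99_ALONE  (BAYES_99 && (RAZOR2_CHECK + URIBL_SBL + HELO_DYNAMIC_IPADDR) < 1)
  score BAYES_99_ALONE  1.5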


Re: Those "Re: good obfupills" spams (bayes scores)

2006-04-29 Thread Matt Kettler
Bart Schaefer wrote:
> On 4/29/06, Matt Kettler <[EMAIL PROTECTED]> wrote:
>> Besides.. If you want to make a mathematics-based argument against me,
>> start by explaining how the perceptron is mathematically flawed. It
>> assigned the original score based on real-world data.
>
> Did it?  I thought the BAYES_* scores have been fixed values for a
> while now, to force the perceptron to adapt the other scores to fit.
>
Actually, you're right.. I'm shocked and floored, but you're right.

 In SA 3.1.0 they did force-fix the scores of the bayes rules,
particularly the high-end. The perceptron assigned BAYES_99 a score of
1.89 in the 3.1.0 mass-check run. The devs jacked it up to 3.50.

That does make me wonder if:
1) When BAYES_9x FPs, it FPs in conjunction with lots of other rules
due to the ham corpus being polluted with spam. This forces the
perceptron to attempt to compensate. (Pollution is always a problem
since nobody is perfect, but it occurs to differing degrees.)
   -or-
2) The perceptron is out of whack. (I highly doubt this, because the
perceptron generated the 3.0.x scores and they were fine.)
   -or-
3) The real-world FPs of BAYES_99 really do tend to also be cascades
with other rules in the 3.1.x ruleset, and the perceptron is correctly
capping the score. This could differ from 3.0.x due to changes in rules,
or changes in ham patterns over time.
   -or-
4) One of the corpus submitters has a poorly trained bayes db.
(Possible, but I doubt it.)

Looking at statistics-set3 for 3.0.x and 3.1.x, there was a slight
increase in ham hits for BAYES_99 and a slight decrease in spam hits.
3.0.x:
OVERALL%  SPAM%    HAM%    S/O    RANK  SCORE  NAME
43.515    89.3888  0.0335  1.000  0.83  1.89   BAYES_99
3.1.x:
OVERALL%  SPAM%    HAM%    S/O    RANK  SCORE  NAME
60.712    86.7351  0.0396  1.000  0.90  3.50   BAYES_99

Also worth considering: set3 of 3.0.x was much closer to a 50/50 mix of
spam/nonspam (48.7/51.3) than 3.1.0's (nearly 70/30).



Re: Those "Re: good obfupills" spams

2006-04-29 Thread Bart Schaefer

On 4/29/06, Matt Kettler <[EMAIL PROTECTED]> wrote:

Besides.. If you want to make a mathematics-based argument against me,
start by explaining how the perceptron is mathematically flawed. It
assigned the original score based on real-world data.


Did it?  I thought the BAYES_* scores have been fixed values for a
while now, to force the perceptron to adapt the other scores to fit.


Re: OT spammers

2006-04-29 Thread David Gibbs
Igor Chudov wrote:
> Here's something that I do not understand. What is the point of
> spamming people repeatedly not once, twice, or even 10 times, but
> hundreds of times. If I wanted to procure pils, or pgrn, or whatever,
> I would have done it on the first 10 spams. After 100 or so spams,
> what is the benefit of sending me yet more spam? I seem to receive
> some spams, such as about getting fake education, way over 100 times. 

Because it works.

Scary to think that some people are that stupid.

david



Re: Those "Re: good obfupills" spams

2006-04-29 Thread Bart Schaefer

On 4/29/06, List Mail User <[EMAIL PROTECTED]> wrote:


While SA is quite robust largely because of the design feature that
no single reason/cause/rule should by itself mark a message as spam, I have
to guess that the FP rate that the majority of users see for BAYES_99 is far
below 1%.



Anyway, to better address the OP's questions:  The system is more
robust if instead of changing the weighting of existing rules (assuming that
they were correctly established to begin with), you add more possible inputs


Exactly.  For example, I find that anything in the subset consisting
of messages that don't mention my email address anywhere in the To/Cc
headers and that also score above BAYES_70 has close to a 100% likelihood
of being spam.  However, since I also get quite a lot of mail that
doesn't fall into that subset, I can't simply increase the scores for
the BAYES rules.

In this case I use procmail to examine the headers after SA has scored
the message, but I've been considering creating a meta-rule of some
kind.  Trouble is, SA doesn't know what "my email address" means (it'd
need to be a list of addresses), and I'm reluctant to turn on
allow_user_rules.
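
A sketch of the meta-rule in question; the address is a placeholder (the
real list would be longer), and the BAYES_* buckets would need to match
the installed release:

  header __TO_OR_CC_ME       ToCc =~ /my\.address\@example\.com/i
  meta   BAYES_HI_NOT_TO_ME  (!__TO_OR_CC_ME && (BAYES_80 || BAYES_95 || BAYES_99))
  score  BAYES_HI_NOT_TO_ME  2.0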


Re: SA & Razor problem - help requested

2006-04-29 Thread Rainer Sokoll
On Sat, Apr 29, 2006 at 01:07:28PM -0400, Theo Van Dinter wrote:

> On Sat, Apr 29, 2006 at 06:16:36PM +0200, Rainer Sokoll wrote:
> > loadplugin Mail::SpamAssassin::Plugin::Razor2
> 
> don't do that in a cf file..

Moved to init.pre

> What does the output from:
> 
> spamassassin --lint -D razor2
> 
> look like?

dbg: razor2: razor2 is available, version 2.81

Nothing else :-(

Rainer


Re: SA & Razor problem - help requested

2006-04-29 Thread Theo Van Dinter
On Sat, Apr 29, 2006 at 06:16:36PM +0200, Rainer Sokoll wrote:
> loadplugin Mail::SpamAssassin::Plugin::Razor2

don't do that in a cf file..

> Any suggestions?

What does the output from:

spamassassin --lint -D razor2

look like?

-- 
Randomly Generated Tagline:
"What is a lie but the truth in masquerade." - Byron




KMail and spamassassin question

2006-04-29 Thread Spiro Angeli
Hi,

I run Gentoo Linux and KDE 3.5.2 with KMail.
Currently I have SpamAssassin version 3.1.0 installed and configured.
I configured SA to run as a daemon, with KMail invoking it as a plug-in. So,
anytime I receive mail through KMail, SA filters all mail.

I have a few questions regarding how SA filters mail and how it integrates
with KMail. I would appreciate some explanations if possible; I already asked
on the KMail mailing list, but it could not provide answers.

1) After I installed SA, is it advisable to install ready-made rules,
like spamassassin-ruledujour and others like it? If so, how do I implement
them? Just install them and that is all?

2) I have some recurrent spam that is not filtered even after I train SA
on it. What should I do?

3) Are there different ways of setting up SA and KMail?

4) How do I create a whitelist with SA and KMail where I can see a list of
all my whitelist members?

5) Any good book to purchase to learn SA properly?

Thank you,
Spiro


Re: SA & Razor problem - help requested

2006-04-29 Thread Rainer Sokoll
On Sat, Apr 29, 2006 at 10:39:48AM -0400, Theo Van Dinter wrote:

> the third thing in the UPGRADE doc:
> 
> - Due to license restrictions the DCC and Razor2 plugins are disabled
>   by default. [...]

OK, in my local.cf I have:

loadplugin Mail::SpamAssassin::Plugin::Razor2
ifplugin Mail::SpamAssassin::Plugin::Razor2
  use_razor2 1
  razor_config /home/vscan/.razor/razor-agent.conf
endif

A test gives me:

[EMAIL PROTECTED]:~> /usr/local/perl-5.8.8/bin/spamassassin -D --lint \
--config-file=/tmp/local.cf 2>&1 | grep -i razor
[25516] dbg: diag: module installed: Razor2::Client::Agent, version 2.81
[25516] dbg: plugin: loading Mail::SpamAssassin::Plugin::Razor2 from @INC
[25516] dbg: razor2: razor2 is available, version 2.81
[25516] dbg: plugin: registered 
Mail::SpamAssassin::Plugin::Razor2=HASH(0x90f8dec)
[25516] dbg: plugin: loading Mail::SpamAssassin::Plugin::Razor2 from @INC
[25516] dbg: razor2: razor2 is available, version 2.81
[25516] dbg: plugin: did not register 
Mail::SpamAssassin::Plugin::Razor2=HASH(0x8fc5f1c), already registered
[EMAIL PROTECTED]:~>

Besides the fact that Razor2 seems to be loaded twice: shouldn't I expect to
see something similar to
http://wiki.apache.org/spamassassin/RazorHowToTell ?
If I do a razor-check manually, razor seems to work fine.
Any suggestions?

Thank you,
Rainer


Re: SA & Razor problem - help requested

2006-04-29 Thread David Flanigan
Theo, 

Thanks for this. Now I feel stupid for bothering the list. I have been
running SA for some time, and didn't notice that change. My bad.

Thanks for the quick reply!
Dave

On Sat, 29 Apr 2006 10:39:48 -0400, Theo Van Dinter wrote
> On Sat, Apr 29, 2006 at 08:58:42AM -0400, David Flanigan wrote:
> > (http://www.flanigan.net/spam) seen even a single RAZOR hit. However, I
> > get no errors in the error logs. The only error I see is on a
> > `spamassassin --lint` which says:
> > 
> > [8611] warn: config: failed to parse line, skipping: use_razor2__1
> > [8611] warn: config: failed to parse line, skipping: razor_config 
> > __/etc/mail/spamassassin/.razor/razor-agent.conf
> 
> enable the razor plugin in v310.pre.
> 
> > Oddly, I get the exact same symptoms with DCC.
> 
> ditto.
> 
> > I have searched the mailing list, and I followed the wiki guide at 
> > spamassassin.apache.org for installing razor with SA. I have verified that
> > both SA and Razor work on their own, and have fed razor several test
> > messages. SA works fine other than the razor problems.
> 
> the third thing in the UPGRADE doc:
> 
> - Due to license restrictions the DCC and Razor2 plugins are disabled
>   by default. [...]
> 
> -- 
> Randomly Generated Tagline:
> "... then you'll excuse me, but I'm in the middle of fifteen things, all of
>  them annoying."
>  - Ivonova, Babylon 5 (Midnight on the Firing Line)


---
Kind Regards,
David

http://www.flanigan.net



Re: SA & Razor problem - help requested

2006-04-29 Thread Theo Van Dinter
On Sat, Apr 29, 2006 at 08:58:42AM -0400, David Flanigan wrote:
> (http://www.flanigan.net/spam) seen even a single RAZOR hit. However, I get 
> no errors 
> in the error logs. The only error I see is on a `spamassassin --lint` which 
> says:
> 
> [8611] warn: config: failed to parse line, skipping: use_razor2__1
> [8611] warn: config: failed to parse line, skipping: razor_config 
> __/etc/mail/spamassassin/.razor/razor-agent.conf

enable the razor plugin in v310.pre.

> Oddly, I get the exact same symptoms with DCC. 

ditto.

> I have searched the mailing list, and I followed the wiki guide at 
> spamassassin.apache.org for installing razor with SA. I have verified that 
> both SA and Razor work on their own, and have fed razor several test 
> messages. SA works fine other than the razor problems.

the third thing in the UPGRADE doc:

- Due to license restrictions the DCC and Razor2 plugins are disabled
  by default. [...]
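
(Concretely: these loadplugin lines ship commented out in
/etc/mail/spamassassin/v310.pre; uncomment them and restart spamd.)

  loadplugin Mail::SpamAssassin::Plugin::DCC
  loadplugin Mail::SpamAssassin::Plugin::Razor2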

-- 
Randomly Generated Tagline:
"... then you'll excuse me, but I'm in the middle of fifteen things, all of
 them annoying."
 - Ivonova, Babylon 5 (Midnight on the Firing Line)




Re: Those "Re: good obfupills" spams

2006-04-29 Thread John Tice


Thank you all for the comments. My personal experience is that
Bayes_99 is amazingly reliable: close to 100% for me. I formerly had
it set to 4.5 so that bayes_99 plus one other hit would flag it, but
then I started getting some spams that were not hit by any other rule,
yet bayes correctly identified them. It seems more effective to write
some negative-scoring ham rules specific to my important content
rather than to take less than full advantage of the high accuracy of
bayes. And, the spams in question in this thread are hitting multiple
rules, so they should be catchable without having bayes_99 set over the top.


I suppose all these judgments must take into account one's
preferences, degree of aversion to FPs, and the diversity of content
you're working with. Hopefully I will improve accuracy by writing and
adding custom rules and be able to back off the scoring of standard
rules, but I have been fairly successful (by my own definition) at
tweaking standard rules with minimal FPs. At times when I do get an FP
I take a look at it and think "this one just deserves to get
filtered." I'm willing to accept a certain amount, or a certain type,
in order to be aggressive against spam. Before, I only had access to
user_prefs, but now that I have a server with root access it's a
brand new ball game. The mechanics are easy enough, but I need to
work on the broader strategies. Any particularly good reading to be
recommended?


John







On Apr 29, 2006, at 8:12 AM, List Mail User wrote:


...

Matt Kettler replied:

John Tice wrote:

Greetings,
This is my first post after having lurked some. So, I'm getting these
same "RE: good" spams but they're hitting eight rules and typically
scoring between 30 and 40. I'm really unsophisticated compared to you
guys, which raises the question: what am I doing wrong? All I use is a
tweaked user_prefs wherein I have gradually raised the scores on
standard rules found in spam that slips through over a period of time.
These particular spams are over the top on bayesian (1.0), have
multiple database hits, forged rcvd_helo and so forth. Bayesian alone
flags them for me. I'm trying to understand the reason you would not
want to have these types of rules set this high. I must be way
over-optimized: what am I not getting?

BAYES_99, by definition, has a 1% false positive rate.

Matt,

If we were to presume a uniform distribution between an estimate of
99% and 100%, then the FP rate would be .5%, not 1%.  And for large sites
(i.e. tens of thousands of messages a day or more), this may be what occurs.
But what I see, and what I assume many other small sites see, is a very
non-uniform distribution.  Over the last 30 hours, the average estimate (i.e.
the value reported in the "bayes=xxx" clause) for spam hitting the BAYES_99
rule is .41898013269, with about two thirds of them reporting bayes=1 and
a lowest value of bayes=0.998721756590216.

While SA is quite robust, largely because of the design feature that
no single reason/cause/rule should by itself mark a message as spam, I have
to guess that the FP rate that the majority of users see for BAYES_99 is far
below 1%.  From the estimators reported above, I would expect to have seen
a .003% FP rate for the last day plus a little, if only I had received
100,000 or so messages, enough to be able to see it :).

I don't change the scoring from the defaults, but if people were to
want to, maybe they could change the rules (or add a rule) for BAYES_99_99
which would take only scores higher than bayes=. and which (again with
a uniform distribution) would have an expected FP rate of .005% - then
re-score that closer to (but still less than) the spam threshold, or add a
point or fraction thereof to raise the score to just under the spam threshold
(adding a new rule would avoid having to edit distributed files and thus
would probably be the "better" method).

Anyway, to better address the OP's questions:  The system is more
robust if, instead of changing the weighting of existing rules (assuming that
they were correctly established to begin with), you add more possible inputs
(and preferably independent ones - i.e. where the FPs between rules have a
low correlation).  Simply increasing scores will improve your spam "capture"
rate, just as decreasing the spam threshold will - but both methods will add
to the likelihood of false positives.  Look into the distributed documentation
to see the expected FP rates at different spam threshold levels for numbers
to drive this point home (and changing specific rules' scores is just like
changing the threshold, but in a non-uniform fashion - unless you actually
measure the values for your own site's mail and recompute numbers that are
a better estimate for local traffic).

Paul Shupak
[EMAIL PROTECTED]






Re: SQLite

2006-04-29 Thread Michael Parker
Jonas Eckerman wrote:
> Jakob Hirsch wrote:
> 
>> I don't think SQLite itself is _that_ slow (in fact, I don't think it's
>> slow at all), it's most probably a matter of optimization,
> 
> SQLite *can* be very slow at some inserts/updates on some systems
> because of how it handles writes. SQLite creates a temporary file for
> each write operation, and also waits for writes to be safely finished by
> the OS.
> 
> If speed is more important than database consistency, the SQL command
> 'PRAGMA SYNCHRONOUS=OFF' makes SQLite a *lot* faster. It simply tells
> SQLite not to wait for every write to be finished.

Yep, I've tried this.

> 
> On a stable system with working backup routines running SQLite with
> 'PRAGMA SYNCHRONOUS=OFF' for bayes makes a lot of sense.
> 
> Is there any easy way to tell SpamAssassin's SQL initialization to run
> specific commands directly after opening a database connection?
> Or would it make more sense creating a
> 'Mail::SpamAssassin::BayesStore::SQLite' that does this if told to?

It has been a while, but I believe you just need to do this at create
time, so you'd only need a proper .sql file that did it.  If you look in
the "Attic" or whatever they call it in subversion, you'll see that
there used to exist SQLite files.

I believe a custom plugin would need to be created to make use of the
transactional capabilities.  However, I've done the work in the past and
discovered it just was not worth it; you were better off sticking with
Berkeley DB or the MUCH faster SDBM.

That said, that doesn't mean that I wouldn't welcome a contribution from
someone who went off and did the work, so feel free to create the module
and do the testing.  Submit a bug with the code and results attached and
I will strongly consider adding it to the source tree.
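
For anyone who does experiment: the bayes SQL store is selected with
options like the following. The SQLite DSN is the hypothetical part here;
stock 3.1 only ships MySQL/PostgreSQL schemas, so a matching table layout
would have to be created by hand.

  bayes_store_module  Mail::SpamAssassin::BayesStore::SQL
  bayes_sql_dsn       dbi:SQLite:dbname=/var/lib/spamassassin/bayes.db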


> 
> (I'm moving stuff into a SQLite database in a MIMEDefang filter, so I'm
> thinking of trying it out for bayes as well...)
> 
>> If time permits, I'll do a benchmark run, anyway,
> 
> Are there any ready made benchmark scripts for the bayes stuff?

As Matt said you can find it here:
http://wiki.apache.org/spamassassin/BayesBenchmark

The actual benchmark code is here:
http://wiki.apache.org/spamassassin-data/attachments/BayesBenchmark/attachments/benchmark.tar.gz

I think that I've added enough documentation to get you up and running,
but if you have questions, feel free to ask.  Improvements to the
benchmark are also more than welcome.

Thanks
Michael


Re: Those "Re: good obfupills" spams

2006-04-29 Thread Matt Kettler
List Mail User wrote:
>> ...
>> 
>
> Matt Kettler replied:
>
>   
>> John Tice wrote:
>> 
>>> Greetings,
>>> This is my first post after having lurked some. So, I'm getting these
>>> same "RE: good" spams but they're hitting eight rules and typically
>>> scoring between 30 and 40. I'm really unsophisticated compared to you
>>> guys, which raises the question: what am I doing wrong? All I use is a
>>> tweaked user_prefs wherein I have gradually raised the scores on
>>> standard rules found in spam that slips through over a period of time.
>>> These particular spams are over the top on bayesian (1.0), have
>>> multiple database hits, forged rcvd_helo and so forth. Bayesian alone
>>> flags them for me. I'm trying to understand the reason you would not
>>> want to have these types of rules set this high. I must be way
>>> over-optimized: what am I not getting?
>>>   
>> BAYES_99, by definition, has a 1% false positive rate.
>>
>> 
>
>   Matt,
>
>   If we were to presume a uniform distribution between an estimate of
> 99% and 100%, then the FP rate would be .5%, not 1%.
You're right Paul, my bad..

But again, I don't care if it's 0.01%. The question here is: "is jacking
up the score of BAYES_99 to be greater than required_hits a good idea?"
The answer is no, because BAYES_99 is NOT a 100% accurate test; by
definition it has a non-zero FP rate.

>  And for large sites
> (i.e. tens of thousands of messages a day or more), this may be what occurs.
> But what I see, and what I assume many other small sites see, is a very
> non-uniform distribution.  Over the last 30 hours, the average estimate (i.e.
> the value reported in the "bayes=xxx" clause) for spam hitting the BAYES_99
> rule is .41898013269, with about two thirds of them reporting bayes=1 and
> a lowest value of bayes=0.998721756590216.
>   
Yes, that's to be expected with Chi-Squared combining.
>   While SA is quite robust, largely because of the design feature that
> no single reason/cause/rule should by itself mark a message as spam, I have
> to guess that the FP rate that the majority of users see for BAYES_99 is far
> below 1%.  From the estimators reported above, I would expect to have seen
> a .003% FP rate for the last day plus a little, if only I had received
> 100,000 or so messages, enough to be able to see it :).
>   
True, but it's still not nearly zero. Even in the corpus testing, which
is run by "the best of the best" in SA administration and maintenance,
BAYES_99 matched 0.0396% of ham, or 21 out of 53,091 hams. (Based on
set-3 of SA 3.1.0)

Given we are dealing with a user who doesn't even understand why you might
not want this set "high enough", I would expect the level of
sophistication in bayes maintenance to be low as well.

Besides.. If you want to make a mathematics-based argument against me,
start by explaining how the perceptron is mathematically flawed. It
assigned the original score based on real-world data, not our vast
oversimplifications. You should have good reason to question its design
before second-guessing its scoring based on speculation such as this.

>   I don't change the scoring from the defaults, but if people were to
> want to, maybe they could change the rules (or add a rule) for BAYES_99_99
> which would take only scores higher than bayes=. and which (again with
> a uniform distribution) would have an expected FP rate of .005% - then
> re-score that closer to (but still less than) the spam threshold,

I'd agree.. However, the OP has already made BAYES_99 > required_hits.
Bad idea. Period.



RE: Bayes troubles

2006-04-29 Thread Will Nordmeyer
OK...  I did the greps you recommended and didn't find any use_dcc lines...
I even did:

grep use_dcc /home/sites/*/users/*/.spamassassin/user_prefs and still didn't
find anything (checking all user directories).

(Actually, my running SA build is in /home/spam-filter... (bin, share, etc.)
- I'm on a Cobalt RaQ and can't upgrade the primary Perl to Perl 5.8, so I
made a little subsystem.)

I found the report_contact flag in the 10_misc.cf in both
/usr/share/spamassassin & /home/spam-filter/share/spamassassin.  

I have an old build in /usr/share/spamassassin that I need to delete (thanks
for reminding me).

I think I'll hold out until V3.1.2 is released since, according to traffic
here, it is fairly close.  (Maybe I'll download & install the latest razor
and be razoring as well now).

Thanks Matt!

--Will

-Original Message-
From: Matt Kettler [mailto:[EMAIL PROTECTED] 
Sent: Friday, April 28, 2006 11:40 AM
To: Will Nordmeyer
Cc: users@spamassassin.apache.org
Subject: Re: Bayes troubles

Will Nordmeyer wrote:
> Matt,
> 
> I ran lint this AM (I frequently forget that part :-)), and only had 2
> issues - 
> 
> warn: config: failed to parse line, skipping: use_dcc 1
> warn: config: warning: score set for non-existent rule RAZOR2_CHECK
> 
> I can't find where the use_dcc or the RAZOR2_CHECK are set though. None of
> the .cf files in /etc/mail/spamassassin have them.

Perhaps a user_prefs has them.
Or, if you have "inherited" a system, someone edited the /usr/share/ files?
Or maybe someone put it in a .pre file in /etc/mail/spamassassin?


grep use_dcc /usr/share/spamassassin/*.cf
grep use_dcc /etc/mail/spamassassin/*.cf
grep use_dcc /etc/mail/spamassassin/*.pre
grep use_dcc ~/.spamassassin/user_prefs

> 
> I tried running the spamassassin --lint --debug and dumped the dbg output
> to a file, but apparently I'm screwing up the redirect because my output
> file is always empty.

You can't redirect the debug output with > or |. It is output to stderr, not
stdout.

In bash-type shells you can redirect stderr using 2> instead of >.

> 
> I'm running via spamd and have restarted spamd.  By the way, I'm running
> V3.1.1 (and for some reason it puts @@CONTACT_ADDRESS@@ in the emails saying
> that spam detection software running on blah blah blah - know how I can
> easily fix that without having to rebuild?).


That makes me fairly concerned about the integrity of the build. I'd
strongly suggest rebuilding anyway.

That said, you can edit /usr/share/spamassassin/10_misc.cf and edit the
report_contact option there. BE VERY careful editing this, and be sure to
lint afterwards.
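
For example, with a placeholder address (report_contact takes a free-form
string):

  report_contact  postmaster@example.com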

Note: In the general case I would advise against editing any of the .cf
files in /usr/share/spamassassin. They will all be obliterated and
re-written if you upgrade or re-install. In this case, that's perfectly fine.





SA & Razor problem - help requested

2006-04-29 Thread David Flanigan
Hello SpamAssassins,

I am having an odd problem; I was hoping for some insight from those more
adept than I.

I am trying to get Razor working with SpamAssassin, to little effect. To put
it simply, SA never uses Razor, and I have never in thousands of messages
(http://www.flanigan.net/spam) seen even a single Razor hit. However, I get no
errors in the error logs. The only error I see is on a `spamassassin --lint`
which says:

[8611] warn: config: failed to parse line, skipping: use_razor2__1
[8611] warn: config: failed to parse line, skipping: razor_config 
__/etc/mail/spamassassin/.razor/razor-agent.conf


Oddly, I get the exact same symptoms with DCC. 

I have compiled SA from scratch and installed it over the existing install
just to make sure.

I have searched the mailing list, and I followed the wiki guide at
spamassassin.apache.org for installing razor with SA. I have verified that
both SA and Razor work on their own, and have fed razor several test
messages. SA works fine other than the razor problems.


My config:
1. I am invoking spamc 3.1.1 through /etc/procmailrc using a simple:

:0fw
| /usr/bin/spamc

2. spamd is called with the following args: "-u spamd -d -x -m5
-H /etc/mail/spamassassin/ -r /var/run/spamd.pid"
3. I am running version 2.8.1 of the razor clients.
4. I am running the above on Linux Fedora Core 5 (kernel 2.6.16-1.2096_FC5).
5. My local.cf has the following lines for razor:

use_razor2  1
razor_config /etc/mail/spamassassin/.razor/razor-agent.conf


Any advice you could offer would be greatly appreciated!




---
Kind Regards,
David

http://www.flanigan.net



Re: Those "Re: good obfupills" spams

2006-04-29 Thread List Mail User
>...

Matt Kettler replied:

>John Tice wrote:
>>
>> Greetings,
>> This is my first post after having lurked some. So, I'm getting these
>> same "RE: good" spams but they're hitting eight rules and typically
>> scoring between 30 and 40. I'm really unsophisticated compared to you
>> guys, which raises the question: what am I doing wrong? All I use is a
>> tweaked user_prefs wherein I have gradually raised the scores on
>> standard rules found in spam that slips through over a period of time.
>> These particular spams are over the top on bayesian (1.0), have
>> multiple database hits, forged rcvd_helo and so forth. Bayesian alone
>> flags them for me. I'm trying to understand the reason you would not
>> want to have these types of rules set this high. I must be way
>> over-optimized: what am I not getting?
>
>
>BAYES_99, by definition, has a 1% false positive rate.
>

Matt,

If we were to presume a uniform distribution between an estimate of
99% and 100%, then the FP rate would be .5%, not 1%.  And for large sites
(i.e. tens of thousands of messages a day or more), this may be what occurs.
But what I see, and what I assume many other small sites see, is a very
non-uniform distribution.  Over the last 30 hours, the average estimate (i.e.
the value reported in the "bayes=xxx" clause) for spam hitting the BAYES_99
rule is .41898013269, with about two thirds of them reporting bayes=1 and
a lowest value of bayes=0.998721756590216.

While SA is quite robust, largely because of the design feature that
no single reason/cause/rule should by itself mark a message as spam, I have
to guess that the FP rate that the majority of users see for BAYES_99 is far
below 1%.  From the estimators reported above, I would expect to have seen
a .003% FP rate for the last day plus a little, if only I had received
100,000 or so messages, enough to be able to see it :).

I don't change the scoring from the defaults, but if people were to
want to, maybe they could change the rules (or add a rule) for BAYES_99_99
which would take only scores higher than bayes=. and which (again with
a uniform distribution) would have an expected FP rate of .005% - then
re-score that closer to (but still less than) the spam threshold, or add a
point or fraction thereof to raise the score to just under the spam threshold
(adding a new rule would avoid having to edit distributed files and thus
would probably be the "better" method).

Anyway, to better address the OP's questions:  The system is more
robust if, instead of changing the weighting of existing rules (assuming that
they were correctly established to begin with), you add more possible inputs
(and preferably independent ones - i.e. where the FPs between rules have a
low correlation).  Simply increasing scores will improve your spam "capture"
rate, just as decreasing the spam threshold will - but both methods will add
to the likelihood of false positives.  Look into the distributed documentation
to see the expected FP rates at different spam threshold levels for numbers
to drive this point home (and changing specific rules' scores is just like
changing the threshold, but in a non-uniform fashion - unless you actually
measure the values for your own site's mail and recompute numbers that are
a better estimate for local traffic).

Paul Shupak
[EMAIL PROTECTED]


Re: span float obfuscation

2006-04-29 Thread MATSUDA Yoh-ichi
Kenneth-san, thank you for your kind advice.
I've posted the new rules to Bugzilla.
But it's a little bit difficult for me. ^^;

BTW, I have more rules for catching various types of spam.
Which is the better way to post new rules?
 (1) first post the new rules to this users ML, then post them to Bugzilla
 (2) post the new rules directly to Bugzilla

From: Kenneth Porter <[EMAIL PROTECTED]>
Subject: Re: span float obfuscation
Date: Fri, 28 Apr 2006 10:05:56 -0700

> On Saturday, April 29, 2006 1:48 AM +0900 MATSUDA Yoh-ichi
> <[EMAIL PROTECTED]> wrote:
> 
> > May I post my rules to Bugzilla?
> 
> Sounds good to me. I would have done so myself but wanted to make sure you 
> get attribution. You'll probably want to subscribe to the -devel list as 
> all bugzilla traffic goes through there. And as the wiki page recommends, 
> attach a sample spam to illustrate what the rule is supposed to catch.
> 
> Once the rule is captured in bugzilla, a dev can get it into the automated 
> testing sandbox and we can see how the rule performs on their corpora.
> 
> 

--
Nothing but a peace sign.
MATSUDA Yoh-ichi(yoh)
mailto:[EMAIL PROTECTED]
http://www.flcl.org/~yoh/diary/ (only Japanese)


Re: Those "Re: good obfupills" spams

2006-04-29 Thread jdow

From: "Loren Wilton" <[EMAIL PROTECTED]>


This is my first post after having lurked some. So, I'm getting these
same "RE: good" spams but they're hitting eight rules and typically
scoring between 30 and 40. I'm really unsophisticated compared to you
guys, which raises the question: what am I doing wrong? All I use is a
tweaked user_prefs wherein I have gradually raised the scores on
standard rules found in spam that slips through over a period of
time. These particular spams are over the top on bayesian (1.0), have
multiple database hits, forged rcvd_helo and so forth. Bayesian alone
flags them for me. I'm trying to understand the reason you would not
want to have these types of rules set this high. I must be way
over-optimized: what am I not getting?


The danger with tweaking standard rule scores you probably already know: you
are at least theoretically likely to get more false positives, because the
score set was optimized for the original scores.

Of course, everyone tweaks a few scores at least.  After all, that is why
they are tweakable.  As long as you watch your spam bucket for FPs you can go
pretty high on things.  Looking at today's spam I only see one of these, but
it scored around 30.  I have a bunch of the Re: news kind that all scored
35-39.

   Loren


And most of those which are not blacklists are from 88_FVGT_body.cf.

{^_^}Joanne 



Re: Those "Re: good obfupills" spams

2006-04-29 Thread Loren Wilton
> This is my first post after having lurked some. So, I'm getting these
> same "RE: good" spams but they're hitting eight rules and typically
> scoring between 30 and 40. I'm really unsophisticated compared to you
> guys, which raises the question: what am I doing wrong? All I use is a
> tweaked user_prefs wherein I have gradually raised the scores on
> standard rules found in spam that slips through over a period of
> time. These particular spams are over the top on bayesian (1.0), have
> multiple database hits, forged rcvd_helo and so forth. Bayesian alone
> flags them for me. I'm trying to understand the reason you would not
> want to have these types of rules set this high. I must be way
> over-optimized: what am I not getting?

The danger with tweaking standard rule scores you probably already know: you
are at least theoretically likely to get more false positives, because the
score set was optimized for the original scores.

Of course, everyone tweaks a few scores at least.  After all, that is why
they are tweakable.  As long as you watch your spam bucket for FPs you can go
pretty high on things.  Looking at today's spam I only see one of these, but
it scored around 30.  I have a bunch of the Re: news kind that all scored
35-39.

Loren



Re: Those "Re: good obfupills" spams

2006-04-29 Thread jdow

From: "Matt Kettler" <[EMAIL PROTECTED]>


jdow wrote:





BAYES_99, by definition, has a 1% false positive rate.


That is what Bayes thinks. I think it is closer to something between
0.5% and 0.1% false positive. I have mine trained down lethally fine
at this point, it appears.


Ok.. Fine, let's take 0.1% FP rate, 10x better than theoretical, but
still realistic at some sites.. Even still.. Is that low enough to be
worth assigning >5.0 points to?

No.


So far, however, it has been worth 5.0 points. I've had it (actually)
false positive maybe once in the last month. I've had SA mismark some
BAYES_99 spam, however. The spam had other characteristics that earned
a slight negative score.

(I've since developed some meta rules that are reducing this. If the
email is from a mailing list I know, I give a modest negative score.
Then, if the Bayes score is high or very high, I award some positive points.
High plus mailing list is about 2 points, with mailing list being -1.5.
Very high adds another 2 points. That second two points MAY have to be
fine-tuned upwards.)
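
A rough sketch of the shape of those metas; the List-Id pattern is a
placeholder, and the scores are the approximate values described above:

  header __KNOWN_LIST    List-Id =~ /mylist\.example\.org/i
  meta   KNOWN_LIST      __KNOWN_LIST
  score  KNOWN_LIST      -1.5
  meta   LIST_BAYES_HI   (__KNOWN_LIST && (BAYES_95 || BAYES_99))
  score  LIST_BAYES_HI   2.0
  meta   LIST_BAYES_VHI  (__KNOWN_LIST && BAYES_99)
  score  LIST_BAYES_VHI  2.0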

{^_^}