Re: scores too low - neural network problem?

2005-03-06 Thread Andrew Schulman
 What is the output of this on your messages?

   spamassassin -tD 2>&1 | pager

 What value does it show for BAYES_99 in the content analysis section?
 If it says something other than 4.07 then it confirms that you are not
 running with values from the third column (network tests off).  It sounds
 instead like you are running with network tests enabled.  Are network
 tests enabled in the debugging output?

Thank you, this was correct.  I thought I had disabled the network tests, but 
I hadn't.  I've disabled them now, and the scoring has returned to what I 
thought it should be.

Regards, Andrew.


Re: scores too low - neural network problem?

2005-03-06 Thread Andrew Schulman
  I understand that the individual test scores are fed through a neural
  network to derive the final score.  So it seems that this network has
  started to behave badly.  

 You misunderstand.  The neural network (or whatever they're using these
 days - it at least used to be a genetic algorithm) is used to assign the
 default scores, not to adjust the scores after the fact.

Thank you, you're right.  I had misunderstood that.

 More likely one of two things is happening: that header was added by
 another system running SpamAssassin, or you aren't running with the
 configuration you think you are.

You're right-- I thought I had disabled the network tests, but I hadn't, so I 
wasn't getting the scores I thought I was.  I disabled the network tests, and 
the problem is solved now.

Regards, Andrew.


scores too low - neural network problem?

2005-03-05 Thread Andrew Schulman
I'm running spamc/spamd 3.0.2 in Debian.  I have Bayesian tests turned on,
and network tests off.

Lately a lot of spam has been getting through to my mailbox.  SA's false
negative rate used to be about 1%; now it's about 50%.  Looking at the
headers for the spam that's getting through, I see that the Bayesian filter
is working correctly: almost all of the spam is tagged as BAYES_95 or
BAYES_99.  My score threshold is 5, the BAYES_99 test alone (using its
default value) is worth 4.07, and a few other tests are usually positive as
well.  Yet, the total score is around 2.5.  Here's a sample from today:

X-Spam-Status: No, score=2.7 required=5.0 tests=BAYES_99,HTML_20_30,
 HTML_FONT_INVISIBLE,HTML_IMAGE_ONLY_24,HTML_MESSAGE autolearn=no 
 version=3.0.2

The scores from the tests listed here should add up to about 5.3, but as you
can see, the total is only 2.7.  So this one gets through.
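
For anyone who wants to check headers like this in bulk, here is a minimal
Python sketch that pulls the score, threshold, and test names out of an
X-Spam-Status header.  The function name parse_spam_status is hypothetical,
and the sketch assumes the header has the 3.0.x layout shown above; other
versions may format it differently.

```python
import re

def parse_spam_status(header):
    """Parse an X-Spam-Status header laid out like the sample above."""
    # Unfold continuation lines: a newline followed by whitespace.
    value = re.sub(r"\r?\n[ \t]+", " ", header.strip())
    # Drop the "X-Spam-Status:" field name.
    value = value.split(":", 1)[1].strip()
    # The verdict ("Yes" or "No") comes before the first comma.
    verdict = value.split(",", 1)[0].strip()
    score = float(re.search(r"\bscore=(-?[\d.]+)", value).group(1))
    required = float(re.search(r"\brequired=(-?[\d.]+)", value).group(1))
    # The test list runs until the next " key=" pair or the end of the value.
    tests_match = re.search(r"\btests=(.*?)(?=\s+\w+=|$)", value)
    tests = [t.strip() for t in tests_match.group(1).split(",") if t.strip()]
    return {"is_spam": verdict.lower() == "yes", "score": score,
            "required": required, "tests": tests}

sample = (
    "X-Spam-Status: No, score=2.7 required=5.0 tests=BAYES_99,HTML_20_30,\n"
    " HTML_FONT_INVISIBLE,HTML_IMAGE_ONLY_24,HTML_MESSAGE autolearn=no \n"
    " version=3.0.2"
)
result = parse_spam_status(sample)
print(result["score"], result["required"], result["tests"])
```

The parsed test names can then be compared against the configured scores to
see which total was expected.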

I understand that the individual test scores are fed through a neural
network to derive the final score.  So it seems that this network has
started to behave badly.  

Can anyone shed any light on this?  Is it a well-known problem?  What's the
preferred way to address it?  Remove all of SA's learned information and
retrain the network?

Thanks,
Andrew.


Re: scores too low - neural network problem?

2005-03-05 Thread Bob Proulx
Andrew Schulman wrote:
 I'm running spamc/spamd 3.0.2 in Debian.  I have Bayesian tests turned on,
 and network tests off.

I am running a similar system, but with network tests turned on.  The
network tests such as SURBL[1] are huge factors in increasing spam
classification accuracy for me.

 almost all of the spam is tagged as BAYES_95 or BAYES_99.  My score
 threshold is 5, the BAYES_99 test alone (using its default value) is
 worth 4.07, and a few other tests are usually positive as
 well.  Yet, the total score is around 2.5.

Of course, as you are aware, each test has four scores.

   The first score is used when both Bayes and network tests
   are disabled (score set 0). The second score is used when
   Bayes is disabled, but network tests are enabled (score set
   1). The third score is used when Bayes is enabled and
   network tests are disabled (score set 2). The fourth score
   is used when Bayes is enabled and network tests are enabled
   (score set 3).

The default for BAYES_99 in SA-3.0.2 is:

  score BAYES_99 0 0 4.070 1.886
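
To make the column selection concrete, here is a rough Python sketch of how
the score set number follows from the two settings, applied to the BAYES_99
line above.  The function names are made up for illustration, and it only
handles "score" lines that list all four values; it is not SpamAssassin's
own code.

```python
def score_set_index(bayes_enabled, network_enabled):
    """Return the score set number (0-3) described in the text above."""
    # Bayes contributes 2 to the index, network tests contribute 1.
    return (2 if bayes_enabled else 0) + (1 if network_enabled else 0)

def parse_score_line(line):
    """Split a four-value 'score NAME a b c d' line into name and scores."""
    parts = line.split()
    if len(parts) != 6 or parts[0] != "score":
        raise ValueError("expected 'score NAME' followed by four values")
    return parts[1], [float(p) for p in parts[2:]]

def effective_score(line, bayes_enabled, network_enabled):
    """Pick the score that applies for the given configuration."""
    _name, scores = parse_score_line(line)
    return scores[score_set_index(bayes_enabled, network_enabled)]

bayes_99 = "score BAYES_99 0 0 4.070 1.886"
# Bayes on, network tests off: score set 2, the third value.
print(effective_score(bayes_99, True, False))   # 4.07
# Bayes on, network tests on: score set 3, the fourth value.
print(effective_score(bayes_99, True, True))    # 1.886
```

So the same BAYES_99 hit is worth 4.070 or 1.886 depending only on whether
network tests are enabled.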

I got confused by this exact thing while debugging a problem of mine a
while ago.  I thought I was using one column but was really getting
data from the other.

What is the output of this on your messages?

  spamassassin -tD 2>&1 | pager

What value does it show for BAYES_99 in the content analysis section?
If it says something other than 4.07 then it confirms that you are not
running with values from the third column (network tests off).  It sounds
instead like you are running with network tests enabled.  Are network
tests enabled in the debugging output?

 I understand that the individual test scores are fed through a neural
 network to derive the final score.  So it seems that this network has
 started to behave badly.  

Because you are getting the BAYES_99 tag I am sure the bayes engine is
working properly.  You are seeing a scoring difference instead.

 Can anyone shed any light on this?  Is it a well-known problem?  What's the
 preferred way to address it?  Remove all of SA's learned information and
 retrain the network?

Don't retrain!  I am convinced by your evidence that you are actually
running with network tests enabled.  Compare the result with the
following.  Does this give you the results you were looking for?

  spamassassin -L -tD 2>&1 | pager

Bob

[1] http://www.surbl.org/


Re: scores too low - neural network problem?

2005-03-05 Thread Kelson Vibber
On Saturday 05 March 2005 1:21 pm, Andrew Schulman wrote:
 I understand that the individual test scores are fed through a neural
 network to derive the final score.  So it seems that this network has
 started to behave badly.  

You misunderstand.  The neural network (or whatever they're using these days - 
it at least used to be a genetic algorithm) is used to assign the default 
scores, not to adjust the scores after the fact.

More likely one of two things is happening: that header was added by another 
system running SpamAssassin, or you aren't running with the configuration you 
think you are.

Double-check your config and make sure network tests really are disabled.  I 
added up the scores for the tests you mentioned using the 4th column (Bayes + 
network both enabled) and it comes out to 2.65 - which would round to the 2.7 
you're seeing.

-- 
Kelson Vibber
SpeedGate Communications www.speed.net