Re: update on floating dividing score between spam and ham messages

2005-07-18 Thread Justin Mason

btw, I was just rereading this -- an interesting approach you might
want to experiment with, is having *two* boundaries.  ie:

negative scores <------------------------> positive scores
     ---------------|-------------|---------------
                    |             |
          ham       | ..unsure..  |     spam


if a mail scores <= ham threshold, it's ham; >= spam threshold, it's spam;
and > ham threshold and < spam threshold, it's unsure.  this is similar
to the SpamBayes UI.
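
A minimal sketch of the two-boundary idea in Python. The function name and the example threshold values (0.5 and 5.0) are placeholders chosen for illustration; nothing here is taken from SpamAssassin or SpamBayes code.

```python
def classify(score, ham_threshold, spam_threshold):
    """Return "ham", "unsure" or "spam" for one message score."""
    if ham_threshold >= spam_threshold:
        raise ValueError("ham_threshold must be below spam_threshold")
    if score <= ham_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    # Strictly between the two boundaries.
    return "unsure"


if __name__ == "__main__":
    # Hypothetical thresholds, for illustration only.
    for s in (-1.0, 3.0, 7.5):
        print(s, classify(s, ham_threshold=0.5, spam_threshold=5.0))
```

What to do with the "unsure" bucket (a separate folder, manual review, and so on) is a policy choice that this sketch leaves open.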

--j.

Joe Flowers writes:
 I don't know if this will help anyone or not, but I wanted to report 
 back just in case.
 
 In early April, I completely unhinged the dividing line between what SA 
 score is used to mark a message as spam or ham (5.00 = default). This 
 allows the system and this dividing line to drift freely to anywhere 
 that SA will allow, without bound. This anti-spam setup has worked 
 consistently much much better the whole time than in any previous 
 implementation that we have done and with very little maintenance. We 
 are very happy with it and are looking forward to implementing future SA 
 versions in the same fashion.
 
 I'm not exactly sure the following numbers represent the whole time 
 since April, but they should be pretty close.
 
 We've had 360,922 spam messages and 396,983 ham messages with a 
 normalized average spam score of 6.8714134 and a normalized average ham 
 score of -2.1532284.  I have the dividing line set at 30% of the 
 distance between the average ham score and average spam score (30% above 
 the average ham score). So, the dividing line is currently floating 
 around 0.55416414.
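
The arithmetic in that paragraph can be checked with a few lines of Python. This is only a sketch of the stated formula (average ham score plus 30% of the ham-to-spam distance); the function name is made up, and Joe's actual code is not shown in the thread.

```python
def floating_threshold(avg_ham, avg_spam, fraction=0.30):
    """Dividing line placed `fraction` of the way from avg_ham to avg_spam."""
    return avg_ham + fraction * (avg_spam - avg_ham)


if __name__ == "__main__":
    # Averages quoted in the message above.
    print(round(floating_threshold(-2.1532284, 6.8714134), 8))  # prints 0.55416414
```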
 
 Apart from the default SA install, the only thing I have changed is
 1. Turned off auto-learn --- I think this is very important.
 2. Set SA to ignore our custom spam score tag in the message headers.
 
 We are currently running SA v3.02.
 
  From time to time, but not very often (a couple of times every two 
 weeks or so), I do feed bayes (sa-learn) with a few messages that are 
 misplaced. I don't know the stats, but we have very few false positives, 
 so I'm mostly feeding bayes with the false negatives which consist of 
 the new/different message tricks that the spammers are using.
 
 Everyone here has been very happy with the results. It's been much much 
 better than any implementation in the past.
 Many thanks to the SA developers! Rock on!
 
 Joe



Re: update on floating dividing score between spam and ham messages

2005-07-18 Thread Joe Flowers

Justin,

Do you have suggestions on how I should come up with the two boundary 
lines and what do I do with the unsure messages?

I'm all ears.

Joe






Re: update on floating dividing score between spam and ham messages

2005-07-12 Thread Joe Flowers

Kai Schaetzl wrote:


Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400:

 

> That's bad, really bad detection ...

No. It's good, really good detection.

> You should improve that instead of trying to find a
> barrier which gives you the best FP:FN ratio.

I'm not trying to find the best FP:FN ratio and I never said I was.

> It may indeed give you the best ratio with your bad setup

No. It's a good setup, and I won't tell our users that they should stop 
complimenting us on its success, because the results speak for 
themselves and back up those compliments. And yes, it does work very 
well in the short and long run - you are purely speculating that it does not.

> BTW: what does normalized exactly mean in this context?

For lack of a better term, I used "normalized". If the score of 
a message is more than 30 points (or 25, I'm not going to waste time 
looking back at the code) away from the nearest average, then I set the 
score for the message back to 30 points away from the nearest average. 
It's to help keep the drifting averages from changing rapidly and abruptly. It 
does not prevent them from drifting anywhere, however.
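
One possible reading of that "normalization", sketched in Python. The original code is not shown in the thread, so the function names, the choice of 30 as the limit, and the incremental running-average update are all assumptions made for illustration.

```python
def clamp_to_nearest_average(score, avg_ham, avg_spam, max_distance=30.0):
    """Pull an extreme score back to within max_distance of the nearer average."""
    if abs(score - avg_ham) <= abs(score - avg_spam):
        nearest = avg_ham
    else:
        nearest = avg_spam
    return min(max(score, nearest - max_distance), nearest + max_distance)


def update_average(avg, count, score):
    """Fold one (already clamped) score into a running mean; returns (avg, count)."""
    count += 1
    return avg + (score - avg) / count, count
```

Clamping limits how far a single outlier can move an average in one step, but the averages themselves remain free to drift over time.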


It sounds like you have put in a lot of time to become an expert in the 
traditional wisdom of SA and to tune it accordingly. And, I assume you 
spend a lot of time keeping it tuned and dealing with SA upgrades. I'm 
glad you have that time. But my situation is different, and I agree 
with some of the critics of SA - that it requires or almost requires an 
expert to tune it properly and to keep it tuned properly - at least with 
the traditional wisdom of how it should be set up. My goal is not to 
become an SA expert; it would be nice if it happened along the way 
somewhere down the road, but it is not a goal. Also, I have no plans of 
hiring an SA expert unless it comes as a bonus along with the more 
important things we are doing here. It will not be a deciding factor. 
It's nice, but it is not primary.


And again, you are wrong. It is a very good setup (the proof is in the 
pudding) and the only thing I see being changed in the future is an 
upgrade to the latest version of SA.


Joe



Re: update on floating dividing score between spam and ham messages

2005-07-12 Thread Kai Schaetzl
Joe Flowers wrote on Tue, 12 Jul 2005 11:55:36 -0400:

>> That's bad, really bad detection ...
>
> No. It's good, really good detection.

Sorry, I don't want to be rude by repeating myself, but if your average spam 
score is something like 6-something, the *detection* *is* bad. Maybe not the 
end result, but the pure spam detection. And that's also the reason why you had 
to try to find a method which lowers the threshold without giving you too 
many false positives. If your spam scored high enough, you simply wouldn't 
need to do that. That's, by the way, exactly what you said yourself:

>> But anything you can do that widens the
>> typical score distribution between ham and spam is a good thing.
>
> Amen



> For lack of a better term in mind, I used normalized. If the score of
> a message is more than 30 points (or 25, I'm not going to waste time
> looking back at the code) away from the nearest average, then I set the
> score for the message back to 30 points away from the nearest average.

OK, I see you want to avoid peaks; that makes sense.

> It sounds like you have put in a lot of time to become an expert in the
> traditional wisdom of SA and to tune it accordingly.

Not more than others here. Not really too much time.

> And, I assume you
> spend a lot of time keeping it tuned and dealing with SA upgrades.

Not at all. Some time ago I carefully crafted a combination of my own rules plus 
SARE rules, trained it on a lot of spam and ham at first, and now 
just let it run; SARE updates are done automatically by rulesdujour. I haven't 
paid much attention to it for probably a year now. Just some upgrading to SA 
3.1.* recently and maybe choosing a different SARE ruleset here and there.

> I'm
> glad you have that time. But my situation is different, and I agree
> with some of the critics of SA - that it requires or almost requires an
> expert to tune it properly and to keep it tuned properly

You do need some time to understand how it all works together, but then 
you don't need to apply too much care anymore, really. Of course, you should 
stay up to date with releases and the rulesets you use, but that's not a daily 
task at all.

> And again, you are wrong. It is a very good setup (the proof is in the
> pudding)

As I said earlier, if you look closer at the pudding, I'm sure that your false 
positive rate is much higher than ours. Do you have a proven figure for your FP 
rate?
To make it clear: I don't want to say that you have bad results from your 
setup. But I'm quite convinced that your FP rate could be much better if you 
tried to widen the gap between ham and spam by applying more rules that 
are able to classify spam, and maybe by fine-tuning a few rule scores (e.g. if 
your BAYES_99 is reliable you should boost it to 3 or 4; it's overly low in the 
3.0 setups). Your peaks are only 8 score points away from each other. Ours are 
more than 20 points away from each other and the valley between them is really 
low. Which means even if I slide the threshold by one absolute score point, 
from 5 to 6 or down to 4, I won't get a much different detection rate because 
there are so few messages scoring in that range. I'm sure that many on this list 
would have similar results if they did that. But I suppose you can't do that 
because your gap is simply too small. What happens if you move your threshold 
one score point down or up?
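
Kai's closing question can be answered from a list of logged scores. The sketch below, with a hypothetical function name and made-up sample data, reports what fraction of messages score within one point of the threshold, i.e. the messages whose verdict would flip if the threshold moved up or down by that amount (assuming a message counts as spam when its score is at or above the threshold).

```python
def fraction_in_band(scores, threshold, delta=1.0):
    """Fraction of messages whose verdict changes if the threshold moves by +/- delta."""
    if not scores:
        return 0.0
    flipped = sum(1 for s in scores if threshold - delta <= s < threshold + delta)
    return flipped / len(scores)


if __name__ == "__main__":
    sample = [-3.0, 0.0, 4.5, 5.2, 12.0, 25.0]  # illustrative scores only
    print(fraction_in_band(sample, threshold=5.0))
```

A small value means the valley between the two peaks is nearly empty, which is the situation Kai describes for his own setup.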



Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de  http://msie.winware.org





Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Matt Kettler
Joe Flowers wrote:
 I don't know if this will help anyone or not, but I wanted to report
 back just in case.
 
 In early April, I completely unhinged the dividing line between what SA
 score is used to mark a message as spam or ham (5.00 = default). This
 allows the system and this dividing line to drift freely to anywhere
 that SA will allow, without bound. This anti-spam setup has worked
 consistently much much better the whole time than in any previous
 implementation that we have done and with very little maintenance. We
 are very happy with it and are looking forward to implementing future SA
 versions in the same fashion.
 
 I'm not exactly sure the following numbers represent the whole time
 since April, but they should be pretty close.
 
 We've had 360,922 spam messages and 396,983 ham messages with a
 normalized average spam score of 6.8714134 and a normalized average ham
 score of -2.1532284.  I have the dividing line set at 30% of the
 distance between the average ham score and average spam score (30% above
 the average ham score). So, the dividing line is currently floating
 around 0.55416414.


The only problem I see with this approach is that it treats false positives and
false negatives as being equally bad.

In general, you're adjusting the score bias so the numbers of FPs and FNs are
approximately equal. Although STATISTICS*.txt would suggest that this boundary
occurs somewhere near 2.0, your own local biases could change this considerably.


SA's normal scoreset is evolved with the concept that it's better to have 99
false negatives than 1 false positive. The concept here is most people use
scripts to move their spam into a separate folder, or auto delete it. With that
going on, a FP is potentially lost valid email, whereas a FN is a minor
inconvenience.

For any site that considers FPs to be not too bad because all mail is manually
examined anyway, lowering the score threshold may be a workable thing.

However, other sites that auto-delete such messages may have considerable
problems if they lower the threshold.
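
To make the asymmetry concrete, here is a hedged Python sketch that picks a threshold by weighting each false positive far more heavily than a false negative. The weight of 99 is only an illustrative reading of the "99 false negatives versus 1 false positive" remark; the function names and data are invented, and this is not how SpamAssassin's scores are actually generated.

```python
def weighted_cost(ham_scores, spam_scores, threshold, fp_weight=99.0):
    """Cost of a threshold: each FP costs fp_weight, each FN costs 1."""
    fp = sum(1 for s in ham_scores if s >= threshold)   # ham flagged as spam
    fn = sum(1 for s in spam_scores if s < threshold)   # spam let through
    return fp_weight * fp + fn


def best_threshold(ham_scores, spam_scores, candidates, fp_weight=99.0):
    """Candidate threshold with the lowest weighted cost."""
    return min(candidates,
               key=lambda t: weighted_cost(ham_scores, spam_scores, t, fp_weight))
```

A site that reviews all flagged mail by hand could use a much smaller `fp_weight`, which tends to pull the chosen threshold down.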



Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers

Matt Kettler wrote:


> The only problem I see with this approach is that it treats false positives and
> false negatives as being equally bad.

We do get many more false negatives than false positives, even though we 
don't get false positives very often - they are rare.

We certainly don't get 1 fp for every fn.

> In general, you're adjusting the score bias so the number of FP's and FNs are
> approximately equal.

This is not what we are seeing in practice. It's not even close to 50-50.

> Although STATISTICS*.txt would suggest that this boundary
> occurs somewhere near 2.0, your own local biases could change this considerably.
>
> SA's normal scoreset is evolved with the concept that it's better to have 99
> false negatives than 1 false positive.

We are very glad and happy about this concept and implementation.

> The concept here is most people use
> scripts to move their spam into a separate folder, or auto delete it. With that
> going on, a FP is potentially lost valid email, whereas a FN is a minor
> inconvenience.

Yes. We work hard to inform our users and to actively solicit their 
feedback on how the system is working and to look out for the system 
misplacing emails, especially valid ones. I know it's still not perfect.

> For any site that considers FPs to be not too bad because all mail is manually
> examined anyway, lowering the score threshold may be a workable thing.
>
> However, other sites that auto-delete such messages may have considerable
> problems if they lower the threshold

YES!





Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Justin Mason

There's another thing worth noting -- the SpamAssassin score distribution
for hams and spams isn't even.

If you draw a graph of hams and spams, plotting the number of mails in
each category as the vertical axis and the score they get as the
horizontal axis, you don't get a simple pair of intersecting straight
lines.

Instead, since we have many more mark-as-spam rules than mark-as-ham,
and due to how the perceptron attempts to optimise for the 5.0
threshold, what happens is that you have two different lines.

The ham line is a sigmoid curve, that starts high in the negative area,
and curves down to almost 0 at the 5.0 threshold mark.  The spam line, by
contrast, is a straight line.
http://taint.org/xfer/2005/score-dist-doodle.gif is a doodle to illustrate
this, or take a look at
http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
for real-world graphs of this data from 2002 -- although graphing
the inverse.
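
For anyone who wants to look at these shapes in their own mail, a rough Python sketch of the graph described above: bucket the scores per class and print the counts side by side. The function names and the one-point bucket width are assumptions; the scores would have to come from your own logs.

```python
import math
from collections import Counter


def score_histogram(scores, bucket_width=1.0):
    """Count messages per score bucket; keys are the lower edge of each bucket."""
    return Counter(math.floor(s / bucket_width) * bucket_width for s in scores)


def print_histogram(ham_scores, spam_scores, bucket_width=1.0):
    """Print one line per bucket: score on one axis, message counts on the other."""
    ham = score_histogram(ham_scores, bucket_width)
    spam = score_histogram(spam_scores, bucket_width)
    for bucket in sorted(set(ham) | set(spam)):
        print(f"{bucket:6.1f}  ham={ham[bucket]:5d}  spam={spam[bucket]:5d}")
```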

Very interesting approach though!

--j.




Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers

Thanks Justin!

That's good, new info for me. That'll help me *at the very least* 
visualize what I am trying to do a little better. I've been very curious 
to know what the rough shapes of those graphs look like.


Joe









Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Matt Kettler
Joe Flowers wrote:
> Matt Kettler wrote:
>
>> The only problem I see with this approach is that it treats false
>> positives and false negatives as being equally bad.
>
> We do get many more false negatives than false positives, even though we
> don't get false positives very often - they are rare.
> We certainly don't get 1 fp for every fn.
>
>> In general, you're adjusting the score bias so the number of FP's and
>> FNs are approximately equal.
>
> This is not what we are seeing in practice. It's not even close to 50-50.

Based on JM's comments about the score distribution for hams being non-linear,
this makes sense. If the distribution was linear for both you'd get 50/50 by
dividing the score between the two means.

Since the ham is going to have a pretty sharp drop-off somewhere slightly above
its mean, your split-score approach won't be as bad as 1:1, but it's also likely
to not be as good as 100:1, which the 5.0 threshold should get you.

It's an interesting concept, and it would be very interesting to graph out FP vs
FN rates against thresholds.
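
The data for such a graph could be produced with something like the Python sketch below. It assumes you already have hand-verified ham and spam scores; the function name and the rule "spam if score >= threshold" are assumptions for illustration.

```python
def error_rates(ham_scores, spam_scores, thresholds):
    """Return (threshold, FP rate, FN rate) for each candidate threshold."""
    rows = []
    for t in thresholds:
        fp = sum(1 for s in ham_scores if s >= t)   # ham scoring as spam
        fn = sum(1 for s in spam_scores if s < t)   # spam scoring as ham
        rows.append((t, fp / len(ham_scores), fn / len(spam_scores)))
    return rows


if __name__ == "__main__":
    ham = [-2.0, -1.0, 0.0, 3.0]     # illustrative data only
    spam = [2.0, 4.0, 6.0, 8.0]
    for t, fp_rate, fn_rate in error_rates(ham, spam, [0, 1, 2, 3, 4, 5]):
        print(f"{t:4}  FP={fp_rate:.2f}  FN={fn_rate:.2f}")
```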

This graph from JM's post is real data:
http://spamassassin.apache.org/presentations/HEANet_2002/img12.html

But it doesn't go below 5.0. It would be interesting to see how those curves
continue as you approach 0.

This graph is a good conceptual one in the normal sense of numbers:
http://taint.org/xfer/2005/score-dist-doodle.gif

That graph would suggest that somewhere below 5.0 there is a threshold at which
the ham FP rate gets MUCH worse in a very sudden way. However, there's no score
associated. I'd venture to guess that your average of the means is going to
wind up picking something near, but just above that threshold.

That's a bit of an intuitive guess, but also it has some roots in reality. The
average score of a ham message on a curve like that is going to wind up being
somewhere in the middle of that nasty drop off. By biasing just above that you
should bring yourself into the second part of the curve, where decreases in
score have a somewhat modest impact on FP rate.


Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Justin Mason

the real-world figures can be seen for various thresholds in
the rules/STATISTICS*.txt files...

--j.




Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread jdow
From: Matt Kettler [EMAIL PROTECTED]

> The only problem I see with this approach is that it treats false
> positives and false negatives as being equally bad.
>
> In general, you're adjusting the score bias so the number of FP's and FNs
> are approximately equal. Although STATISTICS*.txt would suggest that this
> boundary occurs somewhere near 2.0, your own local biases could change this
> considerably.
>
> SA's normal scoreset is evolved with the concept that it's better to have
> 99 false negatives than 1 false positive. The concept here is most people use
> scripts to move their spam into a separate folder, or auto delete it. With
> that going on, a FP is potentially lost valid email, whereas a FN is a minor
> inconvenience.

Operating experience here seems to indicate that the SA score evolution
is not optimum. What you want to do is create a *cough* "brassiere curve"
for the markups for ham and spam. The greater the separation *choke* the
better the results for a decision point between them. The bias to
prevent false negatives probably means you do not want the decision
point right in the center. But anything you can do that widens the
typical score distribution between ham and spam is a good thing. It makes
the decision point less sensitive to set and the overall error rates
lower. I think this is part of the reason I have so much success on a
box vastly overloaded with SARE and other rules. The good rules pile
one on the other until it's VERY clear what is ham and what is spam.

(It surely would be nice if there were some really good indications of
not spam. However, nothing has ever appeared other than absence of
hits on spam-sign.)

{^_^}




Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Loren Wilton
>> score of -2.1532284.  I have the dividing line set at 30% of the
>> distance between the average ham score and average spam score (30% above
>> the average ham score). So, the dividing line is currently floating
>> around 0.55416414.
>
> The only problem I see with this approach is that it treats false
> positives and false negatives as being equally bad.

Matt, isn't he actually treating an FP as ~2x as bad as an FN?  He has the
divider set to 30%, so is biased in one direction or the other.

Which of course means that by picking the ratio value you can pick pretty
much any fp/fn ratio you want.

Loren



Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers

Matt:

I know you know a lot more about this than I do, but for what it's 
worth, your impressions/intuitions are very close to mine.
Originally, back in April, I started off using the average of the 
means, but that let through way too much spam.


So what I have now is the line set at 30% above the average ham score, which 
is 20% below the average of the means.
The assumption is that the optimal spot is somewhere between the two 
averages.


Also, that nasty drop-off that produces a lot of FPs is in my intuition 
too, and as of yet we haven't run into it.


Now, if the two curves could be slid further apart so that there is a big 
dead zone,... Although, without upgrading to a newer version of SA, I 
don't see how I can expect much better results.


BTW, if anyone knows a command-line program that can easily run through a 
bunch of mbox files and tell how many messages are in them, I will 
report back how many ham and how many spam messages I have fed to 
bayes. It's far from perfect, but it may offer some interesting info 
regarding the 100:1 (fn:fp) ratio.


Joe





Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Loren Wilton
> There's another thing worth noting -- the SpamAssassin score distribution
> for hams and spams isn't even.

I don't see that those particular curve shapes in
any way invalidate this method, although they do bias the method somewhat.
The two curves are essentially smooth curves with no major dips or bumps in
them, so it is possible to select a ratio without getting inversions in the
ratio as the selector moves from left to right.  You may have to be careful
of calculating the ratio, given that ham goes to effectively zero above a
certain value.  But n:0 and 3.45n:0 are still perfectly valid ratios to deal
with, even if one of the terms is zero.

Loren



Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers

jdow wrote:

> The greater the separation *choke* the
> better the results for a decision point between them.

> But anything you can do that widens the
> typical score distribution between ham and spam is a good thing.

Amen




Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread jdow
A few weeks ago I'd have said "Easy, Ducky!" Then I ran into Dovecot,
which uses an indexed almost-mbox file. There is no way to do it other
than a good guess. However, for a traditional UNIX mbox file you should
be able to nail it perfectly simply by looking for the "From " separator. The
dirt-stupid mail utility looks for a blank line followed by a line
that starts with "From ". All other lines that start with "From " are supposed
to be escaped to ensure accurate detection. Dovecot skips this blank-line
feature sometimes. mail does not like this. I have not yet seen any
indication that SA is upset with this, however.
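
The counting rule described above (a line starting with "From " that is either the first line or follows a blank line) can be sketched in Python as follows. The function name is made up, and the sketch assumes a traditional mbox in which body lines beginning with "From " have been escaped; it will miscount files that skip the blank line.

```python
def count_mbox_messages(lines):
    """Count "From " separator lines that start the file or follow a blank line."""
    count = 0
    previous_blank = True  # treat the start of the file like a blank line
    for line in lines:
        if previous_blank and line.startswith("From "):
            count += 1
        previous_blank = line.strip() == ""
    return count


if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        # errors="replace" so odd bytes in mail bodies do not stop the count
        with open(path, "r", encoding="utf-8", errors="replace") as fh:
            print(path, count_mbox_messages(fh))
```

Python's standard `mailbox` module can also do this: `len(mailbox.mbox(path))` gives the message count, though its separator detection may differ slightly from the rule sketched here.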

{^_^}
- Original Message - 
From: Joe Flowers [EMAIL PROTECTED]

 Matt:

 I know you know a lot more about this than I do, but for what it's
 worth, you're impressions/intuitions are very close to mine.
 Originally back in April, I started off using the average of the
 means, but that let through way too much spam.

 So, what I have now is it set to 30% above the average spam score, which
 is 20% below the average of the means.
 The assumption being that the optimal spot is somewhere between the two
 averages.

 Also, that nastly drop off that produces a lot of FPs is in my intuition
 too and as of yet, we haven't run into it.

 Now, if the two curves could be slid apart wider so that there is a big
 deadzone,... Although, without upgrading to a newer version of SA, I
 don't see how I can expect much better results.

 BTW, if anyone knows a command line program that can easy run thu a
 bunch of mbox files and tell how many messages are in them, I will
 report back how many ham and how many spam messages that I have fed to
 bayes. It's far from perfect, but it may offer some interesting info
 regarding the 100:1 (fn:fp) ratio.

 Joe


 Matt Kettler wrote:

 Joe Flowers wrote:
 
 
 Matt Kettler wrote:
 
 
 
 The only problem I see with this approach is that it treats false positives
 and false negatives as being equally bad.
 
 
 
 
 We do get many more false negatives than false positives, even though we
 don't get false positives very often - they are rare.
 We certainly don't get 1 fp for every fn.
 
 
 
 In general, you're adjusting the score bias so the number of FPs and FNs
 are approximately equal.
 
 
 This is not what we are seeing in practice. It's not even close to 50-50.
 
 
 
 
 Based on JM's comments about the score distribution for hams being
 non-linear, this makes sense. If the distribution was linear for both you'd
 get 50/50 by dividing the score between the two means.
 
 Since the ham is going to have a pretty sharp drop-off somewhere slightly
 above its mean, your split-score approach won't be as bad as 1:1, but it's
 also likely to not be as good as 100:1, which the 5.0 threshold should get
 you.
 
 It's an interesting concept, and it would be very interesting to graph out
 FP vs FN rates against thresholds.
 
 This graph from JM's post is real data:
 http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
 
 But it doesn't go below 5.0. It would be interesting to see how those
 curves continue as you approach 0.
 
 This graph is a good conceptual one in the normal sense of numbers:
 http://taint.org/xfer/2005/score-dist-doodle.gif
 
 That graph would suggest that somewhere below 5.0 there is a threshold at
 which the ham FP rate gets MUCH worse in a very sudden way. However, there's
 no score associated. I'd venture to guess that your average of the means is
 going to wind up picking something near, but just above that threshold.
 
 That's a bit of an intuitive guess, but also it has some roots in reality.
 The average score of a ham message on a curve like that is going to wind up
 being somewhere in the middle of that nasty drop off. By biasing just above
 that you should bring yourself into the second part of the curve, where
 decreases in score have a somewhat modest impact on FP rate.
 
 
 





Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kai Schaetzl
Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400:

 We are very glad and happy about this concept and implementation.

Well, the big question is: How many of your spam messages score between
the default 5 and your floating score? If it is many, there's obviously
something wrong with your setup: your spam is not scoring high enough.
Additionally, it means that your Bayes auto-learn will feed less spam to
learn than it could because your overall spam score is way too low. Our
average ham score is indeed around -2, as yours is. And it's a very high
peak: there are more mails at -2 than all other ham mails combined. However,
our spam score peak is *way* higher than yours is: it plateaus between 18
and 30, so the average is somewhere around 25 or so. (I deduced that from
looking at the raw figures, not by calculating a median or average.) I
consider your average spam score of 6 as *extremely* bad from a detection
standpoint.
With a score of 0.5 I would get a *considerable* amount of ham scored as
spam. With the default of 5 we get almost none, not even one per day. I
doubt that your rate of FPs is nearly nonexistent with a spam threshold
of 0.5. There *must* be a considerable rate of FPs; you just don't hear
about it.

I think the general approach on this list is to make spam score as spammy
as possible. That's what we do as well. Instead of driving spam to the sky
you are trying to find some non-existing barrier which may indeed float
because tomorrow's messages score differently than yesterday's. It does not
float at all in the long run. And it exists *only* in the long run. It may
throw off the next day's detection quite heavily, since there's no guarantee
spam and ham look the same the next day or even float around that point. It's
not even a statistical figure: you deliberately set it to 30%, probably
because you get too much spam if you set it higher. That's bad, really bad
detection ...
If much of your spam is lower than 5 then the spam detection rate of your
SA is quite bad. You should improve that instead of trying to find a
barrier which gives you the best FP:FN ratio. It may indeed give you the
best ratio with your bad setup but not the lowest FP rate and probably not
the best ratio compared with a setup that drives spam to the sky.
I see your approach as an interesting way of optimizing the threshold when
you don't get optimal scores. But you would be better off optimizing the
scores.
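
Both the question above (how much spam scores between the floating line and the default 5) and the FP-vs-FN-against-threshold graph suggested earlier in the thread can be tabulated from two hand-verified lists of scores. The following is only a rough sketch: the function names are hypothetical, the sample scores are made up for illustration, and it assumes a message is tagged as spam when its score is at or above the threshold.

```python
def fp_fn_counts(ham_scores, spam_scores, threshold):
    """Return (false_positives, false_negatives) for one threshold.

    Assumption: a message is tagged as spam when score >= threshold.
    """
    false_positives = sum(1 for s in ham_scores if s >= threshold)
    false_negatives = sum(1 for s in spam_scores if s < threshold)
    return false_positives, false_negatives


def spam_between(spam_scores, low, high):
    """Count spam scoring in [low, high): caught at `low`, missed at `high`."""
    return sum(1 for s in spam_scores if low <= s < high)


# Made-up scores for illustration only; substitute hand-verified corpora.
ham = [-2.6, -2.0, -1.9, 0.3, 1.2, 4.1]
spam = [0.9, 3.2, 4.8, 6.5, 12.0, 25.0]

for threshold in (0.55, 2.0, 5.0):
    fp, fn = fp_fn_counts(ham, spam, threshold)
    print(f"threshold {threshold:5.2f}: FP={fp} FN={fn}")
print("spam between 0.55 and 5.0:", spam_between(spam, 0.55, 5.0))
```

Run over real corpora with a finer grid of thresholds, the same table gives the data points for the FP/FN curves below 5.0.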

BTW: what does normalized exactly mean in this context?

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de  http://msie.winware.org





Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kai Schaetzl
Loren Wilton wrote on Mon, 11 Jul 2005 11:30:07 -0700:

 Which of course means that by picking the ratio value you can pick pretty 
 much any fp/fn ratio you want.

Only if the distribution was equal.

Kai






Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kris Deugau
jdow wrote:
 A few weeks ago I'd have said "Easy, Ducky!" Then I ran into Dovecot,
 which uses an indexed, almost-mbox file. There is no way to do it
 other than a good guess. However, for a traditional UNIX mbox file
 you should be able to nail it perfectly simply by looking for the From
 feature. The dirt-stupid mail utility looks for a blank line
 followed by a line that starts with From. All other lines that
 start with From are supposed to be escaped to ensure accurate
 detection. Dovecot skips this blank-line feature sometimes. mail
 does not like this. I have not yet seen any indication that SA is
 upset with this, however.

Just to be pedantic, it's actually (IIRC) a double newline followed by
"From " (note the space!  It's important.)  Many mail-handling apps will
actually parse the From-space header in more detail, just in case.

grep '^From ' mboxfile | wc -l typically gives an accurate count;
procmail at least is bright enough to escape message body lines such
that they don't break this.
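
For anyone who would rather script the count than use grep, here is a minimal sketch of the same idea in Python. The function name and the require_blank_line switch are hypothetical. The loose mode counts every line starting with "From " (what the grep does); the strict mode also insists on a preceding blank line, which is the traditional rule described above and will undercount on files where that blank line is sometimes missing.

```python
def count_mbox_messages(path, require_blank_line=False):
    """Count mbox separator lines, i.e. lines starting with b"From ".

    With require_blank_line=True a separator only counts when it is the
    first line of the file or is preceded by an empty line.
    """
    count = 0
    previous_blank = True  # the first separator has no blank line before it
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b"From ") and (
                previous_blank or not require_blank_line
            ):
                count += 1
            previous_blank = line in (b"\n", b"\r\n")
    return count
```

Comparing the two modes on the same file is a cheap way to spot unescaped "From " lines in message bodies: if the counts differ, either the writer skipped a blank line or a body line was left unescaped.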

-kgd
-- 
Get your mouse off of there!  You don't know where that email has been!


Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kai Schaetzl
Kai Schaetzl wrote on Mon, 11 Jul 2005 22:31:29 +0200:

 With the default of 5 we get almost none, not even one per day.

I wrote that about FPs, which was wrong: we don't get *any* FPs. What I
meant is that we do not get even one *FN* per day.

Kai






Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
 BTW, if anyone knows a command-line program that can easily run through a
 bunch of mbox files and tell how many messages are in them, I will report
 back how many ham and how many spam messages that I have fed to Bayes.

Well, I thought this might give some good stats on the FP:FN ratio, but 
I forgot I manually fed Bayes at the very beginning of the SA 3.02 
install to get it kick-started immediately. So, counting those messages 
won't give anything accurate :( Initially, I thought I was feeding Bayes 
just the FPs and FNs, but I forgot about the initial feeding.





Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kelson

Joe Flowers wrote:
BTW, if anyone knows a command-line program that can easily run through a
bunch of mbox files and tell how many messages are in them, I will
report back how many ham and how many spam messages that I have fed to
Bayes. It's far from perfect, but it may offer some interesting info
regarding the 100:1 (fn:fp) ratio.


I usually do this:

grep -c '^From ' filename

It's not perfect, since it's theoretically possible for someone to have 
a line in their message that starts with
From (to provide an example -- see if your mbox-generating program 
escapes that line!), but it's usually enough.


--
Kelson Vibber
SpeedGate Communications www.speed.net


Re: update on floating dividing score between spam and ham messages

2005-07-10 Thread Joe Flowers

Loren Wilton wrote:


This is quite interesting, and it seems reasonably obvious that with the right
sort of mail (at least, maybe with any mail) this should work better, since
it self-tunes to your conditions.  It does of course assume a reasonable
fp/fn rate to start, but SA is generally pretty good about that.
 



Yes, and that's the assumption - that SA is good at what it does and has 
a practical desert of more than infinitesimal width somewhere/anywhere 
between real spam and real ham.



How have you implemented the floating score?  A change to SA of some sort?
 

I have a small C program which is a very small subset of spamd. I don't 
use spamc or spamd. I just make calls directly into Perl (to Spam 
Assassin since SA is a Perl module) from the C program. The C program is 
a multithreaded socket program that talks to SA and to our mail servers.



Or do you just have something that monitors the current ratio and every so
often restarts SA with a new score threshold?
 



The averages are recalculated/updated after every message.
The dividing line floats without bound, but the 30% is set as a 
command line input argument to the C program and was based on trial and 
error. For example, 50% let way too much spam through. Lower than 30% 
*seems* like it might produce too many false positives - but has not 
been tested and it seems reasonably likely that 25% could have just as 
many FPs as 30%. i.e., an electron could jump across into the next 
energy band without ever being in the middle no-man's land - giving rise 
to a way to measure SA's imperfectness? These percentages are stabs in 
the dark about what the distributions of spam and ham really look like. 
What is intriguing to me is: why can't this 30% be a function of 
something else (maybe like weekend mail versus weekday mail, night 
versus day, etc.) that would increase the overall anti-spam 
effectiveness and not just set and left at 30%? (For example, I 
noticed that we seem to get a higher spam to ham ratio during the 
weekend than during the week.)
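
The C program itself is not shown in the thread, so the following is only a sketch of the arithmetic described above, with hypothetical names throughout. It makes two assumptions the thread does not confirm: each message is added to the ham or spam average according to the current dividing line, and the default 5.0 is used until both averages exist.

```python
def dividing_line(ham_mean, spam_mean, fraction=0.30):
    """Point `fraction` of the way from the ham mean up to the spam mean."""
    return ham_mean + fraction * (spam_mean - ham_mean)


class RunningMean:
    """Arithmetic mean that is updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def add(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count


class FloatingThreshold:
    """Dividing line that is recalculated after every message."""

    def __init__(self, fraction=0.30, default=5.0):
        self.fraction = fraction  # the "30%" command-line argument
        self.default = default    # assumed fallback before any history
        self.ham = RunningMean()
        self.spam = RunningMean()

    def threshold(self):
        if self.ham.count == 0 or self.spam.count == 0:
            return self.default
        return dividing_line(self.ham.mean, self.spam.mean, self.fraction)

    def classify(self, score):
        """Tag one message, then fold its score into the matching mean."""
        is_spam = score >= self.threshold()
        (self.spam if is_spam else self.ham).add(score)
        return is_spam


# Averages reported earlier in the thread: ham -2.1532284, spam 6.8714134.
print(dividing_line(-2.1532284, 6.8714134))  # approximately 0.55416414
```

Making the 30% depend on something else, such as weekday versus weekend, would amount to replacing the fixed fraction with a function of the message's arrival time; nothing else in the sketch would need to change.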


Joe