Re: update on floating dividing score between spam and ham messages
Justin,

Do you have suggestions on how I should come up with the two boundary lines, and what should I do with the "unsure" messages? I'm all ears.

Joe

Justin Mason wrote:

btw, I was just rereading this -- an interesting approach you might want to experiment with is having *two* boundaries, ie:

      negative scores           positive scores
    <-------|----------------|-------->
            |                |
       ham  |   ..unsure..   |  spam

if a mail scores <= ham threshold, it's ham; >= spam threshold, it's spam; and > ham threshold and < spam threshold, it's "unsure". this is similar to the SpamBayes UI.

- --j.
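A minimal sketch of the two-boundary scheme Justin describes; the threshold values here are illustrative placeholders, not SpamAssassin defaults:

```python
# Two-boundary classification: ham / unsure / spam.
# HAM_THRESHOLD and SPAM_THRESHOLD are example values, not SA defaults.
HAM_THRESHOLD = 0.0
SPAM_THRESHOLD = 5.0

def classify(score):
    """Map a SpamAssassin score onto 'ham', 'unsure', or 'spam'."""
    if score <= HAM_THRESHOLD:
        return "ham"
    if score >= SPAM_THRESHOLD:
        return "spam"
    return "unsure"  # falls between the two boundaries
```

Messages landing in the "unsure" band could then be routed to a review folder, much as the SpamBayes UI does.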
Re: update on floating dividing score between spam and ham messages
btw, I was just rereading this -- an interesting approach you might want to experiment with is having *two* boundaries, ie:

      negative scores           positive scores
    <-------|----------------|-------->
            |                |
       ham  |   ..unsure..   |  spam

if a mail scores <= ham threshold, it's ham; >= spam threshold, it's spam; and > ham threshold and < spam threshold, it's "unsure". this is similar to the SpamBayes UI.

- --j.

Joe Flowers writes:
> I don't know if this will help anyone or not, but I wanted to report
> back just in case.
>
> In early April, I completely unhinged the dividing line between what SA
> score is used to mark a message as spam or ham (5.00 = default). This
> allows the system and this dividing line to drift "freely" to anywhere
> that SA will allow, without bound. This anti-spam setup has worked
> consistently much, much better the whole time than any previous
> implementation that we have done, and with very little maintenance. We
> are very happy with it and are looking forward to implementing future SA
> versions in the same fashion.
>
> I'm not exactly sure the following numbers represent the whole time
> since April, but they should be pretty close.
>
> We've had 360,922 spam messages and 396,983 ham messages, with a
> normalized average spam score of 6.8714134 and a normalized average ham
> score of -2.1532284. I have the dividing line "set" at 30% of the
> distance between the average ham score and average spam score (30% above
> the average ham score). So, the dividing line is currently floating
> around 0.55416414.
>
> Apart from the default SA install, the only things I have changed are
> 1. Turned off auto-learn <--- I think this is very important.
> 2. Set SA to ignore our custom spam score tag in the message headers.
>
> We are currently running SA v3.02.
>
> From time to time, but not very often (a couple of times every two
> weeks or so), I do feed bayes (sa-learn) a few messages that are
> misplaced.
> I don't know the stats, but we have very few false positives,
> so I'm mostly feeding bayes with the false negatives, which consist of
> the new/different message tricks that the spammers are using.
>
> Everyone here has been very happy with the results. It's been much, much
> better than any implementation in the past.
> Many thanks to the SA developers! Rock on!
>
> Joe
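Joe's floating dividing line comes down to a one-line calculation; plugging in the averages he reports lands exactly on the threshold he quotes:

```python
# The dividing line sits 30% of the way from the running average ham
# score to the running average spam score (figures from Joe's post).
def floating_threshold(avg_ham, avg_spam, fraction=0.30):
    return avg_ham + fraction * (avg_spam - avg_ham)

threshold = floating_threshold(-2.1532284, 6.8714134)
# -2.1532284 + 0.3 * 9.0246418 = 0.55416414, the ~0.554 reported above
```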
Re: update on floating dividing score between spam and ham messages
Joe Flowers wrote on Tue, 12 Jul 2005 11:55:36 -0400:

> > That's bad, really bad detection ...
>
> No. It's good, really good detection.

Sorry, I don't want to be rude by repeating myself, but if your average spam score is something like 6-something, the *detection* *is* bad. Maybe not the end result, but the pure spam detection. And that's also the reason why you had to try to find a method which lowers the threshold without giving you too many false positives. If your spam scored high enough you simply wouldn't need to do that. That's btw exactly what you said yourself:

> > But anything you can do that widens the
> > typical score distribution between ham and spam is a good thing.
>
> Amen

> For lack of a better term in mind, I used "normalized". If the score of
> a message is more than 30 points (or 25, I'm not going to waste time
> looking back at the code) away from the nearest average, then I set the
> score for the message back to 30 points away from the nearest average.

Ok, I see, you want to avoid peaks; makes sense.

> It sounds like you have put in a lot of time to become an expert in the
> traditional wisdom of SA and to tune it accordingly.

Not more than others here. Not really too much time.

> And, I assume you spend a lot of time keeping it tuned and dealing with
> SA upgrades.

Not at all. I once carefully crafted a combination of my own rules plus SARE rules, trained it with a lot of spam and ham at first, and now just let it run; SARE updates are done automatically by rulesdujour. I haven't paid much attention to it for probably a year now. Just some upgrading to SA 3.1.* recently, and maybe choosing a different SARE ruleset here and there.
> I'm glad you have that time. But, my situation is different and I agree
> with some of the critics of SA - that it requires or almost requires an
> expert to tune it properly and to keep it tuned properly.

You indeed need some time to understand how it all works together, but then you don't need to apply too much care anymore, really. Of course, you should stay up-to-date with releases and the rulesets you use, but that's not a daily business at all.

> And again, you are wrong. It is a very good setup (the proof is in the
> pudding).

As I said earlier, if you look closer at the pudding I'm sure that your false positive rate is much higher than ours. Do you have a proven figure of your FP rate? To make it clear: I don't want to say that you have bad results from your setup. But I'm quite convinced that your FP rate could be much better if you tried to widen the gap between the ham and spam by applying more rules that are able to classify spam, and maybe by fine-tuning a few rule scores (e.g. if your BAYES_99 is reliable you should boost it to 3 or 4; it's overly low in the 3.0 setups). Your peaks are only 8 score points away from each other. Ours are more than 20 points apart, and the vale between them is really low. Which means that even if I slide the threshold one absolute score point, from 5 to 6 or down to 4, I won't get a much different detection rate, because so few messages score in that range. I'm sure that many on this list would have similar results if they did that. But I suppose you can't do that because your gap is simply too small. What happens if you move your threshold one score point down or up?

Kai

--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org
Re: update on floating dividing score between spam and ham messages
Kai Schaetzl wrote:
> Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400:
>
> That's bad, really bad detection ...

No. It's good, really good detection.

> You should improve that instead of trying to find a barrier which gives
> you the best FP:FN ratio.

I'm not trying to find the best FP:FN ratio, and I never said I was.

> It may indeed give you the best ratio with your bad setup

No. It's a good setup, and I won't tell our users that they should stop complimenting us on its success, because the results speak for themselves and back up those compliments. And yes, it does work very well in the short and long run - you are purely speculating that it does not.

> BTW: what does "normalized" exactly mean in this context?

For lack of a better term in mind, I used "normalized". If the score of a message is more than 30 points (or 25, I'm not going to waste time looking back at the code) away from the nearest average, then I set the score for the message back to 30 points away from the nearest average. It's to keep the drifting averages from changing rapidly and abruptly. It does not prevent them from drifting anywhere, however.

It sounds like you have put in a lot of time to become an expert in the traditional wisdom of SA and to tune it accordingly. And, I assume you spend a lot of time keeping it tuned and dealing with SA upgrades. I'm glad you have that time. But my situation is different, and I agree with some of the critics of SA - that it requires, or almost requires, an expert to tune it properly and to keep it tuned properly - at least with the traditional wisdom of how it "should" be set up. My goal is not to become an SA expert; it would be nice if it happened along the way somewhere down the road, but it is not a goal. Also, I have no plans of hiring an SA expert unless it comes as a bonus along with the more important things we are doing here. It will not be a deciding factor. It's nice, but it is not primary. And again, you are wrong.
It is a very good setup (the proof is in the pudding) and the only thing I see being changed in the future is an upgrade to the latest version of SA. Joe
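What Joe calls "normalization" amounts to clamping outlier scores before they feed the running averages. A sketch under the figures from the post (30 points, which Joe notes may actually be 25 in his code):

```python
CLAMP = 30.0  # max distance a score may sit from the nearer average

def clamp_score(score, avg_ham, avg_spam):
    """Pull an extreme score back to CLAMP points from the nearer
    running average, so one outlier can't jerk the drifting averages
    around abruptly."""
    if abs(score - avg_ham) <= abs(score - avg_spam):
        nearer = avg_ham
    else:
        nearer = avg_spam
    if score > nearer + CLAMP:
        return nearer + CLAMP
    if score < nearer - CLAMP:
        return nearer - CLAMP
    return score
```

As described, this caps how fast the averages can move, while still letting them drift anywhere over time.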
Re: update on floating dividing score between spam and ham messages
Joe Flowers wrote:
> BTW, if anyone knows a command line program that can easily run through
> a bunch of mbox files and tell how many messages are in them, I will
> report back how many ham and how many spam messages I have fed to
> bayes. It's far from perfect, but it may offer some interesting info
> regarding the 100:1 (fn:fp) ratio.

I usually do this:

grep -c "^From " filename

It's not perfect, since it's theoretically possible for someone to have a line in their message that starts with From (to provide an example -- see if your mbox-generating program escapes that line!), but it's usually enough.

--
Kelson Vibber
SpeedGate Communications
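For the counting task, Python's standard library offers an alternative to grep. Note that `mailbox.mbox` follows the same mboxo convention as the grep above (any line beginning "From " is treated as a separator), so it likewise relies on body occurrences being escaped as ">From ":

```python
import mailbox

def count_messages(path):
    """Number of messages in an mbox file, using the stdlib parser
    rather than a raw grep for lines starting with 'From '."""
    return len(mailbox.mbox(path))
```

Usage would be along the lines of `for f in paths: print(f, count_messages(f))`.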
Re: update on floating dividing score between spam and ham messages
> BTW, if anyone knows a command line program that can easily run through
> a bunch of mbox files and tell how many messages are in them, I will
> report back how many ham and how many spam messages I have fed to bayes.

Well, I thought this might give some good stats on the FP:FN ratio, but I forgot that I manually fed Bayes at the very beginning of the SA 3.02 install to get it kick-started immediately. So, counting those messages won't give anything accurate :( Initially, I thought I was feeding Bayes just the FPs and FNs, but I forgot about the initial feeding.
Re: update on floating dividing score between spam and ham messages
Kai Schaetzl wrote on Mon, 11 Jul 2005 22:31:29 +0200: > With the default of 5 we get almost none, not even one per day. That was about FPs. Wrong. We don't get *any* FPs. We do not get even one *FN* per day. Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.conactive.com IE-Center: http://ie5.de & http://msie.winware.org
Re: update on floating dividing score between spam and ham messages
jdow wrote:
> A few weeks ago I'd have said "Easy, Ducky!" Then I ran into DoveCot,
> which uses an indexed almost-"mbox" file. There is no way to do it
> other than "good guess". However, for a traditional UNIX mbox file
> you should be able to nail it perfectly simply by looking for the "From"
> feature. The dirt-stupid "mail" utility looks for a blank line
> followed by a line that starts with "From". All other lines that
> start with From are supposed to be escaped to ensure accurate
> detection. DoveCot skips this blank line feature sometimes. "mail"
> does not like this. I have not yet seen any indication that SA is
> upset by this, however.

Just to be pedantic, it's actually (IIRC) a double newline followed by "From " (note the space! It's important.) Many mail-handling apps will actually parse the From-space "header" in more detail, "just in case".

grep "^From " | wc -l

typically gives an accurate count; procmail at least is bright enough to escape message body lines so that they don't break this.

-kgd
--
Get your mouse off of there! You don't know where that email has been!
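The two separator heuristics discussed here — counting every line that begins with "From ", versus requiring the separator to start the file or follow a blank line — can be compared directly. A sketch (real parsers also cope with CRLF line endings, which this ignores):

```python
import re

def naive_count(text):
    """Equivalent of grep -c '^From ': every line starting 'From '."""
    return len(re.findall(r"^From ", text, re.MULTILINE))

def strict_count(text):
    """Separator must open the file or follow a blank line
    (the double-newline-then-'From ' rule described above)."""
    return len(re.findall(r"(?:\A|\n\n)From ", text))
```

An unescaped body line starting with "From " inflates the naive count but not the strict one.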
Re: update on floating dividing score between spam and ham messages
Loren Wilton wrote on Mon, 11 Jul 2005 11:30:07 -0700: > Which of course means that by picking the ratio value you can pick pretty > much any fp/fn ratio you want. Only if the distribution was equal. Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.conactive.com IE-Center: http://ie5.de & http://msie.winware.org
Re: update on floating dividing score between spam and ham messages
Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400:

> We are very glad and happy about this concept and implementation.

Well, the big question is: how many of your spam messages score between the default 5 and your "floating score"? If it is many, there's obviously something wrong with your setup: your spam is not scoring high enough. Additionally, it means that your Bayes auto-learn will feed less spam to learn from than it could, because your overall spam score is way too low.

Our average ham score is indeed around -2, as yours is. And it's a very high peak; -2 mails are more than all other ham mails combined. However, our spam score peak is *way* higher than yours: it "flattens" over 18 and 30, so the average is somewhere around 25 or so. (I deduced that from looking at the raw figures, not by calculating a median or average.) I consider your average spam score of 6 *extremely* bad from a detection standpoint.

With a score of 0.5 I would get a *considerable* amount of ham scored as spam. With the default of 5 we get almost none, not even one per day. I doubt that your rate of FPs is nearly non-existent with a spam threshold of 0.5. There *must* be a considerable rate of FPs; you just don't hear about it.

I think the general approach on this list is to make spam score as spammy as possible. That's what we do as well. Instead of driving spam to the sky, you are trying to find some non-existing "barrier" which may indeed float, because tomorrow's messages score differently than yesterday's. It does not float at all in the long run. And it exists *only* in the long run. It may throw off the next day's detection quite heavily, since there's no guarantee spam and ham look the same the next day or even float around that point. It's not even a statistical figure; you deliberately set it to 30%, probably because you get too much spam if you set it higher. That's bad, really bad detection ... If much of your spam scores lower than 5, then the spam detection rate of your SA is quite bad.
You should improve that instead of trying to find a barrier which gives you the best FP:FN ratio. It may indeed give you the best ratio with your bad setup, but not the lowest FP rate, and probably not the best ratio compared with a setup that drives spam to the sky. I see your approach as an interesting way of optimizing the threshold when you don't get optimal scores. But you would be better off optimizing the scores.

BTW: what does "normalized" exactly mean in this context?

Kai

--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org
Re: update on floating dividing score between spam and ham messages
A few weeks ago I'd have said "Easy, Ducky!" Then I ran into DoveCot, which uses an indexed almost-"mbox" file. There is no way to do it other than "good guess". However, for a traditional UNIX mbox file you should be able to nail it perfectly simply by looking for the "From" feature. The dirt-stupid "mail" utility looks for a blank line followed by a line that starts with "From". All other lines that start with From are supposed to be escaped to ensure accurate detection. DoveCot skips this blank line feature sometimes. "mail" does not like this. I have not yet seen any indication that SA is upset by this, however.

{^_^}

- Original Message -
From: "Joe Flowers" <[EMAIL PROTECTED]>
> Matt:
>
> I know you know a lot more about this than I do, but for what it's
> worth, your impressions/intuitions are very close to mine.
> Originally back in April, I started off using the "average of the
> means", but that let through way too much spam.
>
> So, what I have now is it set to 30% above the average ham score, which
> is 20% below the "average of the means".
> The assumption being that the optimal spot is somewhere between the two
> averages.
>
> Also, that nasty drop-off that produces a lot of FPs is in my intuition
> too, and as of yet, we haven't run into it.
>
> Now, if the two curves could be slid apart wider so that there is a big
> dead zone... Although, without upgrading to a newer version of SA, I
> don't see how I can expect much better results.
>
> BTW, if anyone knows a command line program that can easily run through
> a bunch of mbox files and tell how many messages are in them, I will
> report back how many ham and how many spam messages I have fed to
> bayes. It's far from perfect, but it may offer some interesting info
> regarding the 100:1 (fn:fp) ratio.
> > Joe > > > Matt Kettler wrote: > > >Joe Flowers wrote: > > > > > >>Matt Kettler wrote: > >> > >> > >> > >>>The only problem I see with this approach is that it treats false > >>>positives and > >>>false negatives as being equally bad. > >>> > >>> > >>> > >>> > >>We do get many more false negatives than false positives, even though we > >>don't get false positives very often - they are rare. > >>We certainly don't get 1 fp for every fn. > >> > >> > >> > >>>In general, you're adjusting the score bias so the number of FP's and > >>>FNs are > >>>approximately equal. > >>> > >>> > >>This is not what we are seeing in practice. It's not even close to 50-50. > >> > >> > >> > > > >Based on JM's comments about the score distribution for hams being non-linear, > >this makes sense. If the distribution was linear for both you'd get 50/50 by > >dividing the score between the two means. > > > >Since the ham is going to have a pretty sharp drop-off somewhere slightly above > >it's mean your split score approach won't be as bad as 1:1, but it's also likely > >to not be as good as 100:1 which the 5.0 threshold should get you. > > > >It's an interesting concept, and it would be very interesting to graph out FP vs > >FN rates against thresholds. > > > >This graph from JM's post is real data: > >http://spamassassin.apache.org/presentations/HEANet_2002/img12.html > > > >But it doesn't go below 5.0. It would be interesting to see how those curves > >continue as you approach 0. > > > >This graph is a good conceptual one in the "normal" sense of numbers: > >http://taint.org/xfer/2005/score-dist-doodle.gif > > > >That graph would suggest that somewhere below 5.0 there is a threshold at which > >the ham FP rate gets MUCH worse in a very sudden way. However, there's no score > >associated. I'd venture to guess that your "average of the means" is going to > >wind up picking something near, but just above that threshold. 
> > > >That's a bit of an intuitive guess, but also it has some roots in reality. The > >average score of a ham message on a curve like that is going to wind up being > >somewhere in the middle of that nasty drop off. By biasing just above that you > >should bring yourself into the second part of the curve, where decreases in > >score have a somewhat modest impact on FP rate. > > > > > > >
Re: update on floating dividing score between spam and ham messages
jdow wrote: > The greater the separation the > better the results for a decision point between them. > But anything you can do that widens the > typical score distribution between ham and spam is a good thing. Amen
Re: update on floating dividing score between spam and ham messages
> There's another thing worth noting -- the SpamAssassin score distribution
> for hams and spams isn't even.

I don't see that those particular curve shapes necessarily invalidate this method in any way, although they do bias the method somewhat. The two curves are essentially smooth curves with no major dips or bumps in them, so it is possible to select a ratio without getting inversions in the ratio as the selector moves from left to right. You may have to be careful calculating the ratio, given that ham goes to effectively zero above a certain value. But n:0 and 3.45n:0 are still perfectly valid ratios to deal with, even if one of the terms is zero.

Loren
Re: update on floating dividing score between spam and ham messages
Matt:

I know you know a lot more about this than I do, but for what it's worth, your impressions/intuitions are very close to mine. Originally back in April, I started off using the "average of the means", but that let through way too much spam.

So, what I have now is it set to 30% above the average ham score, which is 20% below the "average of the means". The assumption being that the optimal spot is somewhere between the two averages.

Also, that nasty drop-off that produces a lot of FPs is in my intuition too, and as of yet, we haven't run into it.

Now, if the two curves could be slid apart wider so that there is a big dead zone... Although, without upgrading to a newer version of SA, I don't see how I can expect much better results.

BTW, if anyone knows a command line program that can easily run through a bunch of mbox files and tell how many messages are in them, I will report back how many ham and how many spam messages I have fed to bayes. It's far from perfect, but it may offer some interesting info regarding the 100:1 (fn:fp) ratio.

Joe

Matt Kettler wrote:

> Joe Flowers wrote:
>> Matt Kettler wrote:
>>> The only problem I see with this approach is that it treats false
>>> positives and false negatives as being equally bad.
>>
>> We do get many more false negatives than false positives, even though
>> we don't get false positives very often - they are rare.
>> We certainly don't get 1 fp for every fn.
>>
>>> In general, you're adjusting the score bias so the number of FP's and
>>> FNs are approximately equal.
>>
>> This is not what we are seeing in practice. It's not even close to
>> 50-50.
>
> Based on JM's comments about the score distribution for hams being
> non-linear, this makes sense. If the distribution was linear for both,
> you'd get 50/50 by dividing the score between the two means.
>
> Since the ham is going to have a pretty sharp drop-off somewhere
> slightly above its mean, your split score approach won't be as bad as
> 1:1, but it's also likely not to be as good as the 100:1 which the 5.0
> threshold should get you.
>
> It's an interesting concept, and it would be very interesting to graph
> out FP vs FN rates against thresholds.
>
> This graph from JM's post is real data:
> http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
>
> But it doesn't go below 5.0. It would be interesting to see how those
> curves continue as you approach 0.
>
> This graph is a good conceptual one in the "normal" sense of numbers:
> http://taint.org/xfer/2005/score-dist-doodle.gif
>
> That graph would suggest that somewhere below 5.0 there is a threshold
> at which the ham FP rate gets MUCH worse in a very sudden way. However,
> there's no score associated. I'd venture to guess that your "average of
> the means" is going to wind up picking something near, but just above,
> that threshold.
>
> That's a bit of an intuitive guess, but also it has some roots in
> reality. The average score of a ham message on a curve like that is
> going to wind up being somewhere in the middle of that nasty drop off.
Re: update on floating dividing score between spam and ham messages
> > score of -2.1532284. I have the dividing line "set" at 30% of the
> > distance between the average ham score and average spam score (30% above
> > the average ham score). So, the dividing line is currently floating
> > around 0.55416414.
>
> The only problem I see with this approach is that it treats false positives
> and false negatives as being equally bad.

Matt, isn't he actually treating an FP as ~2x as bad as an FN? He has the divider set to 30%, so it is biased in one direction or the other. Which of course means that by picking the ratio value you can pick pretty much any fp/fn ratio you want.

Loren
Re: update on floating dividing score between spam and ham messages
From: "Matt Kettler" <[EMAIL PROTECTED]>
> Joe Flowers wrote:
> > I don't know if this will help anyone or not, but I wanted to report
> > back just in case.
> >
> > In early April, I completely unhinged the dividing line between what SA
> > score is used to mark a message as spam or ham (5.00 = default). This
> > allows the system and this dividing line to drift "freely" to anywhere
> > that SA will allow, without bound. This anti-spam setup has worked
> > consistently much, much better the whole time than any previous
> > implementation that we have done, and with very little maintenance. We
> > are very happy with it and are looking forward to implementing future SA
> > versions in the same fashion.
> >
> > I'm not exactly sure the following numbers represent the whole time
> > since April, but they should be pretty close.
> >
> > We've had 360,922 spam messages and 396,983 ham messages, with a
> > normalized average spam score of 6.8714134 and a normalized average ham
> > score of -2.1532284. I have the dividing line "set" at 30% of the
> > distance between the average ham score and average spam score (30% above
> > the average ham score). So, the dividing line is currently floating
> > around 0.55416414.
>
> The only problem I see with this approach is that it treats false
> positives and false negatives as being equally bad.
>
> In general, you're adjusting the score bias so the number of FP's and
> FNs are approximately equal. Although STATISTICS*.txt would suggest that
> this boundary occurs somewhere near 2.0, your own local biases could
> change this considerably.
>
> SA's normal scoreset is evolved with the concept that it's better to
> have 99 false negatives than 1 false positive. The concept here is that
> most people use scripts to move their spam into a separate folder, or
> auto-delete it. With that going on, an FP is potentially lost valid
> email, whereas an FN is a minor inconvenience.
Operating experience here seems to indicate that the SA score evolution is not optimum. What you want to do is create a brassiere curve for the markups for ham and spam. The greater the separation, the better the results for a decision point between them. The bias to prevent false positives probably means you do not want the decision point right in the center.

But anything you can do that widens the typical score distribution between ham and spam is a good thing. It makes the decision point less sensitive to set, and the overall error rates lower. I think this is part of the reason I have so much success on a box vastly overloaded with SARE and other rules. The good rules pile one on the other until it's VERY clear what is ham and what is spam. (It surely would be nice if there were some really good indications of "not spam". However, nothing has ever appeared other than absence of hits on spam-sign.)

{^_^}
Re: update on floating dividing score between spam and ham messages
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 the real-world figures can be seen for various thresholds in the rules/STATISTICS*.txt files... - --j. Matt Kettler writes: > Joe Flowers wrote: > > Matt Kettler wrote: > > > >> The only problem I see with this approach is that it treats false > >> positives and > >> false negatives as being equally bad. > >> > >> > > > > We do get many more false negatives than false positives, even though we > > don't get false positives very often - they are rare. > > We certainly don't get 1 fp for every fn. > > > >> In general, you're adjusting the score bias so the number of FP's and > >> FNs are > >> approximately equal. > > > > > > This is not what we are seeing in practice. It's not even close to 50-50. > > > > Based on JM's comments about the score distribution for hams being non-linear, > this makes sense. If the distribution was linear for both you'd get 50/50 by > dividing the score between the two means. > > Since the ham is going to have a pretty sharp drop-off somewhere slightly > above > it's mean your split score approach won't be as bad as 1:1, but it's also > likely > to not be as good as 100:1 which the 5.0 threshold should get you. > > It's an interesting concept, and it would be very interesting to graph out FP > vs > FN rates against thresholds. > > This graph from JM's post is real data: > http://spamassassin.apache.org/presentations/HEANet_2002/img12.html > > But it doesn't go below 5.0. It would be interesting to see how those curves > continue as you approach 0. > > This graph is a good conceptual one in the "normal" sense of numbers: > http://taint.org/xfer/2005/score-dist-doodle.gif > > That graph would suggest that somewhere below 5.0 there is a threshold at > which > the ham FP rate gets MUCH worse in a very sudden way. However, there's no > score > associated. I'd venture to guess that your "average of the means" is going to > wind up picking something near, but just above that threshold. 
> That's a bit of an intuitive guess, but also it has some roots in reality. The
> average score of a ham message on a curve like that is going to wind up being
> somewhere in the middle of that nasty drop off. By biasing just above that you
> should bring yourself into the second part of the curve, where decreases in
> score have a somewhat modest impact on FP rate.
Re: update on floating dividing score between spam and ham messages
Joe Flowers wrote:
> Matt Kettler wrote:
>
>> The only problem I see with this approach is that it treats false
>> positives and false negatives as being equally bad.
>
> We do get many more false negatives than false positives, even though we
> don't get false positives very often - they are rare.
> We certainly don't get 1 fp for every fn.
>
>> In general, you're adjusting the score bias so the number of FP's and
>> FNs are approximately equal.
>
> This is not what we are seeing in practice. It's not even close to 50-50.

Based on JM's comments about the score distribution for hams being non-linear, this makes sense. If the distribution was linear for both, you'd get 50/50 by dividing the score between the two means.

Since the ham is going to have a pretty sharp drop-off somewhere slightly above its mean, your split score approach won't be as bad as 1:1, but it's also likely not to be as good as the 100:1 which the 5.0 threshold should get you.

It's an interesting concept, and it would be very interesting to graph out FP vs FN rates against thresholds.

This graph from JM's post is real data:
http://spamassassin.apache.org/presentations/HEANet_2002/img12.html

But it doesn't go below 5.0. It would be interesting to see how those curves continue as you approach 0.

This graph is a good conceptual one in the "normal" sense of numbers:
http://taint.org/xfer/2005/score-dist-doodle.gif

That graph would suggest that somewhere below 5.0 there is a threshold at which the ham FP rate gets MUCH worse in a very sudden way. However, there's no score associated. I'd venture to guess that your "average of the means" is going to wind up picking something near, but just above, that threshold.

That's a bit of an intuitive guess, but also it has some roots in reality. The average score of a ham message on a curve like that is going to wind up being somewhere in the middle of that nasty drop off. By biasing just above that you should bring yourself into the second part of the curve, where decreases in score have a somewhat modest impact on FP rate.
Re: update on floating dividing score between spam and ham messages
Thanks Justin! That's good, new info for me. That'll help me, *at the very least*, visualize what I am trying to do a little better. I've been very curious to know what the rough shapes of those graphs look like.

Joe

Justin Mason wrote:

There's another thing worth noting -- the SpamAssassin score distribution for hams and spams isn't even.

If you draw a graph of hams and spams, plotting the number of mails in each category on the vertical axis and the score they get on the horizontal axis, you don't get a simple pair of intersecting straight lines. Instead, since we have many more mark-as-spam rules than mark-as-ham, and due to how the perceptron attempts to optimise for the 5.0 threshold, what happens is that you have two different lines. The ham line is a sigmoid curve that starts high in the negative area and curves down to almost 0 at the 5.0 threshold mark. The spam line, by contrast, is a straight line.

http://taint.org/xfer/2005/score-dist-doodle.gif is a doodle to illustrate this, or take a look at http://spamassassin.apache.org/presentations/HEANet_2002/img12.html for real-world graphs of this data from 2002 -- although graphing the inverse.

Very interesting approach though!

- --j.
Re: update on floating dividing score between spam and ham messages
There's another thing worth noting -- the SpamAssassin score distribution for hams and spams isn't even. If you draw a graph of hams and spams, plotting the number of mails in each category as the vertical axis and the score they get as the horizontal axis, you don't get a simple pair of intersecting straight lines.

Instead, since we have many more mark-as-spam rules than mark-as-ham, and due to how the perceptron attempts to optimise for the 5.0 threshold, what happens is that you get two different lines. The ham line is a sigmoid curve that starts high in the negative area and curves down to almost 0 at the 5.0 threshold mark. The spam line, by contrast, is a straight line.

http://taint.org/xfer/2005/score-dist-doodle.gif is a doodle to illustrate this, or take a look at http://spamassassin.apache.org/presentations/HEANet_2002/img12.html for real-world graphs of this data from 2002 -- although graphing the inverse.

Very interesting approach though!

- --j.

Joe Flowers writes:
> Matt Kettler wrote:
> > The only problem I see with this approach is that it treats false positives
> > and false negatives as being equally bad.
>
> We do get many more false negatives than false positives, even though we
> don't get false positives very often - they are rare.
> We certainly don't get 1 fp for every fn.
>
> > In general, you're adjusting the score bias so the number of FP's and FNs
> > are approximately equal.
>
> This is not what we are seeing in practice. It's not even close to 50-50.
>
> > Although STATISTICS*.txt would suggest that this boundary occurs somewhere
> > near 2.0, your own local biases could change this considerably.
> >
> > SA's normal scoreset is evolved with the concept that it's better to have
> > 99 false negatives than 1 false positive.
>
> We are very glad and happy about this concept and implementation.
>
> > The concept here is most people use scripts to move their spam into a
> > separate folder, or auto delete it. With that going on, a FP is
> > potentially lost valid email, whereas a FN is a minor inconvenience.
>
> Yes. We work hard to inform our users and to actively solicit their
> feedback on how the system is working and to look out for the system
> misplacing emails, especially valid ones. I know it's still not perfect.
>
> > For any site that considers FPs to be "not too bad" because all mail is
> > manually examined anyway, lowering the score threshold may be a workable
> > thing.
> >
> > However, other sites that auto-delete such messages may have considerable
> > problems if they lower the threshold.
>
> YES!
Re: update on floating dividing score between spam and ham messages
Matt Kettler wrote:
> The only problem I see with this approach is that it treats false positives
> and false negatives as being equally bad.

We do get many more false negatives than false positives, even though we don't get false positives very often - they are rare. We certainly don't get 1 fp for every fn.

> In general, you're adjusting the score bias so the number of FP's and FNs
> are approximately equal.

This is not what we are seeing in practice. It's not even close to 50-50.

> Although STATISTICS*.txt would suggest that this boundary occurs somewhere
> near 2.0, your own local biases could change this considerably.
>
> SA's normal scoreset is evolved with the concept that it's better to have 99
> false negatives than 1 false positive.

We are very glad and happy about this concept and implementation.

> The concept here is most people use scripts to move their spam into a
> separate folder, or auto delete it. With that going on, a FP is potentially
> lost valid email, whereas a FN is a minor inconvenience.

Yes. We work hard to inform our users and to actively solicit their feedback on how the system is working and to look out for the system misplacing emails, especially valid ones. I know it's still not perfect.

> For any site that considers FPs to be "not too bad" because all mail is
> manually examined anyway, lowering the score threshold may be a workable
> thing.
>
> However, other sites that auto-delete such messages may have considerable
> problems if they lower the threshold.

YES!
Re: update on floating dividing score between spam and ham messages
Joe Flowers wrote:
> I don't know if this will help anyone or not, but I wanted to report
> back just in case.
>
> In early April, I completely unhinged the dividing line between what SA
> score is used to mark a message as spam or ham (5.00 = default). This
> allows the system and this dividing line to drift "freely" to anywhere
> that SA will allow, without bound. This anti-spam setup has worked
> consistently much, much better the whole time than in any previous
> implementation that we have done, and with very little maintenance. We
> are very happy with it and are looking forward to implementing future SA
> versions in the same fashion.
>
> I'm not exactly sure the following numbers represent the whole time
> since April, but they should be pretty close.
>
> We've had 360,922 spam messages and 396,983 ham messages with a
> normalized average spam score of 6.8714134 and a normalized average ham
> score of -2.1532284. I have the dividing line "set" at 30% of the
> distance between the average ham score and average spam score (30% above
> the average ham score). So, the dividing line is currently floating
> around 0.55416414.

The only problem I see with this approach is that it treats false positives and false negatives as being equally bad. In general, you're adjusting the score bias so the number of FP's and FNs are approximately equal. Although STATISTICS*.txt would suggest that this boundary occurs somewhere near 2.0, your own local biases could change this considerably.

SA's normal scoreset is evolved with the concept that it's better to have 99 false negatives than 1 false positive. The concept here is that most people use scripts to move their spam into a separate folder, or auto-delete it. With that going on, a FP is potentially lost valid email, whereas a FN is a minor inconvenience.

For any site that considers FPs to be "not too bad" because all mail is manually examined anyway, lowering the score threshold may be a workable thing.
However, other sites that auto-delete such messages may have considerable problems if they lower the threshold.
Re: update on floating dividing score between spam and ham messages
Loren Wilton wrote:
> This is quite interesting, and seems reasonably obvious that with the right
> sort of mail (at least, maybe with any mail) this should work better, since
> it self-tunes to your conditions. It does of course assume a reasonable
> fp/fn rate to start, but SA is generally pretty good about that.

Yes, and that's the assumption - that SA is good at what it does and has a "practical" desert of more than infinitesimal width somewhere/anywhere between real spam and real ham.

> How have you implemented the floating score? A change to SA of some sort?

I have a small C program which is a very small subset of spamd. I don't use spamc or spamd. I just make calls directly into Perl (to SpamAssassin, since SA is a Perl module) from the C program. The C program is a multithreaded socket program that talks to SA and to our mail servers.

> Or do you just have something that monitors the current ratio and every so
> often restarts SA with a new score threshold?

The averages are recalculated/updated after every message. The "dividing" line floats without bound, but the "30%" is "set" as a command-line input argument to the C program and was based on trial and error. For example, 50% let way too much spam through. Lower than 30% *seems* like it might produce too many false positives - but that has not been tested, and it seems reasonably likely that 25% could have just as many FPs as 30%. I.e., an electron could jump across into the next energy band without ever being in the middle no-man's land - giving rise to a way to measure SA's imperfection?

These percentages are stabs in the dark about what the distributions of spam and ham really look like. What is intriguing to me is: why can't this "30%" be a function of something else (maybe like weekend mail versus weekday mail, night versus day, etc.) that would increase the overall anti-spam effectiveness, instead of just being set and left at "30%"?
(For example, I noticed that we seem to get a higher spam to ham ratio during the weekend than during the week.) Joe
Re: update on floating dividing score between spam and ham messages
This is quite interesting, and it seems reasonably obvious that with the right sort of mail (at least, maybe with any mail) this should work better, since it self-tunes to your conditions. It does of course assume a reasonable fp/fn rate to start, but SA is generally pretty good about that.

How have you implemented the floating score? A change to SA of some sort? Or do you just have something that monitors the current ratio and every so often restarts SA with a new score threshold?

Loren
update on floating dividing score between spam and ham messages
I don't know if this will help anyone or not, but I wanted to report back just in case.

In early April, I completely unhinged the dividing line between what SA score is used to mark a message as spam or ham (5.00 = default). This allows the system and this dividing line to drift "freely" to anywhere that SA will allow, without bound. This anti-spam setup has worked consistently much, much better the whole time than in any previous implementation that we have done, and with very little maintenance. We are very happy with it and are looking forward to implementing future SA versions in the same fashion.

I'm not exactly sure the following numbers represent the whole time since April, but they should be pretty close.

We've had 360,922 spam messages and 396,983 ham messages with a normalized average spam score of 6.8714134 and a normalized average ham score of -2.1532284. I have the dividing line "set" at 30% of the distance between the average ham score and average spam score (30% above the average ham score). So, the dividing line is currently floating around 0.55416414.

Apart from the default SA install, the only things I have changed are:
1. Turned off auto-learn <--- I think this is very important.
2. Set SA to ignore our custom spam score tag in the message headers.

We are currently running SA v3.02.

From time to time, but not very often (a couple of times every two weeks or so), I do feed bayes (sa-learn) with a few messages that are misplaced. I don't know the stats, but we have very few false positives, so I'm mostly feeding bayes with the false negatives, which consist of the new/different message tricks that the spammers are using.

Everyone here has been very happy with the results. It's been much, much better than any implementation in the past. Many thanks to the SA developers! Rock on!

Joe