Re: update on floating dividing score between spam and ham messages

2005-07-18 Thread Justin Mason
btw, I was just rereading this -- an interesting approach you might want to experiment with is having *two* boundaries. ie: negative scores positive scores
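
A minimal sketch of the two-boundary idea, for illustration only. The snippet above is cut off before it says how the boundaries are chosen or what happens to messages that fall between them, so the function name `classify`, the parameter names `ham_max` and `spam_min`, and the example values are all assumptions, not anything from the thread or from SpamAssassin itself.

```python
def classify(score, ham_max, spam_min):
    """Three-way split of a score using two boundaries.

    ham_max and spam_min are hypothetical names for the lower and
    upper boundary; scores strictly between them are left undecided.
    """
    if ham_max > spam_min:
        raise ValueError("ham_max must not exceed spam_min")
    if score <= ham_max:
        return "ham"
    if score >= spam_min:
        return "spam"
    return "unsure"


if __name__ == "__main__":
    # Illustrative boundaries only; they are not taken from the thread.
    for s in (-2.0, 3.0, 7.5):
        print(s, classify(s, ham_max=1.0, spam_min=5.0))
```

Setting both boundaries to the same value collapses this back to a single dividing line.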

Re: update on floating dividing score between spam and ham messages

2005-07-18 Thread Joe Flowers
Justin, Do you have suggestions on how I should come up with the two boundary lines and what do I do with the unsure messages? I'm all ears. Joe Justin Mason wrote: btw, I was just rereading this -- an interesting approach you might want to

Re: update on floating dividing score between spam and ham messages

2005-07-12 Thread Joe Flowers
Kai Schaetzl wrote: Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400: That's bad, really bad detection ... No. It's good, really good detection. You should improve that instead of trying to find a barrier which gives you the best FP:FN ratio. I'm not trying to find the best

Re: update on floating dividing score between spam and ham messages

2005-07-12 Thread Kai Schaetzl
Joe Flowers wrote on Tue, 12 Jul 2005 11:55:36 -0400: That's bad, really bad detection ... No. It's good, really good detection. Sorry, I don't want to be rude by repeating myself, but if your average spam score is something like 6-something the *detection* *is* bad. Maybe not

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Matt Kettler
Joe Flowers wrote: I don't know if this will help anyone or not, but I wanted to report back just in case. In early April, I completely unhinged the dividing line between what SA score is used to mark a message as spam or ham (5.00 = default). This allows the system and this dividing line

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
Matt Kettler wrote: The only problem I see with this approach is that it treats false positives and false negatives as being equally bad. We do get many more false negatives than false positives, even though we don't get false positives very often - they are rare. We certainly don't get

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Justin Mason
There's another thing worth noting -- the SpamAssassin score distribution for hams and spams isn't even. If you draw a graph of hams and spams, plotting the number of mails in each category as the vertical axis and the score they get as the

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
Thanks Jason! That's good, new info for me. That'll help me *at the very least* visualize what I am trying to do a little better. I've been very curious to know what the rough shapes of those graphs look like. Joe Justin Mason wrote:

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Matt Kettler
Joe Flowers wrote: Matt Kettler wrote: The only problem I see with this approach is that it treats false positives and false negatives as being equally bad. We do get many more false negatives than false positives, even though we don't get false positives very often - they are rare.

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Justin Mason
the real-world figures can be seen for various thresholds in the rules/STATISTICS*.txt files... --j. Matt Kettler writes: Joe Flowers wrote: Matt Kettler wrote: The only problem I see with this approach is that it treats false positives

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread jdow
From: Matt Kettler Joe Flowers wrote: I don't know if this will help anyone or not, but I wanted to report back just in case. In early April, I completely unhinged the dividing line between what SA score is used to mark a message as spam or ham (5.00 = default). This

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Loren Wilton
score of -2.1532284. I have the dividing line set at 30% of the distance between the average ham score and average spam score (30% above the average ham score). So, the dividing line is currently floating around 0.55416414. The only problem I see with this approach is that it treats
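
The rule quoted above (a dividing line 30% of the way from the average ham score to the average spam score) can be sketched roughly as follows. This is not Joe's actual implementation, which the thread does not show; the function names and the sample scores are made up for illustration.

```python
def mean(scores):
    """Arithmetic mean of a non-empty list of scores."""
    return sum(scores) / len(scores)


def floating_threshold(ham_mean, spam_mean, fraction=0.30):
    """Dividing line placed `fraction` of the way from ham_mean to spam_mean.

    fraction=0.30 mirrors the "30% above the average ham score" wording;
    fraction=0.50 would be the plain midpoint of the two means.
    """
    return ham_mean + fraction * (spam_mean - ham_mean)


if __name__ == "__main__":
    # Hypothetical scores, not data from the thread.
    ham_scores = [-3.0, -2.5, -1.0, -1.5]
    spam_scores = [4.0, 6.5, 9.0, 12.5]
    print(floating_threshold(mean(ham_scores), mean(spam_scores)))
```

If the -2.1532284 quoted above is the average ham score (the snippet is cut off, so this is an assumption), a dividing line of 0.55416414 at 30% would imply an average spam score of roughly 6.87.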

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
Matt: I know you know a lot more about this than I do, but for what it's worth, your impressions/intuitions are very close to mine. Originally back in April, I started off using the average of the means, but that let through way too much spam. So, what I have now is it set to 30% above the

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Loren Wilton
There's another thing worth noting -- the SpamAssassin score distribution for hams and spams isn't even. I don't see that those particular curve shapes necessarily invalidate this method in any way, although they do bias the method somewhat. The two curves are essentially smooth

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
jdow wrote: The greater the separation choke the better the results for a decision point between them. But anything you can do that widens the typical score distribution between ham and spam is a good thing. Amen

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread jdow
A few weeks ago I'd have said Easy, Ducky! Then I ran into DoveCot that uses an indexed almost mbox file. There is no way to do it other than good guess. However, for a traditional UNIX mbox file you should be able to nail it perfectly simply looking for the From feature. The dirt stupid mail
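
For the traditional mbox case described above, one possible way to count messages is Python's standard `mailbox` module, which splits a classic mbox file on its "From " separator lines. This is only a sketch: the helper name `count_messages` is made up here, and as noted above it is not expected to give exact counts for indexed or non-standard mbox variants.

```python
import mailbox
import sys


def count_messages(path):
    """Return the number of messages in a traditional mbox file."""
    # create=False: raise an error for a missing file instead of creating it.
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()


if __name__ == "__main__":
    # Usage: python count_mbox.py ham.mbox spam.mbox ...
    total = 0
    for path in sys.argv[1:]:
        n = count_messages(path)
        total += n
        print(f"{path}: {n}")
    print(f"total: {total}")
```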

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kai Schaetzl
Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400: We are very glad and happy about this concept and implementation. Well, the big question is: How many of your spam messages score between the default 5 and your floating score? If it is many there's obviously something wrong with your

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kai Schaetzl
Loren Wilton wrote on Mon, 11 Jul 2005 11:30:07 -0700: Which of course means that by picking the ratio value you can pick pretty much any fp/fn ratio you want. Only if the distribution was equal. Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services:

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kris Deugau
jdow wrote: A few weeks ago I'd have said Easy, Ducky! Then I ran into DoveCot that uses an indexed almost mbox file. There is no way to do it other than good guess. However, for a traditional UNIX mbox file you should be able to nail it perfectly simply looking for the From feature. The dirt

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kai Schaetzl
Kai Schaetzl wrote on Mon, 11 Jul 2005 22:31:29 +0200: With the default of 5 we get almost none, not even one per day. That was about FPs. Wrong. We don't get *any* FPs. We do not get even one *FN* per day. Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services:

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
BTW, if anyone knows a command line program that can easily run through a bunch of mbox files and tell how many messages are in them, I will report back how many ham and how many spam messages that I have fed to bayes. Well, I thought this might give some good stats on the FP:FN ratio, but I

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kelson
Joe Flowers wrote: BTW, if anyone knows a command line program that can easily run through a bunch of mbox files and tell how many messages are in them, I will report back how many ham and how many spam messages that I have fed to bayes. It's far from perfect, but it may offer some interesting info

Re: update on floating dividing score between spam and ham messages

2005-07-10 Thread Joe Flowers
Loren Wilton wrote: This is quite interesting, and seems reasonably obvious that with the right sort of mail (at least, maybe with any mail) this should work better, since it self tunes to your conditions. It does of course assume a reasonable fp/fn rate to start, but SA is generally pretty