Re: update on floating dividing score between spam and ham messages

2005-07-18 Thread Joe Flowers
Justin, do you have suggestions on how I should come up with the two boundary lines, and what I should do with the "unsure" messages? I'm all ears. Joe Justin Mason wrote: btw, I was just rereading this -- an interesting approach you might want t

Re: update on floating dividing score between spam and ham messages

2005-07-18 Thread Justin Mason
btw, I was just rereading this -- an interesting approach you might want to experiment with is having *two* boundaries, i.e.: negative scores positive scores <|---|---
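The two-boundary idea above could be sketched roughly as a three-way classification. This is only a minimal illustration: the function name `classify` and the boundary values `ham_max` and `spam_min` are hypothetical and do not come from the thread.

```python
def classify(score, ham_max=2.0, spam_min=6.0):
    """Three-way split of a SpamAssassin-style score.

    ham_max and spam_min are hypothetical boundary values chosen only
    for illustration; the thread does not say where to put them.
    """
    if ham_max >= spam_min:
        raise ValueError("ham_max must be below spam_min")
    if score < ham_max:
        return "ham"      # confidently below the lower boundary
    if score >= spam_min:
        return "spam"     # confidently above the upper boundary
    return "unsure"       # falls between the two boundaries


if __name__ == "__main__":
    for s in (-1.5, 4.0, 9.2):
        print(s, classify(s))
```

Messages that land in the "unsure" band could then be handled separately from clear ham and clear spam, for example by holding them for manual review.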

Re: update on floating dividing score between spam and ham messages

2005-07-12 Thread Kai Schaetzl
Joe Flowers wrote on Tue, 12 Jul 2005 11:55:36 -0400: > >That's bad, really bad > >detection ... > > > > > > No. It's good, really good detection. Sorry, I don't want to be rude by repeating myself, but if your average spam score is something like 6-something the *detection* *is* bad. M

Re: update on floating dividing score between spam and ham messages

2005-07-12 Thread Joe Flowers
Kai Schaetzl wrote: Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400: That's bad, really bad detection ... No. It's good, really good detection. You should improve that instead of trying to find a barrier which gives you the best FP:FN ratio. I'm not trying to find the best F

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kelson
Joe Flowers wrote: BTW, if anyone knows a command line program that can easily run through a bunch of mbox files and tell how many messages are in them, I will report back how many ham and how many spam messages that I have fed to bayes. It's far from perfect, but it may offer some interesting info

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
> BTW, if anyone knows a command line program that can easily run through a bunch of mbox files and tell how many messages are in them, I will report > back how many ham and how many spam messages that I have fed to bayes. Well, I thought this might give some good stats on the FP:FN ratio, but I for

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kai Schaetzl
Kai Schaetzl wrote on Mon, 11 Jul 2005 22:31:29 +0200: > With the default of 5 we get almost none, not even one per day. That was about FPs. Wrong. We don't get *any* FPs. We do not get even one *FN* per day. Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: htt

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kris Deugau
jdow wrote: > A few weeks ago I'd have said "Easy, Ducky!" Then I ran into DoveCot > that uses an indexed almost "mbox" file. There is no way to do it > other than "good guess". However, for a traditional UNIX mbox file > you should be able to nail it perfectly simply looking for the "From" > featu

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kai Schaetzl
Loren Wilton wrote on Mon, 11 Jul 2005 11:30:07 -0700: > Which of course means that by picking the ratio value you can pick pretty > much any fp/fn ratio you want. Only if the distribution was equal. Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.c

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kai Schaetzl
Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400: > We are very glad and happy about this concept and implementation. Well, the big question is: How many of your spam messages score between the default 5 and your "floating score"? If it is many there's obviously something wrong with your se

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread jdow
A few weeks ago I'd have said "Easy, Ducky!" Then I ran into DoveCot that uses an indexed almost "mbox" file. There is no way to do it other than "good guess". However, for a traditional UNIX mbox file you should be able to nail it perfectly simply looking for the "From" feature. The dirt stupid "m
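For a traditional UNIX mbox, the "From" approach described above might look something like the sketch below. It assumes that each message starts with a "From " separator line at the start of the file or after a blank line, and that body lines beginning with "From " were escaped as ">From " when the mail was stored. The function names are hypothetical, and this will not handle the indexed Dovecot files mentioned above.

```python
def count_mbox_messages(lines):
    """Count messages in a traditional mbox, given its lines as bytes.

    Assumes a "From " separator at the start of the file or right
    after a blank line, and ">From " escaping inside message bodies.
    """
    count = 0
    prev_blank = True  # the very first line may start a message
    for line in lines:
        if prev_blank and line.startswith(b"From "):
            count += 1
        prev_blank = line.strip() == b""
    return count


def count_mbox_file(path):
    """Convenience wrapper: count messages in the mbox file at path."""
    with open(path, "rb") as fh:
        return count_mbox_messages(fh)
```

Running `count_mbox_file` over each ham and spam mbox would give the per-corpus message totals asked about earlier in the thread.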

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
jdow wrote: > The greater the separation the > better the results for a decision point between them. > But anything you can do that widens the > typical score distribution between ham and spam is a good thing. Amen

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Loren Wilton
> There's another thing worth noting -- the SpamAssassin score distribution > for hams and spams isn't even. I don't see that those particular curve shapes necessarily invalidate this method, although they do bias the method somewhat. The two curves are essentially smooth cu

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
Matt: I know you know a lot more about this than I do, but for what it's worth, your impressions/intuitions are very close to mine. Originally back in April, I started off using the "average of the means", but that let through way too much spam. So, what I have now is it set to 30% above th

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Loren Wilton
> > score of -2.1532284. I have the dividing line "set" at 30% of the > > distance between the average ham score and average spam score (30% above > > the average ham score). So, the dividing line is currently floating > > around 0.55416414. > > > The only problem I see with this approach is that i
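The floating dividing line quoted above amounts to a simple interpolation between the two averages. A minimal sketch, assuming the averages are plain arithmetic means over recently seen scores; the function name and the sample scores are made up for illustration and are not Joe's actual data:

```python
from statistics import mean


def floating_threshold(ham_scores, spam_scores, fraction=0.30):
    """Dividing line placed `fraction` of the way from the average
    ham score toward the average spam score (0.30 as in the thread)."""
    ham_avg = mean(ham_scores)
    spam_avg = mean(spam_scores)
    return ham_avg + fraction * (spam_avg - ham_avg)


if __name__ == "__main__":
    # Hypothetical scores: ham averages -2.0, spam averages 8.0.
    print(floating_threshold([-3.0, -1.0], [7.0, 9.0]))
```

With `fraction=0.5` this reduces to the "average of the means" that was tried first; lowering the fraction moves the line toward the ham average, which flags more mail as spam.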

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread jdow
From: "Matt Kettler" <[EMAIL PROTECTED]> > Joe Flowers wrote: > > I don't know if this will help anyone or not, but I wanted to report > > back just in case. > > > > In early April, I completely unhinged the dividing line between what SA > > score is used to mark a message as spam or ham (5.00 = d

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Justin Mason
the real-world figures can be seen for various thresholds in the rules/STATISTICS*.txt files... --j. Matt Kettler writes: > Joe Flowers wrote: > > Matt Kettler wrote: > > > >> The only problem I see with this approach is that it treats false > >>

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Matt Kettler
Joe Flowers wrote: > Matt Kettler wrote: > >> The only problem I see with this approach is that it treats false >> positives and >> false negatives as being equally bad. >> >> > > We do get many more false negatives than false positives, even though we > don't get false positives very often - t
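One way to stop treating false positives and false negatives as equally bad is to pick the threshold that minimizes a weighted error count over already-classified mail. The sketch below is only an illustration of that idea, not anything proposed in the thread; the function name, the cost weights, and the sample scores are all hypothetical.

```python
def best_threshold(ham_scores, spam_scores, fp_cost=10.0, fn_cost=1.0):
    """Pick the candidate threshold with the lowest weighted error cost.

    A message is treated as spam when its score is >= the threshold.
    fp_cost and fn_cost are hypothetical weights: here one false
    positive counts as much as ten false negatives.
    """
    candidates = sorted(set(ham_scores) | set(spam_scores))

    def cost(threshold):
        false_pos = sum(1 for s in ham_scores if s >= threshold)
        false_neg = sum(1 for s in spam_scores if s < threshold)
        return fp_cost * false_pos + fn_cost * false_neg

    # min() returns the lowest-cost candidate; ties go to the smallest one.
    return min(candidates, key=cost)


if __name__ == "__main__":
    ham = [-2, 0, 1, 4]    # made-up ham scores
    spam = [3, 6, 8, 9]    # made-up spam scores
    print(best_threshold(ham, spam))
    print(best_threshold(ham, spam, fp_cost=1.0, fn_cost=1.0))
```

Raising `fp_cost` relative to `fn_cost` pushes the chosen threshold upward, trading more missed spam for fewer misfiled legitimate messages.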

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
Thanks Jason! That's good, new info for me. That'll help me *at the very least* visualize what I am trying to do a little better. I've been very curious to know what the rough shapes of those graphs look like. Joe Justin Mason wrote: There'

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Justin Mason
There's another thing worth noting -- the SpamAssassin score distribution for hams and spams isn't even. If you draw a graph of hams and spams, plotting the number of mails in each category as the vertical axis and the score they get as the horizonta

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
Matt Kettler wrote: The only problem I see with this approach is that it treats false positives and false negatives as being equally bad. We do get many more false negatives than false positives, even though we don't get false positives very often - they are rare. We certainly don't get 1

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Matt Kettler
Joe Flowers wrote: > I don't know if this will help anyone or not, but I wanted to report > back just in case. > > In early April, I completely unhinged the dividing line between what SA > score is used to mark a message as spam or ham (5.00 = default). This > allows the system and this dividing l

Re: update on floating dividing score between spam and ham messages

2005-07-10 Thread Joe Flowers
Loren Wilton wrote: This is quite interesting, and it seems reasonably obvious that with the right sort of mail (at least, maybe with any mail) this should work better, since it self-tunes to your conditions. It does of course assume a reasonable fp/fn rate to start, but SA is generally pretty goo

Re: update on floating dividing score between spam and ham messages

2005-07-10 Thread Loren Wilton
This is quite interesting, and it seems reasonably obvious that with the right sort of mail (at least, maybe with any mail) this should work better, since it self-tunes to your conditions. It does of course assume a reasonable fp/fn rate to start, but SA is generally pretty good about that. How have