-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
btw, I was just rereading this -- an interesting approach you might
want to experiment with, is having *two* boundaries. ie:
negative scores positive scores
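A minimal sketch of the two-boundary idea, assuming scores below a lower boundary are treated as ham, scores at or above an upper boundary as spam, and everything in between as "unsure". The function name and the boundary values in the usage note are hypothetical and are not taken from SpamAssassin or from this thread.

```python
def classify(score, ham_below, spam_at_or_above):
    """Three-way split of a score using two boundaries (hypothetical helper).

    score < ham_below          -> "ham"
    score >= spam_at_or_above  -> "spam"
    anything in between        -> "unsure"
    """
    if ham_below > spam_at_or_above:
        raise ValueError("lower boundary must not exceed upper boundary")
    if score < ham_below:
        return "ham"
    if score >= spam_at_or_above:
        return "spam"
    return "unsure"
```

For example, with made-up boundaries of 2.0 and 8.0, a message scoring 5.0 falls in the "unsure" band. What to do with that band (hold for review, tag differently, and so on) is a policy choice this sketch does not address.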
Justin,
Do you have suggestions on how I should come up with the two boundary
lines and what do I do with the unsure messages?
I'm all ears.
Joe
Justin Mason wrote:
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
btw, I was just rereading this -- an interesting approach you might
want to
Kai Schaetzl wrote:
Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400:
That's bad, really bad
detection ...
No. It's good, really good detection.
You should improve that instead of trying to find a
barrier which gives you the best FP:FN ratio.
I'm not trying to find the best
Joe Flowers wrote on Tue, 12 Jul 2005 11:55:36 -0400:
That's bad, really bad
detection ...
No. It's good, really good detection.
Sorry, I don't want to be rude by repeating myself, but if your average spam
score is something like 6-something the *detection* *is* bad. Maybe not
Joe Flowers wrote:
I don't know if this will help anyone or not, but I wanted to report
back just in case.
In early April, I completely unhinged the dividing line between what SA
score is used to mark a message as spam or ham (5.00 = default). This
allows the system and this dividing line
Matt Kettler wrote:
The only problem I see with this approach is that it treats false positives and
false negatives as being equally bad.
We do get many more false negatives than false positives, even though we
don't get false positives very often - they are rare.
We certainly don't get
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
There's another thing worth noting -- the SpamAssassin score distribution
for hams and spams isn't even.
If you draw a graph of hams and spams, plotting the number of mails in
each category as the vertical axis and the score they get as the
Thanks Jason!
That's good, new info for me. That'll help me *at the very least*
visualize what I am trying to do a little better. I've been very curious
to know what the rough shapes of those graphs look like.
Joe
Justin Mason wrote:
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
Joe Flowers wrote:
Matt Kettler wrote:
The only problem I see with this approach is that it treats false
positives and
false negatives as being equally bad.
We do get many more false negatives than false positives, even though we
don't get false positives very often - they are rare.
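One way to make the objection above concrete is to weight the two kinds of error differently when comparing candidate thresholds. This is only a sketch: the function names, the 10:1 cost ratio, and the idea of scanning a list of candidate thresholds are illustrative assumptions, not something SpamAssassin does or anything proposed in this thread.

```python
def weighted_cost(threshold, ham_scores, spam_scores, fp_cost=10.0, fn_cost=1.0):
    """Cost of a threshold when false positives and false negatives differ in weight.

    A ham scoring at or above the threshold counts as a false positive;
    a spam scoring below it counts as a false negative.
    The default 10:1 weighting is an arbitrary illustration.
    """
    false_positives = sum(1 for s in ham_scores if s >= threshold)
    false_negatives = sum(1 for s in spam_scores if s < threshold)
    return fp_cost * false_positives + fn_cost * false_negatives


def best_threshold(candidates, ham_scores, spam_scores, fp_cost=10.0, fn_cost=1.0):
    """Return the candidate threshold with the lowest weighted cost."""
    return min(
        candidates,
        key=lambda t: weighted_cost(t, ham_scores, spam_scores, fp_cost, fn_cost),
    )
```

With equal costs this reduces to minimising the total number of misclassified messages; raising `fp_cost` pushes the chosen threshold upward, away from the ham scores.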
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
the real-world figures can be seen for various thresholds in
the rules/STATISTICS*.txt files...
- --j.
Matt Kettler writes:
Joe Flowers wrote:
Matt Kettler wrote:
The only problem I see with this approach is that it treats false
positives
From: Matt Kettler [EMAIL PROTECTED]
Joe Flowers wrote:
I don't know if this will help anyone or not, but I wanted to report
back just in case.
In early April, I completely unhinged the dividing line between what SA
score is used to mark a message as spam or ham (5.00 = default). This
score of -2.1532284. I have the dividing line set at 30% of the
distance between the average ham score and average spam score (30% above
the average ham score). So, the dividing line is currently floating
around 0.55416414.
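The floating dividing line described above amounts to a simple interpolation between the two averages. A rough sketch, assuming the averages are plain arithmetic means over collected scores (the function and parameter names are hypothetical, and how the scores are gathered is not shown in the post):

```python
from statistics import mean


def floating_threshold(ham_scores, spam_scores, fraction=0.30):
    """Place the dividing line a given fraction of the way from the
    average ham score up to the average spam score.

    fraction=0.30 mirrors the "30% above the average ham score" setting
    described in the post; 0.50 would be the midpoint of the two means.
    """
    ham_avg = mean(ham_scores)
    spam_avg = mean(spam_scores)
    return ham_avg + fraction * (spam_avg - ham_avg)
```

Because both averages move as new mail is scored, the result drifts over time, which is why the post describes the line as "floating".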
The only problem I see with this approach is that it treats
Matt:
I know you know a lot more about this than I do, but for what it's
worth, your impressions/intuitions are very close to mine.
Originally back in April, I started off using the average of the
means, but that let through way too much spam.
So, what I have now is it set to 30% above the
There's another thing worth noting -- the SpamAssassin score distribution
for hams and spams isn't even.
I don't see that those particular curve shapes necessarily in
any way invalidate this method, although they do bias the method somewhat.
The two curves are essentially smooth
jdow wrote:
The greater the separation choke the
better the results for a decision point between them.
But anything you can do that widens the
typical score distribution between ham and spam is a good thing.
Amen
A few weeks ago I'd have said Easy, Ducky! Then I ran into DoveCot
that uses an indexed almost mbox file. There is no way to do it other
than good guess. However, for a traditional UNIX mbox file you should
be able to nail it perfectly simply looking for the From feature. The
dirt stupid mail
Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400:
We are very glad and happy about this concept and implementation.
Well, the big question is: How many of your spam messages score between
the default 5 and your floating score? If it is many there's obviously
something wrong with your
Loren Wilton wrote on Mon, 11 Jul 2005 11:30:07 -0700:
Which of course means that by picking the ratio value you can pick pretty
much any fp/fn ratio you want.
Only if the distribution was equal.
Kai
--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services:
jdow wrote:
A few weeks ago I'd have said Easy, Ducky! Then I ran into DoveCot
that uses an indexed almost mbox file. There is no way to do it
other than good guess. However, for a traditional UNIX mbox file
you should be able to nail it perfectly simply looking for the From
feature. The dirt
Kai Schaetzl wrote on Mon, 11 Jul 2005 22:31:29 +0200:
With the default of 5 we get almost none, not even one per day.
That was about FPs. Wrong. We don't get *any* FPs. We do not get even one
*FN* per day.
Kai
--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services:
BTW, if anyone knows a command line program that can easily run through a
bunch of mbox files and tell how many messages are in them, I will report
back how many ham and how many spam messages that I have fed to bayes.
Well, I thought this might give some good stats on the FP:FN ratio, but
I
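For a traditional UNIX mbox file, the "look for the From line" approach mentioned elsewhere in this thread can be sketched in a few lines. This assumes the classic mbox convention in which every message starts with a line beginning `From ` and body lines that would look like that are escaped as `>From `; it will not be reliable for indexed or non-standard formats. The function names are hypothetical.

```python
def count_mbox_messages(lines):
    """Count message separators in an iterable of raw mbox lines (bytes).

    Only lines starting with b"From " (with the trailing space) are counted,
    so "From:" headers and escaped ">From " body lines are ignored.
    """
    return sum(1 for line in lines if line.startswith(b"From "))


def count_mbox_file(path):
    """Count messages in one mbox file by scanning it line by line."""
    with open(path, "rb") as handle:
        return count_mbox_messages(handle)
```

Summing `count_mbox_file(p)` over a list of paths gives a total for a set of ham or spam mailboxes. Python's standard `mailbox` module offers a higher-level alternative: `len(mailbox.mbox(path))` should return the same kind of count.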
Joe Flowers wrote:
BTW, if anyone knows a command line program that can easily run through a
bunch of mbox files and tell how many messages are in them, I will
report back how many ham and how many spam messages that I have fed to
bayes. It's far from perfect, but it may offer some interesting info
Loren Wilton wrote:
This is quite interesting, and seems reasonably obvious that with the right
sort of mail (at least, maybe with any mail) this should work better, since
it self tunes to your conditions. It does of course assume a reasonable
fp/fn rate to start, but SA is generally pretty
23 matches