Justin,
Do you have suggestions on how I should come up with the two boundary
lines, and what I should do with the "unsure" messages?
I'm all ears.
Joe
Justin Mason wrote:
btw, I was just rereading this -- an interesting approach you might
want to experiment with is having *two* boundaries, i.e.:
negative scores positive scores
<|---|---
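The message above is cut off, so what follows is only a minimal sketch of the general two-boundary idea, not Justin's actual proposal. The function name `classify`, the boundary values, and the choice to leave the middle band as "unsure" are all assumptions made for illustration.

```python
def classify(score, lower, upper):
    """Three-way split of a SpamAssassin-style score.

    Below `lower` is ham, at or above `upper` is spam, and anything
    in between is left as "unsure" for some other handling.
    """
    if lower > upper:
        raise ValueError("lower boundary must not exceed upper boundary")
    if score < lower:
        return "ham"
    if score >= upper:
        return "spam"
    return "unsure"


# Hypothetical boundaries, chosen only to show the three outcomes.
LOWER, UPPER = 0.5, 5.0

for s in (-2.1, 0.5, 3.0, 5.0, 9.7):
    print(s, classify(s, LOWER, UPPER))
```

How the two boundaries are chosen, and what is done with the "unsure" band, is exactly the open question in this thread; the sketch only shows the mechanics once two values exist.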
Joe Flowers wrote on Tue, 12 Jul 2005 11:55:36 -0400:
> >That's bad, really bad
> >detection ...
> >
> >
>
> No. It's good, really good detection.
Sorry, I don't want to be rude by repeating myself, but if your average spam
score is something like 6-something the *detection* *is* bad. M
Kai Schaetzl wrote:
Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400:
That's bad, really bad
detection ...
No. It's good, really good detection.
You should improve that instead of trying to find a
barrier which gives you the best FP:FN ratio.
I'm not trying to find the best F
Joe Flowers wrote:
BTW, if anyone knows a command line program that can easily run through a
bunch of mbox files and tell how many messages are in them, I will
report back how many ham and how many spam messages that I have fed to
bayes. It's far from perfect, but it may offer some interesting info
> BTW, if anyone knows a command line program that can easily run through a
> bunch of mbox files and tell how many messages are in them, I will report
> back how many ham and how many spam messages that I have fed to bayes.
Well, I thought this might give some good stats on the FP:FN ratio, but
I for
Kai Schaetzl wrote on Mon, 11 Jul 2005 22:31:29 +0200:
> With the default of 5 we get almost none, not even one per day.
That was about FPs. Wrong. We don't get *any* FPs. We do not get even one
*FN* per day.
Kai
--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: htt
jdow wrote:
> A few weeks ago I'd have said "Easy, Ducky!" Then I ran into DoveCot
> that uses an indexed almost "mbox" file. There is no way to do it
> other than "good guess". However, for a traditional UNIX mbox file
> you should be able to nail it perfectly simply looking for the "From"
> featu
Loren Wilton wrote on Mon, 11 Jul 2005 11:30:07 -0700:
> Which of course means that by picking the ratio value you can pick pretty
> much any fp/fn ratio you want.
Only if the distribution was equal.
Kai
Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400:
> We are very glad and happy about this concept and implementation.
Well, the big question is: How many of your spam messages score between
the default 5 and your "floating score"? If it is many, there's obviously
something wrong with your se
A few weeks ago I'd have said "Easy, Ducky!" Then I ran into DoveCot
that uses an indexed almost "mbox" file. There is no way to do it other
than "good guess". However, for a traditional UNIX mbox file you should
be able to nail it perfectly simply looking for the "From" feature. The
dirt stupid "m
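For a traditional UNIX mbox file, one way to get the count without writing a "From " line scanner by hand is Python's standard `mailbox` module, which splits on those separator lines itself. This is a minimal sketch; the helper names `count_messages` and `count_many` are made up for this example, and it does not handle the indexed Dovecot-style files mentioned above.

```python
import mailbox


def count_messages(path):
    """Return the number of messages in one traditional mbox file."""
    # create=False makes a missing file an error instead of creating it.
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def count_many(paths):
    """Map each mbox path to its message count."""
    return {path: count_messages(path) for path in paths}
```

Calling `count_many` once with the ham mbox paths and once with the spam mbox paths, then summing the values, gives the two totals asked for earlier in the thread. Unquoted body lines that begin with "From " would be counted as extra messages, so the result is only as good as the mbox quoting.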
jdow wrote:
> The greater the separation the
> better the results for a decision point between them.
> But anything you can do that widens the
> typical score distribution between ham and spam is a good thing.
Amen
> There's another thing worth noting -- the SpamAssassin score distribution
> for hams and spams isn't even.
I don't see that those particular curve shapes necessarily invalidate
this method, although they do bias it somewhat.
The two curves are essentially smooth cu
Matt:
I know you know a lot more about this than I do, but for what it's
worth, your impressions/intuitions are very close to mine.
Originally back in April, I started off using the "average of the
means", but that let through way too much spam.
So, what I have now is it set to 30% above th
> > score of -2.1532284. I have the dividing line "set" at 30% of the
> > distance between the average ham score and average spam score (30% above
> > the average ham score). So, the dividing line is currently floating
> > around 0.55416414.
>
>
> The only problem I see with this approach is that i
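The floating dividing line quoted above can be sketched as a simple interpolation between the two averages. This is a rough illustration, not Joe's actual code: the function names are invented here, and reading -2.1532284 as the average ham score is an assumption, since the text before it is cut off.

```python
from statistics import mean


def threshold_from_means(ham_avg, spam_avg, fraction=0.30):
    """Place the dividing line `fraction` of the way from the
    average ham score toward the average spam score."""
    return ham_avg + fraction * (spam_avg - ham_avg)


def floating_threshold(ham_scores, spam_scores, fraction=0.30):
    """Same rule, computed from lists of observed scores."""
    return threshold_from_means(mean(ham_scores), mean(spam_scores), fraction)


# With the quoted figures, a ham average of -2.1532284 and a line at
# 0.55416414 would imply a spam average of roughly 6.87 (assumption:
# both numbers were taken at the same moment).
print(threshold_from_means(-2.1532284, 6.8714134))
```

Setting `fraction=0.5` gives the "average of the means" variant mentioned earlier in the thread, which was reported to let through too much spam.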
From: "Matt Kettler" <[EMAIL PROTECTED]>
> Joe Flowers wrote:
> > I don't know if this will help anyone or not, but I wanted to report
> > back just in case.
> >
> > In early April, I completely unhinged the dividing line between what SA
> > score is used to mark a message as spam or ham (5.00 = d
the real-world figures can be seen for various thresholds in
the rules/STATISTICS*.txt files...
--j.
Matt Kettler writes:
> Joe Flowers wrote:
> > Matt Kettler wrote:
> >
> >> The only problem I see with this approach is that it treats false
> >>
Joe Flowers wrote:
> Matt Kettler wrote:
>
>> The only problem I see with this approach is that it treats false
>> positives and
>> false negatives as being equally bad.
>>
>>
>
> We do get many more false negatives than false positives, even though we
> don't get false positives very often - t
Thanks Jason!
That's good, new info for me. That'll help me *at the very least*
visualize what I am trying to do a little better. I've been very curious
to know what the rough shapes of those graphs look like.
Joe
Justin Mason wrote:
There's another thing worth noting -- the SpamAssassin score distribution
for hams and spams isn't even.
If you draw a graph of hams and spams, plotting the number of mails in
each category as the vertical axis and the score they get as the
horizonta
Matt Kettler wrote:
The only problem I see with this approach is that it treats false positives and
false negatives as being equally bad.
We do get many more false negatives than false positives, even though we
don't get false positives very often - they are rare.
We certainly don't get 1
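Matt's point about false positives and false negatives not being equally bad can be made concrete by counting each kind of error at a candidate dividing line and weighting them differently. The sketch below is illustrative only: the function names, the `fp_weight` of 10, and the convention that a score at or above the threshold counts as spam are assumptions, and the scores in the usage lines are made up.

```python
def error_counts(ham_scores, spam_scores, threshold):
    """Return (false_positives, false_negatives) at `threshold`.

    A ham scoring at or above the threshold is a false positive;
    a spam scoring below it is a false negative.
    """
    fp = sum(1 for s in ham_scores if s >= threshold)
    fn = sum(1 for s in spam_scores if s < threshold)
    return fp, fn


def weighted_cost(ham_scores, spam_scores, threshold, fp_weight=10.0):
    """Total cost when one false positive counts as `fp_weight` misses."""
    fp, fn = error_counts(ham_scores, spam_scores, threshold)
    return fp_weight * fp + fn


# Made-up scores, only to show the calls.
ham = [-2.0, 0.0, 6.0]
spam = [1.0, 7.0, 9.0]
print(error_counts(ham, spam, 5.0))
print(weighted_cost(ham, spam, 5.0))
```

Evaluating `weighted_cost` over a range of thresholds on one's own hand-sorted mail would show how far the dividing line moves once false positives are penalised more heavily.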
Joe Flowers wrote:
> I don't know if this will help anyone or not, but I wanted to report
> back just in case.
>
> In early April, I completely unhinged the dividing line between what SA
> score is used to mark a message as spam or ham (5.00 = default). This
> allows the system and this dividing l
Loren Wilton wrote:
This is quite interesting, and it seems reasonably obvious that with the right
sort of mail (at least, maybe with any mail) this should work better, since
it self-tunes to your conditions. It does of course assume a reasonable
fp/fn rate to start, but SA is generally pretty good about that.
How have