Re: update on floating dividing score between spam and ham messages

jdow Mon, 11 Jul 2005 12:10:53 -0700

A few weeks ago I'd have said "Easy, Ducky!" Then I ran into Dovecot,
which uses an indexed, almost-"mbox" file. For that there is no way to
do it other than a good guess. For a traditional UNIX mbox file, however,
you should be able to nail it perfectly simply by looking for the "From "
separator. The dirt stupid "mail" utility looks for a blank line followed
by a line that starts with "From ". All other lines that start with
"From " are supposed to be escaped to ensure accurate detection. Dovecot
sometimes skips the blank line. "mail" does not like this. I have not yet
seen any indication that SA is upset with this, however.
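
A rough sketch of that counting rule in Python, for a traditional mbox
only. It assumes every message starts with a "From " line that is either
the first line of the file or follows a blank line; the function name is
made up for illustration, and a Dovecot-style file that drops the blank
line would be undercounted.

```python
import sys


def count_mbox_messages(path):
    """Count "From " separator lines at file start or after a blank line."""
    count = 0
    prev_blank = True  # the first line of the file may start a message
    with open(path, "rb") as f:
        for line in f:
            if prev_blank and line.startswith(b"From "):
                count += 1
            # strip() also treats a bare "\r\n" as a blank line
            prev_blank = line.strip() == b""
    return count


if __name__ == "__main__":
    # Usage: python count_mbox.py ham.mbox spam.mbox
    for path in sys.argv[1:]:
        print(f"{path}: {count_mbox_messages(path)}")
```

Reading the file as bytes sidesteps any guessing about the character
encoding of the messages.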


{^_^}
----- Original Message ----- 
From: "Joe Flowers" <[EMAIL PROTECTED]>

> Matt:
>
> I know you know a lot more about this than I do, but for what it's
> worth, your impressions/intuitions are very close to mine.
> Originally back in April, I started off using the "average of the
> means", but that let through way too much spam.
>
> So, what I have now is it set to 30% above the average spam score, which
> is 20% below the "average of the means".
> The assumption being that the optimal spot is somewhere between the two
> averages.
>
> Also, that nasty drop-off that produces a lot of FPs is in my intuition
> too, and as of yet we haven't run into it.
>
> Now, if the two curves could be slid apart wider so that there is a big
> deadzone,... Although, without upgrading to a newer version of SA, I
> don't see how I can expect much better results.
>
> BTW, if anyone knows a command line program that can easily run through a
> bunch of mbox files and tell how many messages are in them, I will
> report back how many ham and how many spam messages that I have fed to
> bayes. It's far from perfect, but it may offer some interesting info
> regarding the 100:1 (fn:fp) ratio.
>
> Joe
>
>
> Matt Kettler wrote:
>
> >Joe Flowers wrote:
> >
> >
> >>Matt Kettler wrote:
> >>
> >>
> >>
> >>>The only problem I see with this approach is that it treats false
> >>>positives and false negatives as being equally bad.
> >>>
> >>>
> >>>
> >>>
> >>We do get many more false negatives than false positives, even though we
> >>don't get false positives very often - they are rare.
> >>We certainly don't get 1 fp for every fn.
> >>
> >>
> >>
> >>>In general, you're adjusting the score bias so the number of FPs and
> >>>FNs are approximately equal.
> >>>
> >>>
> >>This is not what we are seeing in practice. It's not even close to
> >>50-50.
> >>
> >>
> >>
> >
> >Based on JM's comments about the score distribution for hams being
> >non-linear, this makes sense. If the distribution was linear for both
> >you'd get 50/50 by dividing the score between the two means.
> >
> >Since the ham is going to have a pretty sharp drop-off somewhere slightly
> >above its mean, your split score approach won't be as bad as 1:1, but it's
> >also likely to not be as good as 100:1, which the 5.0 threshold should get
> >you.
> >
> >It's an interesting concept, and it would be very interesting to graph
> >out FP vs FN rates against thresholds.
> >
> >This graph from JM's post is real data:
> >http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
> >
> >But it doesn't go below 5.0. It would be interesting to see how those
> >curves continue as you approach 0.
> >
> >This graph is a good conceptual one in the "normal" sense of numbers:
> >http://taint.org/xfer/2005/score-dist-doodle.gif
> >
> >That graph would suggest that somewhere below 5.0 there is a threshold at
> >which the ham FP rate gets MUCH worse in a very sudden way. However,
> >there's no score associated. I'd venture to guess that your "average of
> >the means" is going to wind up picking something near, but just above
> >that threshold.
> >
> >That's a bit of an intuitive guess, but also it has some roots in
> >reality. The average score of a ham message on a curve like that is going
> >to wind up being somewhere in the middle of that nasty drop off. By biasing
> >just above that you should bring yourself into the second part of the
> >curve, where decreases in score have a somewhat modest impact on FP rate.
> >
> >
> >
>
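
For the FP vs FN against threshold graph suggested in the quoted message,
the numbers behind it could be tabulated along these lines. This is only
a sketch: the function names and the sample scores are made up for
illustration, and it assumes a message is tagged as spam when its score
is at or above the threshold.

```python
from statistics import mean


def fp_fn_by_threshold(ham_scores, spam_scores, thresholds):
    """Per threshold: ham scoring at or above it (FP), spam below it (FN)."""
    rows = []
    for t in thresholds:
        fp = sum(1 for s in ham_scores if s >= t)
        fn = sum(1 for s in spam_scores if s < t)
        rows.append((t, fp, fn))
    return rows


def midpoint_of_means(ham_scores, spam_scores):
    """The "average of the means": halfway between the two mean scores."""
    return (mean(ham_scores) + mean(spam_scores)) / 2


# Made-up scores, for illustration only; real ones would come from a corpus.
ham = [-2.0, 0.0, 0.5, 1.0, 1.5, 6.0]
spam = [3.0, 4.5, 8.0, 12.0, 20.0]

for t, fp, fn in fp_fn_by_threshold(ham, spam, [2.0, 5.0, 8.0]):
    print(f"threshold {t}: FP={fp} FN={fn}")
```

Feeding it a finer list of thresholds between 0 and 5.0 would show where
the FP count starts to climb.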
