A few weeks ago I'd have said "Easy, Ducky!" Then I ran into Dovecot, which uses an indexed, almost-"mbox" file. For those files there is no way to count messages other than a "good guess". However, for a traditional UNIX mbox file you should be able to nail it perfectly simply by looking for the "From " separator lines. The dirt-stupid "mail" utility looks for a blank line followed by a line that starts with "From " (the word From and a space). All other lines that start with "From " are supposed to be escaped, conventionally as ">From ", to ensure accurate detection. Dovecot sometimes skips the blank line. "mail" does not like this. I have not yet seen any indication that SA is upset with this, however.
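If nothing ready-made turns up, the rule above is easy to script. What follows is only a rough sketch, not a tested tool: the function names and the `require_blank` switch are my own invention. Strict mode mimics "mail" (a "From " line counts only at the start of the file or after a blank line). Lenient mode counts every line starting with "From ", which is only right if the writer escaped body lines properly, and may be the better guess for files where the blank line is missing.

```python
def count_mbox_messages(path, require_blank=True):
    """Count messages in an mbox file by counting "From " separator lines.

    require_blank=True  : strict rule, the separator must be the first line
                          of the file or directly follow a blank line.
    require_blank=False : lenient rule, any line starting with "From "
                          counts (assumes body lines were escaped).
    """
    count = 0
    prev_blank = True  # treat the start of the file like "after a blank line"
    with open(path, "rb") as f:  # bytes mode: mail files have mixed encodings
        for line in f:
            if line.startswith(b"From ") and (prev_blank or not require_blank):
                count += 1
            prev_blank = line in (b"\n", b"\r\n")
    return count


def count_many(paths, require_blank=True):
    """Return {path: message_count} for several mbox files."""
    return {p: count_mbox_messages(p, require_blank) for p in paths}
```

Calling `count_many(sys.argv[1:])` from a small wrapper script and printing the result would give the per-file totals. If strict and lenient counts disagree for a file, that file probably has missing blank lines or unescaped "From " lines in message bodies.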
{^_^}
----- Original Message -----
From: "Joe Flowers" <[EMAIL PROTECTED]>

> Matt:
>
> I know you know a lot more about this than I do, but for what it's
> worth, your impressions/intuitions are very close to mine.
> Originally back in April, I started off using the "average of the
> means", but that let through way too much spam.
>
> So, what I have now is it set to 30% above the average spam score, which
> is 20% below the "average of the means".
> The assumption being that the optimal spot is somewhere between the two
> averages.
>
> Also, that nasty drop off that produces a lot of FPs is in my intuition
> too and as of yet, we haven't run into it.
>
> Now, if the two curves could be slid apart wider so that there is a big
> deadzone,... Although, without upgrading to a newer version of SA, I
> don't see how I can expect much better results.
>
> BTW, if anyone knows a command line program that can easily run through a
> bunch of mbox files and tell how many messages are in them, I will
> report back how many ham and how many spam messages that I have fed to
> bayes. It's far from perfect, but it may offer some interesting info
> regarding the 100:1 (fn:fp) ratio.
>
> Joe
>
>
> Matt Kettler wrote:
>
> >Joe Flowers wrote:
> >
> >>Matt Kettler wrote:
> >>
> >>>The only problem I see with this approach is that it treats false
> >>>positives and false negatives as being equally bad.
> >>
> >>We do get many more false negatives than false positives, even though we
> >>don't get false positives very often - they are rare.
> >>We certainly don't get 1 fp for every fn.
> >>
> >>>In general, you're adjusting the score bias so the number of FP's and
> >>>FNs are approximately equal.
> >>
> >>This is not what we are seeing in practice. It's not even close to 50-50.
> >
> >Based on JM's comments about the score distribution for hams being non-linear,
> >this makes sense. If the distribution was linear for both you'd get 50/50 by
> >dividing the score between the two means.
> >
> >Since the ham is going to have a pretty sharp drop-off somewhere slightly above
> >its mean your split score approach won't be as bad as 1:1, but it's also likely
> >to not be as good as 100:1 which the 5.0 threshold should get you.
> >
> >It's an interesting concept, and it would be very interesting to graph out FP vs
> >FN rates against thresholds.
> >
> >This graph from JM's post is real data:
> >http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
> >
> >But it doesn't go below 5.0. It would be interesting to see how those curves
> >continue as you approach 0.
> >
> >This graph is a good conceptual one in the "normal" sense of numbers:
> >http://taint.org/xfer/2005/score-dist-doodle.gif
> >
> >That graph would suggest that somewhere below 5.0 there is a threshold at which
> >the ham FP rate gets MUCH worse in a very sudden way. However, there's no score
> >associated. I'd venture to guess that your "average of the means" is going to
> >wind up picking something near, but just above that threshold.
> >
> >That's a bit of an intuitive guess, but also it has some roots in reality. The
> >average score of a ham message on a curve like that is going to wind up being
> >somewhere in the middle of that nasty drop off. By biasing just above that you
> >should bring yourself into the second part of the curve, where decreases in
> >score have a somewhat modest impact on FP rate.
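
On the idea of graphing FP vs FN rates against thresholds: once the ham and spam scores are pulled out of a hand-sorted corpus, the table behind such a graph takes only a few lines. This is a sketch under assumptions, not anything SA ships. The function names are made up, the scores in the usage note are invented sample numbers, and it assumes a message is tagged as spam when its score is at or above the threshold.

```python
from statistics import mean


def fp_fn_rates(ham_scores, spam_scores, threshold):
    """Return (fp_rate, fn_rate) at one threshold.

    FP: ham scoring at or above the threshold (tagged as spam).
    FN: spam scoring below the threshold (let through).
    """
    fp = sum(1 for s in ham_scores if s >= threshold)
    fn = sum(1 for s in spam_scores if s < threshold)
    return fp / len(ham_scores), fn / len(spam_scores)


def average_of_means(ham_scores, spam_scores):
    """The "average of the means" threshold discussed in the thread."""
    return (mean(ham_scores) + mean(spam_scores)) / 2


def sweep(ham_scores, spam_scores, thresholds):
    """Return a list of (threshold, fp_rate, fn_rate) rows for plotting."""
    return [(t, *fp_fn_rates(ham_scores, spam_scores, t)) for t in thresholds]
```

For example, with invented scores `ham = [0.0, 1.0, 2.0, 6.0]` and `spam = [3.0, 8.0, 10.0, 15.0]`, `sweep(ham, spam, [0.0, 2.5, 5.0, average_of_means(ham, spam)])` gives one row per threshold that can be fed to any plotting tool. Real corpora would be needed to say anything about where the ham drop-off actually sits.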