>The biggest problem with a score-based system with an abrupt cutoff is the
>uncertainty around the threshold. If the GA currently thinks it's OK for a
>ham to score 4.9 and still be called ham, and a spam to score 5.1 and
>still be called spam, it's not going to make as much effort to get a
>clean separation of scores as it would if you told it, "OK, make sure ham
>is below 4 and spam is above 6 as much as possible". Or am I missing
>something?

Another point I forgot to mention is that a split threshold during the GA
like this may not show up directly in the statistics generated during the
GA run. The GA is optimizing the scores to match the specific corpus(es)
it's running against, so you'd probably get much the same statistics; only
the score distribution of the individual messages would be slightly
different. (In other words, the statistics are affected largely by the
ruleset, Bayes performance, etc. rather than by the specific threshold
chosen.)
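
To make the split threshold concrete, here is a rough sketch of how the GA
could count misses. This is my own illustration, not SpamAssassin's actual
fitness code; the function name and parameters (misses, ham_max, spam_min)
are hypothetical.

```python
# Hypothetical sketch; NOT the real SpamAssassin GA fitness function.

def misses(scores, labels, ham_max=5.0, spam_min=5.0):
    """Count messages the GA would treat as wrong under a split threshold.

    labels: True for spam, False for ham.
    Ham is a miss when it scores at or above ham_max;
    spam is a miss when it scores below spam_min.
    """
    count = 0
    for score, is_spam in zip(scores, labels):
        if is_spam and score < spam_min:
            count += 1
        elif not is_spam and score >= ham_max:
            count += 1
    return count

# The borderline ham (4.9) and spam (5.1) from the quoted example,
# plus one clear ham and one clear spam.
scores = [4.9, 5.1, 1.2, 9.7]
labels = [False, True, False, True]

single = misses(scores, labels)                            # 5/5 threshold
split = misses(scores, labels, ham_max=4.0, spam_min=6.0)  # 6/4 threshold
print(single, split)
```

With 5/5 the two borderline messages count as fine, so the GA has no reason
to move them. With 6/4 both count as misses, so the GA is pushed to widen
the gap.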

However, if you then did a statistics run on an *independent* corpus of
similar but (mostly) non-overlapping messages, ones for which the GA hasn't
had a chance to optimize the scores, I think there would be a definite
improvement in the FN/FP rate on that corpus with scores generated using
the 6/4 threshold instead of 5/5.
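
One possible way to run that comparison, as a sketch only: score the
independent corpus once with each score set, then compute FP/FN rates at
the normal threshold of 5. The helper name fp_fn_rates is hypothetical, and
the numbers below are made up purely to show the call, not measured
results.

```python
# Hypothetical sketch of the held-out check; the data is invented.

def fp_fn_rates(scores, labels, threshold=5.0):
    """Return (fp_rate, fn_rate) at a single delivery threshold.

    labels: True for spam, False for ham.
    FP = ham scoring at or above the threshold.
    FN = spam scoring below the threshold.
    """
    ham = [s for s, is_spam in zip(scores, labels) if not is_spam]
    spam = [s for s, is_spam in zip(scores, labels) if is_spam]
    if not ham or not spam:
        raise ValueError("need at least one ham and one spam message")
    fp = sum(1 for s in ham if s >= threshold)
    fn = sum(1 for s in spam if s < threshold)
    return fp / len(ham), fn / len(spam)

# Made-up scores for six messages from an independent corpus.
heldout_labels = [False, False, False, True, True, True]
heldout_scores = [1.0, 4.8, 5.2, 5.5, 4.7, 8.0]

fp_rate, fn_rate = fp_fn_rates(heldout_scores, heldout_labels)
print(f"FP rate {fp_rate:.3f}, FN rate {fn_rate:.3f}")
```

Running this once on scores from a 5/5 GA run and once on scores from a 6/4
GA run, against the same independent corpus, would show whether the split
threshold actually helps.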

Is anybody able to blow holes in my theory or suggest a way of proving it?

Regards,
Simon



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
