RW wrote:
On Tue, 27 Oct 2009 15:01:39 +0100
Sam <liste-spamassas...@ingescom.com> wrote:

RW wrote:
On Tue, 27 Oct 2009 13:33:14 +0100

If you find it surprising that that can happen, you don't understand
how Bayes works. It's a learning system that's intended to classify
mail it hasn't seen based on mail it has seen.
I agree with you for unseen mail. But after training with sa-learn
I expected the Bayes result to rise above BAYES_50 for the same learned
message.

Most mails contain a number of hapaxes, one-off tokens that are never
seen again. If you train on a mail and then retest, hapaxes and other
rare tokens often skew the result to produce a positive match; this is
why sometimes a retest will score BAYES_99, but an almost identical spam
will hit BAYES_50.
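
To make the hapax effect concrete, here is a minimal Python sketch of a Robinson-style token probability combined with Fisher's chi-squared method. It is an illustration only, not SpamAssassin's code: the function names, the constants s=1.0 and x=0.5, the corpus sizes and the 30-token message are all assumptions chosen for the example.

```python
import math

def token_probability(spam_count, ham_count, nspam, nham, s=1.0, x=0.5):
    """Smoothed spam probability for one token (Robinson-style).

    s and x are illustrative values, not SpamAssassin's real constants.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # never-seen token: fall back to the neutral prior
    spam_rate = spam_count / max(nspam, 1)
    ham_rate = ham_count / max(nham, 1)
    p = spam_rate / (spam_rate + ham_rate)
    return (s * x + n * p) / (s + n)

def chi2q(x2, dof):
    """Upper tail of the chi-squared distribution for an even dof."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Fisher-style combination of per-token probabilities."""
    n = len(probs)
    spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spamminess - hamminess + 1.0) / 2.0

# A hypothetical spam with 30 one-off tokens, retested after being learned
# once as spam: every token now has spam_count=1, ham_count=0.
retested = [token_probability(1, 0, 1001, 1000)] * 30

# An almost identical spam whose 30 one-off tokens differ: all still unseen.
similar_new_spam = [token_probability(0, 0, 1001, 1000)] * 30

print("retested message:", combine(retested))          # close to 1
print("similar new spam:", combine(similar_new_spam))  # 0.5
```

Each hapax moves from the neutral 0.5 to 0.75 after a single training pass, and thirty of them agreeing is enough to push the combined result close to 1, while the near-identical message with fresh one-off tokens stays at 0.5.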

On some retests the hapaxes don't dominate and the probability stays
close to 0.5. Like many such filters, BAYES clusters strongly around
0, 0.5 and 1. If it allowed you to retrain to exhaustion (which it
doesn't) you would probably see several BAYES_50 results followed by a
step change to BAYES_99.
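
For reference, the BAYES_xx rule names are just bands over that probability. The Python sketch below maps a probability to a rule name. The band edges are as I recall them from the stock Bayes rule file and the handling of exact boundary values is a guess, so treat both as assumptions and check the rules shipped with your own install. The names BAYES_BANDS and bayes_rule are made up for this example.

```python
# Upper edge of each band and the rule it maps to.
# Assumed values: verify against your installed SpamAssassin rules.
BAYES_BANDS = [
    (0.01, "BAYES_00"),
    (0.05, "BAYES_05"),
    (0.20, "BAYES_20"),
    (0.40, "BAYES_40"),
    (0.60, "BAYES_50"),
    (0.80, "BAYES_60"),
    (0.95, "BAYES_80"),
    (0.99, "BAYES_95"),
    (1.00, "BAYES_99"),
]

def bayes_rule(probability):
    """Return the rule name whose band contains the given probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    for upper, name in BAYES_BANDS:
        if probability <= upper:
            return name
    return BAYES_BANDS[-1][1]  # not reached for valid input

for p in (0.003, 0.5, 0.7, 0.995):
    print(p, bayes_rule(p))
```

Because the results cluster near 0, 0.5 and 1, in practice most mail lands in BAYES_00, BAYES_50 or BAYES_99, and the intermediate bands are hit less often.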


Check that you haven't set "bayes_use_hapaxes 0". Otherwise if you are
seeing a lot of trained mails hit BAYES_50 on retesting (and I mean 10%
or so) you may have a mistrained database. If you only see a few, forget
about it.

There is no hapax option set.
When a spam isn't flagged by SpamAssassin and doesn't hit BAYES_99, I
always train it manually with sa-learn. As far as I can tell, sa-learn
has always moved the message from BAYES_xx to BAYES_99 when I retested
it after learning.

I do not remember ever running into this situation before.

Thanks.
