On 6/29/07, Tom Allison <[EMAIL PROTECTED]> wrote:

The thought I had, and have been working on for a while, is changing
how the scoring is done.  Rather than making Bayes a part of the
scoring process, make the scoring process a part of the Bayes
statistical Engine.  As an example, you would simply feed the
indications of scoring hits (binary yes/no) into the Bayesian process
as tokens, to be examined next to the other tokens in the message.

There are a few problems with this.

(1) It assumes that Bayesian (or similar) classification is more
accurate than SA's scoring system.  Either that, or you're willing to
give up accuracy in the name of removing all those confusing knobs you
don't want to touch.  It seems to me it would be better to keep the
knobs and just not touch them.

(2) For many SA rules you would be, in effect, double-counting some
tokens.  An SA scoring rule that matches a phrase, for example, is
effectively matching a collection of tokens that are also being fed
individually to the Bayes engine.  In theory, you should not
second-guess the system by passing such compound tokens to Bayes;
instead it should be allowed to learn what combinations of tokens are
meaningful when they appear together.

(It might be worthwhile, though, to e.g. add tokens that are not
otherwise present in the message, such as for the results of network
tests.)
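
To make the double-counting in (2) concrete, here is a rough sketch of
a naive Bayes combination with and without a phrase rule fed in as a
pseudo-token.  Everything in it is an assumption for illustration: the
token names, the probabilities, the RULE_CHEAP_PILLS name, and the
combining formula itself (plain naive Bayes with equal priors, which
is not necessarily what SA's Bayes engine does).

```python
import math

def combine(probs):
    """Combine per-token spam probabilities the naive Bayes way,
    assuming equal priors and statistically independent tokens."""
    log_odds = sum(math.log(p / (1.0 - p)) for p in probs)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical per-token spam probabilities learned from training.
token_probs = {"cheap": 0.80, "pills": 0.80}

# Hypothetical pseudo-token for a phrase rule that fires on "cheap pills".
# It carries no evidence beyond the two words above.
rule_probs = {"RULE_CHEAP_PILLS": 0.90}

words_only = combine(list(token_probs.values()))
words_plus_rule = combine(list(token_probs.values()) + list(rule_probs.values()))

print(f"words only:      {words_only:.3f}")       # 0.941
print(f"words plus rule: {words_plus_rule:.3f}")  # 0.993
```

The score goes up even though the rule only restates what the two
words already said, because the combination treats it as independent
evidence.  A pseudo-token for a network test result would not have
this problem, since nothing else in the token stream stands for it.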

(3) It introduces a bootstrapping problem, as has already been noted.
Everyone has to train the engine and re-train it when new rules are
developed.
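
As a sketch of why (3) bites: if token probabilities are smoothed
toward a neutral prior, a pseudo-token for a brand-new rule carries no
weight until it has been seen in training.  The function below assumes
a Robinson-style smoothed estimate computed from raw counts; the
function name, its parameters, and the counts are made up for
illustration and are not taken from SA.

```python
def token_spam_prob(spam_count, ham_count, strength=1.0, prior=0.5):
    """Smoothed spam probability for one token.

    With no observations the result is just the prior; as observations
    accumulate, the observed ratio takes over.
    """
    n = spam_count + ham_count
    raw = spam_count / n if n else 0.0
    return (strength * prior + n * raw) / (strength + n)

# A rule token added today: never seen in training, so it is neutral.
new_rule = token_spam_prob(0, 0)

# The same token after retraining on 45 spam and 5 ham hits.
trained_rule = token_spam_prob(45, 5)

print(f"untrained: {new_rule:.3f}")
print(f"trained:   {trained_rule:.3f}")
```

Until each site retrains, the new rule contributes nothing, whereas a
rule shipped with a score works on the day it is installed.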

I've thought of a few more, but they all have to do with the benefits
of having all those "knobs", and if you've already adopted the basic
premise that they should be removed, there doesn't seem to be any
reason to argue that part.

To summarize my opinion:  If what you want is to have a Bayesian-type
engine make all the decisions, then you should install a Bayesian
engine and work on ways to feed it the right tokens; you should not
install SpamAssassin and then work on ways to remove the scoring.
