On Jun 30, 2007, at 2:55 PM, Bart Schaefer wrote:


> On 6/29/07, Tom Allison <[EMAIL PROTECTED]> wrote:
>
>> The thought I had, and have been working on for a while, is changing
>> how the scoring is done.  Rather than making Bayes a part of the
>> scoring process, make the scoring process a part of the Bayes
>> statistical engine.  As an example, you would simply feed the
>> indications of scoring hits (binary yes/no) into the Bayesian process
>> as tokens, to be examined next to the other tokens in the message.

> There are a few problems with this.
>
> (1) It assumes that Bayesian (or similar) classification is more
> accurate than SA's scoring system.  Either that, or you're willing to
> give up accuracy in the name of removing all those confusing knobs you
> don't want to touch, but it would seem to me to be better to have the
> knobs and just not touch them.

I know that without SA you can get >99.9% accuracy with pure Bayesian classification. But there are specific non-Bayes signals, made visible through SpamAssassin rules, that a typical Bayes process can't catch (very well, or at all). The whole issue of "knobs" is moot under a statistical approach, because each user's training will determine the real importance of each particular rule hit.
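The idea of feeding rule hits in as tokens can be sketched with a plain naive Bayes counter. This is only an illustration, not SpamAssassin's actual API; the `RULE:` prefix and the rule name are hypothetical, and the point is just that the classifier learns per-user weights for rule hits the same way it does for words:

```python
from collections import defaultdict
import math

class NaiveBayes:
    """Minimal per-user Bayes engine; tokens may be words or rule hits."""

    def __init__(self):
        # Per-class counts of messages containing each token.
        self.counts = {"spam": defaultdict(int), "ham": defaultdict(int)}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, tokens, label):
        for t in set(tokens):          # presence, not frequency
            self.counts[label][t] += 1
        self.totals[label] += 1

    def spam_probability(self, tokens):
        # Log-odds with add-one smoothing; prior from message counts.
        log_odds = math.log((self.totals["spam"] + 1) / (self.totals["ham"] + 1))
        for t in set(tokens):
            p_s = (self.counts["spam"][t] + 1) / (self.totals["spam"] + 2)
            p_h = (self.counts["ham"][t] + 1) / (self.totals["ham"] + 2)
            log_odds += math.log(p_s / p_h)
        return 1 / (1 + math.exp(-log_odds))

def tokens_for(words, rule_hits):
    # Rule hits become binary yes/no pseudo-tokens next to the word tokens.
    return list(words) + ["RULE:" + r for r in rule_hits]

nb = NaiveBayes()
nb.train(tokens_for(["cheap", "pills"], ["URIBL_BLACK"]), "spam")
nb.train(tokens_for(["meeting", "agenda"], []), "ham")
p = nb.spam_probability(tokens_for(["cheap", "pills"], ["URIBL_BLACK"]))
```

Because the rule hit is just another token, a rule that fires constantly on one user's ham ends up with a near-neutral weight for that user, with no manual score tuning.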

> (2) For many SA rules you would be, in effect, double-counting some
> tokens.  An SA scoring rule that matches a phrase, for example, is
> effectively matching a collection of tokens that are also being fed
> individually to the Bayes engine.  In theory, you should not
> second-guess the system by passing such compound tokens to Bayes;
> instead it should be allowed to learn what combinations of tokens are
> meaningful when they appear together.

Bayes does not match phrases, only individual words. At least, that is what most Bayes filters do. There are some approaches that use multiple-word tokens, but not a true "phrase". Therefore I think the intersection between Bayes tokens and SpamAssassin rules is going to be small.
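The distinction can be made concrete. A sketch of the two common tokenizers (unigrams, which most filters use, and bigrams, which a few use) shows that neither reproduces a rule matching a whole fixed phrase:

```python
def unigrams(text):
    # What most Bayes filters do: split into single lowercase words.
    return text.lower().split()

def bigrams(text):
    # What a few multi-word approaches do: adjacent word pairs.
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

msg = "Act now to claim your free prize"
unigrams(msg)  # ['act', 'now', 'to', 'claim', 'your', 'free', 'prize']
bigrams(msg)   # ['act now', 'now to', 'to claim', ...]
```

A phrase-matching SA rule would fire on the entire string "act now to claim your free prize"; only a couple of the bigrams overlap with it, so the token-level intersection between the two systems stays small.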

> (It might be worthwhile, though, to e.g. add tokens that are not
> otherwise present in the message, such as for the results of network
> tests.)

This is what I'm interested in, and what I mentioned in paragraph one. There are a lot of things you can do with SpamAssassin that Bayes alone will never do. It is exactly this type of work that I think would be most interesting to pursue.
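A minimal sketch of that synthetic-token idea: results of network tests that the message body never exposes (DNSBL lookups, SPF checks, and so on) get injected as tokens alongside the words. The `NET:` prefix and the test names here are illustrative assumptions, not a real SpamAssassin interface:

```python
def synthetic_tokens(network_results):
    # network_results maps a network test name to whether it fired,
    # e.g. {"RCVD_IN_XBL": True, "SPF_FAIL": False} (hypothetical names).
    return ["NET:" + name for name, hit in network_results.items() if hit]

def all_tokens(body_words, network_results):
    # Feed the Bayes engine the body words plus the synthetic tokens.
    return list(body_words) + synthetic_tokens(network_results)

toks = all_tokens(["hello", "friend"],
                  {"RCVD_IN_XBL": True, "SPF_FAIL": False})
# toks now contains the word tokens plus "NET:RCVD_IN_XBL"
```

Since these tokens never occur naturally in message text, they cannot double-count anything, which sidesteps objection (2) above.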

> (3) It introduces a bootstrapping problem, as has already been noted.
> Everyone has to train the engine and re-train it when new rules are
> developed.
>
> I've thought of a few more, but they all have to do with the benefits
> of having all those "knobs", and if you've already adopted the basic
> premise that they should be removed, there doesn't seem to be any
> reason to argue that part.
>
> To summarize my opinion:  If what you want is to have a Bayesian-type
> engine make all the decisions, then you should install a Bayesian
> engine and work on ways to feed it the right tokens; you should not
> install SpamAssassin and then work on ways to remove the scoring.

That approach makes sense. However, it would not make sense to reinvent the enormous amount of useful work that has come out of SpamAssassin; that would take a very long time. SpamAssassin has some really good ways of finding the right tokens. Why would I try to duplicate all that effort?
