On 6/29/07, Tom Allison <[EMAIL PROTECTED]> wrote:
The thought I had, and have been working on for a while, is changing how the scoring is done. Rather than making Bayes a part of the scoring process, make the scoring process a part of the Bayes statistical engine. As an example, you would simply feed the indications of scoring hits (binary yes/no) into the Bayesian process as tokens, where they would be examined next to the other tokens in the message.
There are a few problems with this.

(1) It assumes that Bayesian (or similar) classification is more accurate than SA's scoring system. Either that, or you're willing to give up accuracy in the name of removing all those confusing knobs you don't want to touch. It seems to me it would be better to have the knobs and just not touch them.

(2) For many SA rules you would, in effect, be double-counting some tokens. An SA scoring rule that matches a phrase, for example, is effectively matching a collection of tokens that are also being fed individually to the Bayes engine. In theory, you should not second-guess the system by passing such compound tokens to Bayes; instead, it should be allowed to learn which combinations of tokens are meaningful when they appear together. (It might be worthwhile, though, to add tokens that are not otherwise present in the message, e.g. the results of network tests.)

(3) It introduces a bootstrapping problem, as has already been noted. Everyone has to train the engine, and re-train it whenever new rules are developed.

I've thought of a few more, but they all have to do with the benefits of having all those "knobs", and if you've already adopted the basic premise that they should be removed, there doesn't seem to be any reason to argue that part.

To summarize my opinion: if what you want is to have a Bayesian-type engine make all the decisions, then you should install a Bayesian engine and work on ways to feed it the right tokens; you should not install SpamAssassin and then work on ways to remove the scoring.
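
To make the double-counting point in (2) concrete, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the toy naive Bayes class, the four-message corpus, and the FREE_MONEY phrase rule are all invented, and none of it is SpamAssassin's actual Bayes code or rule set. It scores the same message twice, once from word tokens alone and once with the rule hit added as an extra token.

```python
import math
from collections import Counter

# Hypothetical phrase rule, invented for this sketch (not a real SpamAssassin rule).
PHRASE_RULES = {"FREE_MONEY": "free money"}


def word_tokens(text):
    """Plain word tokens, one per distinct word in the message."""
    return set(text.lower().split())


def rule_tokens(text, network_hits=()):
    """Binary rule hits expressed as extra tokens (present only when the rule fires)."""
    lowered = text.lower()
    hits = {f"RULE:{name}" for name, phrase in PHRASE_RULES.items() if phrase in lowered}
    # Network test results are not derivable from the body, so they are passed in.
    hits.update(f"RULE:{name}" for name in network_hits)
    return hits


class NaiveBayes:
    """Toy presence-only naive Bayes with add-one smoothing (an assumption, not SA's Bayes)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = Counter()

    def train(self, tokens, label):
        self.counts[label].update(tokens)
        self.docs[label] += 1

    def spam_log_odds(self, tokens):
        # Log prior odds plus one log likelihood ratio per token in the message.
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in tokens:
            p_spam = (self.counts["spam"][tok] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.docs["ham"] + 2)
            score += math.log(p_spam / p_ham)
        return score


def train(corpus, use_rules):
    """Train a classifier on the corpus, optionally adding rule hits as tokens."""
    nb = NaiveBayes()
    for text, label in corpus:
        tokens = word_tokens(text)
        if use_rules:
            tokens |= rule_tokens(text)
        nb.train(tokens, label)
    return nb


# Made-up training data, just large enough to show the effect.
CORPUS = [
    ("free money now", "spam"),
    ("claim your free money", "spam"),
    ("lunch money for the trip", "ham"),
    ("are you free now", "ham"),
]

message = "free money inside"
words_only = train(CORPUS, use_rules=False).spam_log_odds(word_tokens(message))
with_rules = train(CORPUS, use_rules=True).spam_log_odds(
    word_tokens(message) | rule_tokens(message)
)
print(f"words only: {words_only:.3f}   words + rule token: {with_rules:.3f}")
```

In this toy setup the second score comes out higher, because "free" and "money" are counted once as words and again through the phrase rule, and naive Bayes treats all three tokens as independent evidence. A network-test hit passed in via network_hits would not overlap with any word token, which is why that case is the exception noted in (2).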