At 03:01 PM 8/24/2004, Ryan Sorensen wrote:
When I first set this system up I had never even touched linux before, so I just kinda threw it together with whatever FAQ I could find. I know this is wrong now, but I did absolutely no manual bayes training - I let it auto learn everything. You can see that my spam count is way higher than ham (bottom).

***Questions***
Am I better off deleting my database and starting over?

Probably. Most of your tokens are likely to be heavily poisoned if you've been doing autolearn-only and it's misclassifying email.


(Note: not all autolearn-only bayes DBs go bad, but there's a definite risk of them getting off on the wrong foot and staying that way. The "no contradictions" autolearning rule makes the bayes database tend to stick to its existing ideas and not fork off in new directions when autolearning. If it starts off right, it will tend to stay that way; if it starts off wrong, it will also tend to stay wrong.)


Or should I just start doing some manual training to try to correct the database?

You can try, but you'll want to hand-train more email than it has already autolearned, so that the manual training floods out the problems.


Lastly, how do I get even spam and ham counts when autolearning and my incoming mail consists of 85% spam?

Don't try. It's a completely wrong-headed idea to try to get these to be even.

Bayes is a statistical system. Statistical systems work best with realistic input, not "even numbers" input.

If 85% of your email is spam, 85% of your training should be spam, or at least this should be what you view as a "perfect" training ratio. Of course you can be quite considerably off from this ideal and be successful, but it's clearly a step in the wrong direction to try to force your training to 50/50.
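
To put numbers on that, here is a rough Python sketch of the arithmetic. It is not part of SpamAssassin; the function names and the message counts are made up for illustration, and you would substitute your own spam and ham counts from the bayes database.

```python
def spam_fraction(nspam: int, nham: int) -> float:
    """Fraction of counted messages that are spam."""
    total = nspam + nham
    if total == 0:
        raise ValueError("no messages counted")
    return nspam / total


def ratio_gap(trained_spam: int, trained_ham: int,
              incoming_spam_fraction: float) -> float:
    """How far the training mix sits from the real mail mix.

    Positive means training is more spam-heavy than incoming mail,
    negative means it is more ham-heavy.
    """
    return spam_fraction(trained_spam, trained_ham) - incoming_spam_fraction


# Hypothetical counts: training that mirrors an 85% spam mail stream.
realistic = ratio_gap(8500, 1500, 0.85)

# Hypothetical counts: training forced to an even split.
forced_even = ratio_gap(5000, 5000, 0.85)

print(f"realistic training gap: {realistic:+.2f}")
print(f"forced 50/50 gap:       {forced_even:+.2f}")
```

With these made-up counts the realistic training has no gap at all, while the forced 50/50 training lands 35 percentage points away from the mail it will actually see. A small gap is nothing to worry about; the point is only that forcing the counts even moves you away from realism, not toward it.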

Rather than focusing on what your training ratio is, focus on trying to make your training as realistic as possible without excessive work. (This should actually be easy, as realism should happen naturally. You have to intentionally try to make things unrealistic by manually changing ratios, eliminating certain emails from the training, etc.)






