On Jun 30, 2007, at 6:29 PM, Loren Wilton wrote:
And after typing all this I'm thinking you might be right. But
part of this approach is to run all these rules in YES/NO fashion
and see if the probability is significant. For example: If I
tested for SOME_TEST=NO and found it was scoring a probability of
~0.500 then it's indisputable that you are right.
Well, this still doesn't make any real sense to me; it seems
equivalent to the attempts at bayes poison that spammers stick into
their spams: a bunch of words totally unrelated to the mail in the
hopes of outweighing the useful terms. Now their trick works as a
good spam indication because the words they pick aren't common to
my ham mails, so it is really a good spam indication rather than
poison. I'm not immediately convinced that will hold for the usage
you intend. Maybe. Maybe not.
However, if you want to do this, remember that bayes works on
tokens and has a tokenizer. So SOME_RULE=YES is probably either
two or three tokens, and you will end up scoring on the probability
of YES and NO, along with the frequency of the rule names, which
will be 1. So you probably want to do NO_SOME_RULE and
YES_OTHER_RULE or the like when you build the insert list. Again
though I'm not sure I see the point in the yes and no factors; the
presence or absence of a word in the mail seems like a pretty good
yes/no indication to me.
Were I doing it I'd try it both ways and see if there is any
difference in results.
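
To illustrate the tokenizer point, here is a minimal Python sketch. The
tokenizer is an assumption (it splits on anything that is not a letter,
digit, or underscore); it is not SpamAssassin's tokenizer, and a different
splitting rule could yield three tokens rather than two.

```python
import re

def tokenize(text):
    # Assumed tokenizer: split on any run of characters that are not
    # letters, digits, or underscores, and drop empty pieces.
    return [t for t in re.split(r"[^A-Za-z0-9_]+", text) if t]

# "=" is a separator here, so the rule name and the YES/NO flag
# become separate tokens that are scored independently.
print(tokenize("SOME_RULE=YES"))

# Folding the flag into the name keeps each rule result as one token.
print(tokenize("NO_SOME_RULE YES_OTHER_RULE"))
```

With this splitting rule the first call yields SOME_RULE and YES as two
tokens, while the second keeps NO_SOME_RULE and YES_OTHER_RULE whole.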
I agree with you that a binary token (e.g. SOME_RULE=YES vs.
SOME_RULE=NO) is probably not going to be much more effective than
the presence of the rule (SOME_RULE exists implies SOME_RULE=YES).
So the method:
$list = $status->get_names_of_tests_hit();
may cover everything that is required to evaluate this approach.
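
As a rough sketch of the presence-only approach, the comma-separated
string of rule names returned by that method can be turned into one token
per rule. This is Python rather than Perl, the function name and the
"SA_" prefix are hypothetical, and the sample string is made up for
illustration.

```python
def rule_tokens(hits, prefix="SA_"):
    # 'hits' is assumed to be a comma-separated string of rule names.
    # The prefix (a hypothetical choice) keeps rule tokens from
    # colliding with ordinary words from the message body.
    names = [n.strip() for n in hits.split(",")]
    return [prefix + n for n in names if n]

# Made-up sample of rules hit by one message.
sample = "HTML_MESSAGE,MIME_HTML_ONLY"
print(rule_tokens(sample))
```

A rule that did not fire simply contributes no token, which is the
presence/absence signal discussed above.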
Unfortunately I'm not using the SpamAssassin Bayes modules. I wrote my
own Bayes engine because I wanted to, and then thought about
including the rule results from SpamAssassin. I don't know where
this might be going, but it seems to be working extremely well for me,
based on a training set of just a couple hundred emails in total.
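
The check mentioned at the top of this message (does a token such as
SOME_TEST=NO score close to 0.500?) can be sketched as below. The formula
is one common per-token estimate and is an assumption here; it is not
necessarily what the engine described above or SpamAssassin computes, and
the counts are invented.

```python
def token_spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    # Fraction of spam and of ham messages that contain the token.
    s = spam_hits / spam_total
    h = ham_hits / ham_total
    if s + h == 0:
        # Token never seen: treat it as carrying no information.
        return 0.5
    return s / (s + h)

# A token seen equally often in spam and ham carries no signal.
print(token_spam_probability(50, 100, 50, 100))  # 0.5

# A token seen mostly in spam scores well above 0.5.
print(token_spam_probability(90, 100, 10, 100))
```

A token whose estimate stays near 0.5 across the training set adds
nothing, which is the case where the NO tokens could be dropped.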