Re: A different approach to scoring spamassassin hits

Tom Allison Sat, 30 Jun 2007 18:34:30 -0700


On Jun 30, 2007, at 6:29 PM, Loren Wilton wrote:

And after typing all this I'm thinking you might be right. But part of this approach is to run all these rules in YES/NO fashion and see if the probability is significant. For example: If I tested for SOME_TEST=NO and found it was scoring a probability of ~0.500 then it's indisputable that you are right.
Well, this still doesn't make any real sense to me; it seems equivalent to the attempts at bayes poison that spammers stick into their spams: a bunch of words totally unrelated to the mail in the hopes of outweighing the useful terms. Now their trick works as a good spam indication because the words they pick aren't common to my ham mails, so it is really a good spam indication rather than poison. I'm not immediately convinced that will hold for the usage you intend. Maybe. Maybe not.
However, if you want to do this, remember that bayes works on tokens and has a tokenizer. So SOME_RULE=YES is probably either two or three tokens, and you will end up scoring on the probability of YES and NO, along with the frequency of the rule names, which will be 1. So you probably want to do NO_SOME_RULE and YES_OTHER_RULE or the like when you build the insert list. Again though, I'm not sure I see the point in the yes and no factors; the presence or absence of a word in the mail seems like a pretty good yes/no indication to me.
Were I doing it I'd try it both ways and see if there is any difference in results.

I agree with you that it's probably not going to be very effective to use a binary token (e.g. SOME_RULE=YES vs SOME_RULE=NO) compared to the presence of the rule (SOME_RULE exists implies SOME_RULE=YES).
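
To make the two encodings concrete, here is a rough Python sketch. The tokenizer is a made-up stand-in (it is not the SpamAssassin Bayes tokenizer), and the rule names and helper functions are hypothetical; it only shows why SOME_RULE=YES tends to fall apart into separate tokens while a prefixed or presence-only name stays whole.

```python
import re

def naive_tokenize(text):
    # Hypothetical tokenizer: split on anything that is not a letter,
    # digit, or underscore. Real tokenizers differ in detail.
    return [t for t in re.split(r"[^A-Za-z0-9_]+", text) if t]

def presence_tokens(rules_hit):
    # Presence-only encoding: a rule name is a token only if it hit.
    return sorted(rules_hit)

def binary_tokens(all_rules, rules_hit):
    # YES/NO encoding with the answer folded into the token itself,
    # so the tokenizer cannot separate it from the rule name.
    return [("YES_" if r in rules_hit else "NO_") + r for r in sorted(all_rules)]

# "SOME_RULE=YES" splits at "=", leaving a bare "YES" token that is
# shared by every rule and carries no rule-specific information.
split_form = naive_tokenize("SOME_RULE=YES")

# The prefixed form survives tokenizing as one token.
joined_form = naive_tokenize("YES_SOME_RULE")
```

Note that the binary encoding emits one token per known rule for every message, while the presence encoding emits tokens only for rules that fired.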


So the method:
       my $list = $status->get_names_of_tests_hit();
may cover everything that is required to evaluate this approach.
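
As a rough illustration of the presence-only idea, here is a Python sketch that takes a comma-separated string of rule names (which is what I understand get_names_of_tests_hit to return) and treats each name as one token in a simple naive Bayes score. This is not my actual engine; the class name, the add-one smoothing, and the equal spam/ham prior are all assumptions made for the example.

```python
import math
from collections import Counter

def split_rule_names(names):
    # Turn "RULE_A,RULE_B" into ["RULE_A", "RULE_B"], ignoring blanks.
    return [n.strip() for n in names.split(",") if n.strip()]

class RuleBayes:
    """Hypothetical toy classifier: one token per rule that hit."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, names, is_spam):
        tokens = set(split_rule_names(names))
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def token_probability(self, token):
        # Add-one smoothing keeps the estimate strictly between 0 and 1.
        p_spam = (self.spam_counts[token] + 1) / (self.n_spam + 2)
        p_ham = (self.ham_counts[token] + 1) / (self.n_ham + 2)
        return p_spam / (p_spam + p_ham)

    def spam_probability(self, names):
        # Sum log-odds over the rules that hit (equal priors assumed).
        log_odds = 0.0
        for token in set(split_rule_names(names)):
            p = self.token_probability(token)
            log_odds += math.log(p) - math.log(1.0 - p)
        return 1.0 / (1.0 + math.exp(-log_odds))
```

A token whose probability sits near 0.5 adds almost nothing to the sum, which is the same check mentioned earlier for deciding whether a given token is worth keeping.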

Unfortunately I'm not using the SpamAssassin Bayes modules. I wrote my own Bayes engine because I wanted to, and then thought about including the rule results from SpamAssassin. I don't know where this might be going, but it seems to be working extremely well for me based on a training set of just a couple hundred emails in total.
