Hi Bill,

On Mon, 30 Aug 2004 20:38:28 -0400 (EDT), William Stearns wrote:
> Good evening, Mat,
> (For reference, I prefer to continue discussion on the public
> mailing list; if the topic is interesting enough to bring up, perhaps
> others will be interested to hear the ongoing discussion as well. Had
> you intended to remove the SA mailing list from CC?  If this was an
> accident, please add the list back in...)

My apologies for emailing you directly. I meant to send it to the list
in the first place, and I resent it there after I realised my mistake.

>> Yes, that would do what I was suggesting, but why not do that from
>> the start? Why assign scores to these tests at all? The reasoning
>> behind the question is that Bayes will be better suited to judge how
>> to score these tests based on an individual user's spam/ham. For
>> example, some tests blacklist yahoo.com; obviously a user with lots
>> of contacts who use yahoo.com will want their scoring to adapt to
>> give a lower score to this test, as classification by Bayes would do.
>> Does that make any sense? What do you think?
> I'll grant you that a well-tuned Bayes and AWL (auto-whitelist)
> will do a good job of handling a particular user's mail flow. There
> are some Bayes-only spam filters that are a testament to that fact.
> However, this misses one _critical_ fact in spam filtering, and one
> seemingly minor but critical adjective in my first sentence.
> The critical fact is this: every single spam identification
> technique can screw up. Whether it's false positives or false
> negatives, _all_ techniques can make mistakes in classifying spam.
> This is why SpamAssassin does so well - it's not limited to a few
> techniques, but rather uses a wide range. If one of them
> misclassifies mail, the others compensate, and we hopefully end up
> with a correct classification far more often than if we depended on
> just a few characteristics.
> The critical adjective is "well-tuned". A good Bayes setup
> (whether the Bayes that's part of SA or any other spam filter)
> requires a good deal of hand-classified mail to get going, and
> regular mistake training. While it'll do a better job for that
> particular user, it's much more hands-on than the hardcoded rules.
> If you really want to picture SA as Bayes plus some other stuff,
> that's just fine. You can consider the hard-coded phrase rules (and
> all the other rules) as nothing more than clues to get SA going
> while you're hand-training the Bayes. That mindset, though, ignores
> some other very valuable components of the overall score (and not
> just the ones to which I contribute ;-) .
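(A quick aside for anyone following along: as I understand it,
SpamAssassin combines rules by summing the score of every rule that
fires and comparing the total against a threshold - "required_hits"
in 2.x, renamed "required_score" in 3.0, with a default of 5.0. A
sketch of the relevant config lines; the rule names here are made up
for illustration:

    # /etc/mail/spamassassin/local.cf (or ~/.spamassassin/user_prefs)
    required_hits 5.0             # sum of fired rule scores must reach this
    score SOME_PHRASE_RULE  1.2   # hypothetical rules: each one that fires
    score SOME_RBL_RULE     2.5   # adds its score, so no single test
                                  # decides the verdict on its own

So one misfiring rule contributes at most its own score, and the
other rules can pull the total back in the right direction.)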

That sounds like good reasoning, and I certainly can't fault the fact
that SpamAssassin does a great job just as it is!

However, if I can play devil's advocate to your reasoning: I'm not
sure training Bayes is much more hands-on than any other part of
configuring SpamAssassin. Depending on the particular setup, training
Bayes can be as simple as clicking a button in your email client,
whereas fixing non-functional rules requires at least some working
knowledge of how SpamAssassin works.
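(Under the hood, that button click typically just invokes sa-learn on
the selected messages. Done by hand it looks something like this -
the mailbox paths are examples only:

    # train the Bayes database from hand-classified mail
    sa-learn --spam ~/Mail/spam      # learn these messages as spam
    sa-learn --ham  ~/Mail/ham       # learn these messages as ham
    sa-learn --forget ~/Mail/oops    # un-learn a misclassified message

Hardly more effort than editing a rules file.)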

The thing about hard-coded rules is that, as this list has
demonstrated, they either break over time, or their value changes
over time, or they suit one user well and another poorly (e.g. my
yahoo.com blacklisting example in another posting). Solely using
Bayes to produce the score doesn't mean these rules can't contribute
clues to the final outcome, since Bayes will itself pick up on the
results of these tests, but using Bayes will tailor them better to
the individual user - at least that's my devil's advocate reasoning!
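(To be fair, a user can already hand-tune this sort of thing in their
user_prefs - the rule name below is made up for illustration:

    # ~/.spamassassin/user_prefs
    score YAHOO_BLACKLIST_RULE 0.1   # hypothetical rule: cut its weight by hand
    whitelist_from *@yahoo.com       # or whitelist known contacts outright

But that's exactly the manual adjustment Bayes would make
automatically.)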

Kind regards, and my apologies again for the off-list email,
Mat
