On 08/12/03 09:07 AM, Bart Schaefer sat at the `puter and typed:
> On Tue, 12 Aug 2003, Louis LeBlanc wrote:
> 
> > On 08/12/03 01:12 PM, [EMAIL PROTECTED] sat at the `puter and typed:
> > > 
> > > Would it be wise to sa-learn that message as ham? 
> > 
> > Nope.
> 
> Eh?  Of course it would be wise to learn the message as ham.  The more
> data the classifier has, the more accurate it becomes, assuming that the
> input is correct -- that is, that you're not telling it spam is ham and
> vice-versa.  Ideally you'd train it on every message you receive.

Yeah, and the next time a *real* spammer sends him a carefully worded
ad for Vigorex, his bayes db will have learned it as ham.  That
particular message will almost certainly never pass through his system
again, so why use the content to train bayes?

> There's a strong tendency to make value judgements ("oh, this is ham, but
> it looks so spammy I'd better not feed it to sa-learn, it'll just confuse
> the poor thing").  It doesn't work like that.

Although I could be wrong, I respectfully disagree.

Unless I'm mistaken, bayes scores each token on its own: training
raises or lowers that token's tendency to indicate spam.  Bayes will
not learn from this message that it's OK to see "erectile dysfunction"
in a message so long as it comes from this sender AND is accompanied
by text referring to lower interest rates.  So you really do want to
play the numbers game sometimes.  Personally, I train bayes every
night on two of my inboxes and my spam folder.  If this message got
hit as a false positive, I'd probably put it into the inbox to be
trained as ham too, but I'd do it with full knowledge that I may see
an increase in penis pill ads slipping through.  On the other hand, I
might just read it and delete it, then whitelist the sender.  After
all, this is exactly the scenario that the whitelist feature was added
for.
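
To make the per-token point concrete, here is a toy sketch in Python.
It is NOT SpamAssassin's actual algorithm; the class name, the method
names and the simple ratio formula are all made up for illustration.
It only shows that learning one spammy-looking newsletter as ham drags
down the spam tendency of a token like "vigorex" for every sender.

```python
from collections import Counter


class TinyBayes:
    """Toy per-token classifier (hypothetical, not SpamAssassin's code)."""

    def __init__(self):
        self.spam_tokens = Counter()  # messages-per-token seen in spam
        self.ham_tokens = Counter()   # messages-per-token seen in ham
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, text, is_spam):
        # Count each token once per message, with no notion of sender
        # or of which other tokens it appeared alongside.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.n_spam += 1
        else:
            self.ham_tokens.update(tokens)
            self.n_ham += 1

    def token_spamminess(self, token):
        # Share of spam messages containing the token, relative to the
        # combined spam and ham shares (an assumed, simplified estimate).
        s = self.spam_tokens[token] / max(self.n_spam, 1)
        h = self.ham_tokens[token] / max(self.n_ham, 1)
        if s + h == 0:
            return 0.5  # unseen token: treat as neutral
        return s / (s + h)


db = TinyBayes()
for _ in range(10):
    db.learn("cheap vigorex pills", is_spam=True)
for _ in range(10):
    db.learn("meeting notes attached", is_spam=False)

before = db.token_spamminess("vigorex")
# The legitimate newsletter gets trained as ham.
db.learn("newsletter about vigorex and interest rates", is_spam=False)
after = db.token_spamminess("vigorex")
print(before, after)
```

In this toy model the token's spam tendency drops after the ham
training, and nothing in the database records that the drop came from
one trusted sender.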

Whitelisting the sender ensures that whatever this newsletter
contains, it will not be tagged as spam in the future.  The score
adjustment from the whitelist hit does not induce an autolearn (and if
I'm not mistaken, it will actually prevent autolearning - at least it
should); it just indicates that the message is not spam.
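
Roughly what I have in mind, again as a Python sketch rather than
anything taken from the SA source.  The threshold, the whitelist
offset, the autolearn cutoff and the function name are all assumptions
chosen for illustration, and leaving the whitelist offset out of the
autolearn decision is how I think it should behave, not something I
have verified.

```python
SPAM_THRESHOLD = 5.0        # assumed score at which mail is tagged as spam
WHITELIST_OFFSET = -100.0   # assumed adjustment for a whitelisted sender
AUTOLEARN_HAM_BELOW = 0.1   # assumed cutoff for autolearning as ham


def evaluate(rule_score, sender, whitelist):
    """Return tagging and autolearn decisions for one message (hypothetical)."""
    whitelisted = sender in whitelist
    total = rule_score + (WHITELIST_OFFSET if whitelisted else 0.0)
    return {
        "tagged_spam": total >= SPAM_THRESHOLD,
        # Decided on the content score alone, so a whitelist hit by
        # itself never causes the message to be learned as ham.
        "autolearn_ham": rule_score < AUTOLEARN_HAM_BELOW,
    }


whitelist = {"news@example.com"}
newsletter = evaluate(7.5, "news@example.com", whitelist)
stranger = evaluate(7.5, "other@example.com", whitelist)
print(newsletter, stranger)
```

Under these assumptions the spammy-looking newsletter is delivered
untagged, nothing is learned from it, and the same content from an
unknown sender is still tagged.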

And you HAVE to make value judgements.  Keep in mind that the bayes
classifier is a PROGRAM, and it has no real ability to make foolproof
judgements.  It makes a best guess based on the info it is fed, and no
matter how good the program gets, until we get true AI checking our
email for spam, garbage in == garbage out.  Never give your program
data that will decrease its accuracy; instead, make allowances for
exceptions, like the SA developers did when they added a whitelist
feature in the first place.

Lou
-- 
Louis LeBlanc               [EMAIL PROTECTED]
Fully Funded Hobbyist, KeySlapper Extrordinaire :)
http://www.keyslapper.org

Research is what I'm doing when I don't know what I'm doing.
    -- Wernher von Braun


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
