Re: About Training ( sa-learn )

Bowie Bailey Thu, 04 Mar 2010 09:41:33 -0800

(Please send replies to the list)

Henrique Fernandes wrote:
>
> On Thu, Mar 4, 2010 at 2:22 PM, Bowie Bailey <bowie_bai...@buc.com
> <mailto:bowie_bai...@buc.com>> wrote:
>
>     Henrique Fernandes wrote:
>     > No, what I want is that after I have trained on an email, the same
>     > email should get a higher score, because SpamAssassin was trained
>     > that it is spam. So when it comes in again, it should look in the
>     > database and add some extra points to the score, right?
>
>     That is a fairly common misconception.  When you learn an email as
>     spam, the Bayes system breaks it into tokens (words/character
>     strings) and then makes a note that each of those tokens was seen in
>     a spam.  When an email comes in, it breaks up the new email into
>     tokens and then checks to see how frequently each of those tokens
>     was previously seen in spam or ham.  Based on what it finds, it ranks
>     the email from BAYES_00 (very unlikely to be spam) to BAYES_99
>     (almost certainly spam).
>
>     Since learning from a single email only adds one data point to each
>     token, it is unlikely to make a major difference on its own.  The
>     value comes in learning from lots of spam and ham.  This is why the
>     Bayes rules will not run until you have learned from at least 200
>     ham and 200 spam.
>
>
> hmm
>
> Thanks. So each individual user has to have learned lots of emails, and
> only after that will they start to see a difference in the score?


Yes. Each individual user will need to learn at least 200 ham and 200
spam (manually or via auto-learn) before Bayes will start scoring.  The
more they learn, the better the accuracy.
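
To make the token counting described above concrete, here is a rough
Python sketch.  It is a toy model only, not SpamAssassin's actual
algorithm (the real Bayes code tokenizes and combines probabilities
differently).  The class name ToyBayes, its methods, and the sample
messages are all made up for illustration; only the 200/200 threshold
comes from this thread.

```python
import re
from collections import Counter

MIN_HAM = 200   # thresholds mentioned in the thread
MIN_SPAM = 200


class ToyBayes:
    """Toy per-token counter; NOT SpamAssassin's real implementation."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam mails containing it
        self.ham_tokens = Counter()   # token -> number of ham mails containing it
        self.nspam = 0
        self.nham = 0

    @staticmethod
    def tokenize(text):
        # Simplistic tokenizer (an assumption): lowercase words, each counted once per mail.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def learn(self, text, is_spam):
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.nspam += 1
        else:
            self.ham_tokens.update(tokens)
            self.nham += 1

    def ready(self):
        # Scoring stays off until enough ham AND spam have been learned.
        return self.nspam >= MIN_SPAM and self.nham >= MIN_HAM

    def token_spamminess(self, token):
        # Fraction of the token's (normalized) sightings that were in spam.
        s = self.spam_tokens[token] / max(self.nspam, 1)
        h = self.ham_tokens[token] / max(self.nham, 1)
        if s + h == 0:
            return None  # token never seen
        return s / (s + h)


db = ToyBayes()
for _ in range(200):
    db.learn("cheap viagra offer", is_spam=True)
    db.learn("quarterly report meeting", is_spam=False)

before = db.token_spamminess("report")
db.learn("report viagra", is_spam=True)   # one extra spam mentioning "report"
after = db.token_spamminess("report")
```

In this toy model, learning one more spam barely moves the value for
"report", because the single new data point is outweighed by the 200
ham messages that already contain the word.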

> So is it better to just train one database for all users instead of
> one database for each user?
>
> With just one database I am afraid of getting too many false positives,
> because sometimes Viagra is not spam for someone who researches it, but
> if it is in the same database, it will be marked as spam...

Depends on your users.  Unless they are wildly different, a single
database should work fairly well.  Individual databases can be more
accurate in some instances, but a single well-trained database will
probably work better than a bunch of individual databases that are not
trained consistently.
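
If you go with a single database, a minimal sketch of the local.cf
settings might look like the fragment below.  The path is only an
assumed example, and the right file mode depends on which user(s) your
setup runs SpamAssassin as, so check it against your own installation.

```
# Site-wide Bayes database shared by all users (path is an assumed example)
bayes_path /var/spamassassin/bayes/bayes
# Let every scanning user read and write the shared database files
bayes_file_mode 0777
```

The directory has to exist and be writable by whatever user does the
scanning and learning.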

-- 
Bowie
