Amedee> I have noticed that a lot of spam contains disclaimer-ish text.
Amedee> If I train spambayes with "disclaimed" ham, I fear this will
Amedee> "pollute" the sb database. The result might be that any email
Amedee> with a disclaimer-ish text will get a relatively high ham score.
Amedee> At the moment, I don't see a solution for this possible problem.
Amedee> I *could* not train on disclaimed ham, but if most of my
Amedee> correspondents have such boilerplates, training spambayes won't
Amedee> be very efficient.
That depends. Most common English words (most of the words in disclaimers
are probably pretty common) should probably score around 0.5 and thus not be
used in ranking messages, e.g.:

    spamcounts the only which that disclaimer property
    token,nspam,nham,spam prob
    the,3591,844,0.5
    only,782,267,0.5
    which,893,232,0.5
    that,2111,424,0.5
    disclaimer,2,1,0.352062362221
    property,184,50,0.5

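To make the "scores around 0.5 and is not used" idea concrete, here is a minimal Python sketch of how a per-token spam probability might be derived from nspam/nham counts and then filtered. The constant names echo SpamBayes options, but their values, the Robinson-style smoothing, and the message totals in the usage note are assumptions for illustration. They are not taken from the spamcounts output above.

```python
# Assumed values; the names echo SpamBayes options, the numbers are
# illustrative and may not match your configuration.
UNKNOWN_WORD_PROB = 0.5       # prior for a token never seen before
UNKNOWN_WORD_STRENGTH = 0.45  # weight given to that prior
MINIMUM_PROB_STRENGTH = 0.1   # tokens closer to 0.5 than this are ignored


def token_spamprob(nspam, nham, total_spam, total_ham):
    """Smoothed spam probability for one token.

    nspam/nham: trained spam/ham messages containing the token.
    total_spam/total_ham: trained messages of each kind (both > 0).
    """
    n = nspam + nham
    if n == 0:
        return UNKNOWN_WORD_PROB
    # Use ratios rather than raw counts so an unbalanced training set
    # does not bias every token toward the larger class.
    spam_ratio = nspam / total_spam
    ham_ratio = nham / total_ham
    raw = spam_ratio / (spam_ratio + ham_ratio)
    # Pull rarely seen tokens toward the prior.
    return (UNKNOWN_WORD_STRENGTH * UNKNOWN_WORD_PROB + n * raw) / (
        UNKNOWN_WORD_STRENGTH + n
    )


def is_used(prob):
    """True if the token is far enough from 0.5 to count as evidence."""
    return abs(prob - 0.5) >= MINIMUM_PROB_STRENGTH
```

With hypothetical totals of 4000 spam and 1000 ham, a word seen in 40 spam and 10 ham appears at the same rate in both classes, so it lands at 0.5 and `is_used` rejects it. A rarer word seen in 2 spam and 1 ham leans toward ham and is kept.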
After you subtract all the common words, it depends on what's left worth
using. The approach SpamBayes uses is purely probabilistic (is
"statistical" more accurate?). The score of any given message is based on the
"preponderance of evidence" contained in the non-trivial tokens the message
contains (or which SB synthesizes).
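
As a rough illustration of "preponderance of evidence", the sketch below combines per-token probabilities with a chi-squared scheme after dropping tokens near 0.5. Treat it as an assumption about how such combining can work, not as the SpamBayes source: the function names and the 0.1 cutoff are hypothetical, and real token probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def chi2q(x2, v):
    """Chance that a chi-squared variable with v degrees of freedom
    exceeds x2. Only valid for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs, min_strength=0.1):
    """Combine token spam probabilities into one message score in [0, 1]."""
    # Bland tokens (near 0.5) carry no evidence and are dropped first.
    clues = [p for p in probs if abs(p - 0.5) >= min_strength]
    if not clues:
        return 0.5
    n = len(clues)
    ln_spam = sum(math.log(1.0 - p) for p in clues)
    ln_ham = sum(math.log(p) for p in clues)
    spam_evidence = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)
    # Strong evidence on both sides cancels out to about 0.5.
    return (spam_evidence - ham_evidence + 1.0) / 2.0
```

Under these assumptions, padding a message with any number of 0.5 tokens leaves its score unchanged: `combine([0.99, 0.5, 0.5])` equals `combine([0.99])`. Boilerplate only matters if some of its tokens drift well away from 0.5.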
Skip