On Tue, 7 Apr 2009, Jeff Rice wrote:
I'm wondering about the best training strategy for the bayes engine.
Most bayes classifiers seem to recommend that spam/ham be fed in either
alternating or random order. SA seems to suggest that all of one type be
trained, and then all of the other type. In my experience with other
programs (CRM114, for example) this really hurts the accuracy.
What are your thoughts on this? I've been randomizing my spam/ham when
I train or retrain, but I don't have enough experience with SA to say if
this is beneficial, useless, or detrimental.
<knowitall>
I would say order of training is fairly meaningless, as SA's bayes needs a
minimum of 200 trained messages of each type (ham and spam) before it
starts scoring.
Train your 200 or more of each to get bayes started, then train FPs and
FNs as they happen.
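
As a rough illustration of the 200-of-each minimum, here is a minimal
Python sketch (not part of SA). The function names and the counts passed in
are hypothetical; the threshold is simply the figure quoted above, which I
believe matches the defaults of SA's bayes_min_ham_num and
bayes_min_spam_num settings.

```python
# Threshold taken from the 200-of-each figure quoted above (assumption:
# your site has not changed bayes_min_ham_num / bayes_min_spam_num).
MIN_PER_CLASS = 200


def bayes_ready(nham, nspam, minimum=MIN_PER_CLASS):
    """Return True once both classes have at least `minimum` trained messages."""
    return nham >= minimum and nspam >= minimum


def still_needed(nham, nspam, minimum=MIN_PER_CLASS):
    """Return how many more (ham, spam) messages must be trained."""
    return (max(0, minimum - nham), max(0, minimum - nspam))
```

For example, with 150 ham and 250 spam learned, still_needed(150, 250)
reports that 50 more ham are required and no more spam.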
Autolearning can be helpful in large userbases if you keep an eye on it -
it can magnify errors over time if you're not careful, and it's probably a
good idea to leave autolearn turned off during initial training and until
you get a feel for how things are being scored.
About all we recommend is keeping the ratio of ham:spam fairly balanced or
perhaps somewhat skewed towards learning more spam, as spam is (sadly) the
vast majority of most people's raw mail stream.
If you're manually training a large corpus, then the ham/spam order will
only matter during the time you've learned one and are working on learning
the other. That time window should be fairly short, and you should have
autolearn turned off while you're doing that. In fact, you might want to
temporarily disable bayes if you're going to be manually training a large
corpus.
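
A bulk manual training pass of that kind might be scripted roughly as
follows. This is only a sketch: it assumes sa-learn is on your PATH, the
directory names are hypothetical, and it does nothing about disabling bayes
or autolearn, which you would still handle in your SA configuration. It
learns all ham, then all spam, deferring the database sync to the end so
the window between the two is as short as possible.

```python
import subprocess


def training_commands(ham_dir, spam_dir):
    """Build the sa-learn invocations for one bulk training pass."""
    return [
        # --no-sync defers the journal sync until both classes are learned.
        ["sa-learn", "--no-sync", "--ham", ham_dir],
        ["sa-learn", "--no-sync", "--spam", spam_dir],
        ["sa-learn", "--sync"],
    ]


def run_training(ham_dir, spam_dir, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in training_commands(ham_dir, spam_dir):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling run_training("corpus/ham", "corpus/spam") only prints the commands;
pass dry_run=False once you are happy with them.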
</knowitall>
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79