On Tue, 7 Apr 2009, Jeff Rice wrote:

I'm wondering about the best training strategy for the bayes engine. Most bayes classifiers seem to recommend that spam/ham be fed in either alternating or random order. SA seems to suggest that all of one type be trained, and then all of the other type. In my experience with other programs (CRM114, for example) this really hurts the accuracy.

What are your thoughts on this? I've been randomizing my spam/ham when I train or retrain, but I don't have enough experience with SA to say if this is beneficial, useless, or detrimental.

<knowitall>

I would say the order of training is fairly meaningless, as SA needs a minimum of 200 ham and 200 spam before bayes starts scoring.

Train your 200 or more of each to get bayes started, then train FPs and FNs as they happen.

Autolearning can be helpful in large userbases if you keep an eye on it, but it can magnify errors over time if you're not careful. It's probably a good idea to leave autolearn turned off during initial training and until you get a feel for how things are being scored.

About all we recommend is keeping the ratio of ham:spam fairly balanced or perhaps somewhat skewed towards learning more spam, as spam is (sadly) the vast majority of most people's raw mail stream.
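
As a rough illustration of the two rules of thumb above (at least 200 of each, and a ham:spam ratio that isn't wildly lopsided), here is a minimal Python sketch. It is not part of SA; the function name and the returned fields are made up for this example, and the 200/200 thresholds are simply the minimums mentioned above. The counts themselves would have to come from your own database, for instance the nham and nspam figures that I believe `sa-learn --dump magic` reports.

```python
# Hypothetical helper, not part of SpamAssassin.
# The thresholds are the "200 of each" minimums discussed above.
MIN_HAM = 200
MIN_SPAM = 200


def bayes_training_status(nham, nspam, min_ham=MIN_HAM, min_spam=MIN_SPAM):
    """Summarize whether enough mail has been learned and how balanced it is."""
    # Bayes only starts scoring once BOTH minimums are met.
    active = nham >= min_ham and nspam >= min_spam
    # Spam learned per ham learned; undefined if no ham has been learned yet.
    spam_per_ham = nspam / nham if nham else None
    return {
        "active": active,
        "ham_needed": max(0, min_ham - nham),
        "spam_needed": max(0, min_spam - nspam),
        "spam_per_ham": spam_per_ham,
    }


if __name__ == "__main__":
    # Example: plenty of spam learned, but still short on ham.
    print(bayes_training_status(150, 400))
```

A spam_per_ham value a bit above 1 matches the "somewhat skewed towards spam" suggestion; what counts as too skewed is a judgment call that this sketch does not try to make.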

If you're manually training a large corpus, then the ham/spam order will only matter during the time you've learned one and are working on learning the other. That time window should be fairly short, and you should have autolearn turned off while you're doing that. In fact, you might want to temporarily disable bayes if you're going to be manually training a large corpus.

</knowitall>

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Vista "security improvements" consist of attempting to shift blame
  onto the user when things go wrong.
-----------------------------------------------------------------------
 6 days until Thomas Jefferson's 266th Birthday
