On Wed, Jan 4, 2012 at 4:59 PM, Lance Norskog <[email protected]> wrote:
> The last step before posting was to test it on SGD :) My results on
> ASF mails (two labels) is around 80%, but both failure boxes get about
> 20% of the messages. This seems more realistic.
>

Was this 80% on time-separated test data? Or the training data?

> There is another leakage/spam problem in the dev mails: build reports.
>

Why are these a problem? Too easy? They are emails sent to the group and
should be reasonable to classify unless they inflate the accuracy.

> The MailProcessor has positive regex rules to find header entries &
> subject lines. It does not do negative regex rules to reject a
> message- this is the right way to nuke (the first) build message.
>

Yes. They should be easy to nuke. But I am not sure why.

> Is it worthwhile to clamp the training data so that there are similar
> numbers of documents for each label? Or does Naive Bayes work well
> with a bell curve?
>

Shouldn't matter much for any of our classifiers. The only strong reason
to do this is to speed up training but this data set is pretty small.
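
To make two of the points above concrete (a negative rule that rejects build reports, and scoring on time-separated test data), here is a rough sketch in Python. It is not the MailProcessor code and uses no Mahout API. The message dicts, the BUILD_SUBJECT pattern, the cutoff date and both function names are assumptions made up for illustration, and the real build-report subjects on the list may look different.

```python
import re
from datetime import datetime

# Hypothetical negative rule: subjects that start with a build status.
# The actual build-report subject format is an assumption here.
BUILD_SUBJECT = re.compile(r"^\s*build (failed|fixed|successful|unstable)\b",
                           re.IGNORECASE)

def reject_build_reports(messages):
    """Drop every message whose subject matches the negative rule."""
    return [m for m in messages if not BUILD_SUBJECT.search(m["subject"])]

def time_separated_split(messages, cutoff):
    """Train on mail sent before `cutoff`, test on mail sent at or after it."""
    train = [m for m in messages if m["date"] < cutoff]
    test = [m for m in messages if m["date"] >= cutoff]
    return train, test

# Toy messages, invented for this sketch only.
messages = [
    {"subject": "Re: SGD results on ASF mail",
     "date": datetime(2011, 11, 2), "label": "dev"},
    {"subject": "Build failed in Jenkins: example-job #1",
     "date": datetime(2011, 11, 15), "label": "dev"},
    {"subject": "How do I run the classifier?",
     "date": datetime(2011, 12, 20), "label": "user"},
    {"subject": "Re: rebuild the index",
     "date": datetime(2012, 1, 3), "label": "user"},
]

cutoff = datetime(2011, 12, 1)
kept = reject_build_reports(messages)
train, test = time_separated_split(kept, cutoff)
```

The pattern is anchored at the start of the subject so that ordinary threads that merely mention a build (or a "rebuild") are kept. Splitting on the send date rather than at random means the accuracy is measured on mail the classifier could not have seen during training.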
