Re: [jira] [Updated] (MAHOUT-939) ASF Email SGD Examples don't produce good results

Ted Dunning Wed, 04 Jan 2012 18:30:47 -0800

On Wed, Jan 4, 2012 at 4:59 PM, Lance Norskog <[email protected]> wrote:


> The last step before posting was to test it on SGD :) My results on
> ASF mails (two labels) are around 80%, but both failure boxes get about
> 20% of the messages. This seems more realistic.
>

Was this 80% on time-separated test data?  Or the training data?
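
To make "time-separated" concrete, here is a minimal sketch of such a split: train on everything before a cutoff date and test only on later messages, so the test set cannot leak thread context back into training. The tuple layout, the `time_split` helper, and the sample data are assumptions for illustration, not what the Mahout example code does.

```python
from datetime import datetime

def time_split(messages, cutoff):
    """Split (timestamp, label, text) tuples into train/test around a cutoff.

    Everything strictly before the cutoff is training data; everything at or
    after it is held out for testing.
    """
    train = [m for m in messages if m[0] < cutoff]
    test = [m for m in messages if m[0] >= cutoff]
    return train, test

# Made-up sample messages, purely for illustration.
messages = [
    (datetime(2011, 10, 1), "mahout-dev", "first message"),
    (datetime(2011, 11, 15), "lucene-dev", "second message"),
    (datetime(2011, 12, 20), "mahout-dev", "third message"),
    (datetime(2012, 1, 3), "lucene-dev", "fourth message"),
]

cutoff = datetime(2011, 12, 1)
train, test = time_split(messages, cutoff)
```

An accuracy figure measured on `test` here is the number worth comparing against the 80%; a figure measured on `train` would be optimistic.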


> There is another leakage/spam problem in the dev mails: build reports.
>

Why are these a problem?  Too easy?

They are emails sent to the group and should be reasonable to classify
unless they inflate the accuracy.


> The MailProcessor has positive regex rules to find header entries &
> subject lines. It does not have negative regex rules to reject a
> message; this is the right way to nuke (the first) build message.
>

Yes.  They should be easy to nuke.  But I am not sure why we would want to.
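
For reference, a negative rule of the kind Lance describes could look roughly like the sketch below. The subject patterns and the `keep_message` helper are hypothetical; the real build-report subjects may differ, and this is not the MailProcessor API.

```python
import re

# Hypothetical reject patterns for automated build reports.
# The actual subject lines on the dev lists may look different.
REJECT_SUBJECT = re.compile(
    r"^\s*(Build (failed|fixed|successful)|Jenkins build)",
    re.IGNORECASE,
)

def keep_message(subject):
    """Return True if the subject does not match any reject pattern."""
    return REJECT_SUBJECT.search(subject) is None

subjects = [
    "Re: [jira] [Updated] (MAHOUT-939) ASF Email SGD Examples",
    "Build failed in Jenkins: Mahout-Quality #1234",
    "Jenkins build is back to normal : Mahout-Quality #1235",
]
kept = [s for s in subjects if keep_message(s)]
```

Running the classifier with and without this filter would show whether the build reports actually inflate the accuracy.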


> Is it worthwhile to clamp the training data so that there are similar
> numbers of documents for each label? Or does Naive Bayes work well
> with a bell curve?
>

Shouldn't matter much for any of our classifiers.

The only strong reason to do this is to speed up training, but this data set
is pretty small.
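
If someone does want to clamp the per-label counts, a minimal sketch is to randomly downsample any label that exceeds a cap. The `clamp_per_label` helper, the `(label, text)` tuple layout, and the fixed seed are assumptions for illustration only.

```python
import random
from collections import Counter, defaultdict

def clamp_per_label(docs, max_per_label, seed=42):
    """Downsample (label, text) pairs so no label exceeds max_per_label."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_label = defaultdict(list)
    for label, text in docs:
        by_label[label].append((label, text))
    clamped = []
    for items in by_label.values():
        if len(items) > max_per_label:
            items = rng.sample(items, max_per_label)
        clamped.extend(items)
    return clamped

# Made-up skewed data set: ten documents for one label, three for the other.
docs = [("mahout-dev", f"doc {i}") for i in range(10)]
docs += [("lucene-dev", f"doc {i}") for i in range(3)]

clamped = clamp_per_label(docs, max_per_label=3)
counts = Counter(label for label, _ in clamped)
```

Labels already under the cap are left untouched, so this only trims the large classes.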
