On Oct 11, 2011, at 1:47 AM, Robin Anil wrote:

> Could be due to the way normalization is done.

In what part of the process?

> How is CNB performing?

It's better, roughly 40% correct and 60% wrong, but still not good.

> Do share the confusion matrices and per-label precision.

Usually on the order of 5% correct, 95% wrong.
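
For reference, here's a minimal sketch (plain Java, not Mahout's API) of how per-label precision and recall fall out of a confusion matrix laid out as counts[actual][predicted]; the label names and counts below are just placeholders:

// Minimal sketch: per-label precision and recall from a confusion matrix,
// with rows indexed by the actual label and columns by the predicted label.
public class PerLabelStats {
  public static void main(String[] args) {
    String[] labels = {"labelA", "labelB", "labelC"};  // placeholder labels
    int[][] counts = {                                 // placeholder counts[actual][predicted]
        {50, 10, 40},
        { 5, 20, 75},
        { 2,  3, 95}
    };
    for (int p = 0; p < labels.length; p++) {
      int tp = counts[p][p];
      int predictedAsP = 0;  // column sum: everything the model called label p
      int actuallyP = 0;     // row sum: everything that truly is label p
      for (int a = 0; a < labels.length; a++) {
        predictedAsP += counts[a][p];
        actuallyP += counts[p][a];
      }
      double precision = predictedAsP == 0 ? 0.0 : (double) tp / predictedAsP;
      double recall = actuallyP == 0 ? 0.0 : (double) tp / actuallyP;
      System.out.printf("%s\tprecision=%.3f\trecall=%.3f%n", labels[p], precision, recall);
    }
  }
}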

If I bring --maxItemsPerLabel (PrepEmailVectorsDriver) down to about 1,000, I get
better results, but still not better than guessing.  Part of the problem is that
many of the mail archives have a ton of entries while a few have fewer than 1,000.
On the flip side, 1,000 items isn't really enough for training.  If I restrict the
input to mailing lists that have at least 10K items, I get much better results,
which is of course expected.  What I still don't understand is why the classifier
would be picking the labels with the least amount of data.
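
For what it's worth, the capping idea behind --maxItemsPerLabel amounts to something like the rough sketch below (made-up class and method names, not the actual PrepEmailVectorsDriver code): randomly sample at most N items for each label before vectorizing, so no single label swamps the training set.

import java.util.*;

// Rough sketch of capping training items per label; hypothetical code,
// not the PrepEmailVectorsDriver implementation.
public class CapPerLabel {
  public static <T> Map<String, List<T>> cap(Map<String, List<T>> byLabel, int max, long seed) {
    Random rnd = new Random(seed);
    Map<String, List<T>> capped = new HashMap<String, List<T>>();
    for (Map.Entry<String, List<T>> e : byLabel.entrySet()) {
      List<T> items = new ArrayList<T>(e.getValue());
      Collections.shuffle(items, rnd);  // random sample rather than the first N
      capped.put(e.getKey(), items.subList(0, Math.min(max, items.size())));
    }
    return capped;
  }

  public static void main(String[] args) {
    Map<String, List<String>> byLabel = new HashMap<String, List<String>>();
    byLabel.put("big-list", Arrays.asList("m1", "m2", "m3", "m4", "m5"));
    byLabel.put("small-list", Arrays.asList("m6"));
    System.out.println(cap(byLabel, 2, 42L));  // at most 2 items per label
  }
}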

> 
> On Mon, Oct 10, 2011 at 11:20 PM, Grant Ingersoll <gsing...@apache.org>wrote:
> 
>> I was trying the Naive Bayes classifier via the build-asf-email.sh file I
>> committed the other day on a data set that had a fairly significant
>> variation in the number of messages per training label and am noticing
>> (still need to validate more) that the label with the least number of
>> examples is often dominating the results.  This seems counterintuitive to
>> me.  I would have expected the largest set would have dominated the results.
>> If I even out the number of items per label, then I get reasonable results.
>> Any thoughts on what I am seeing?  If you are interested, I can share the
>> details of the runs.
>> 
>> -Grant
>> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
