Robin,

Any luck with this?
On Oct 11, 2011, at 7:22 AM, Robin Anil wrote:

> I am guessing this is on the new naivebayes package. I would like to check
> the data and compare against the old implementation if its a bug.
>
> On Tue, Oct 11, 2011 at 4:18 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>
>>
>> On Oct 11, 2011, at 1:47 AM, Robin Anil wrote:
>>
>>> Could be due to the way normalization is done.
>>
>> In what part of the process?
>>
>>> How is CNB performing?
>>
>> It's better, like 40% correct, 60% wrong, but still not good.
>>
>>> Do
>>> share the confusion matrices and per label precision.
>>
>> Usually on the order of 0.05 correct, 95% wrong.
>>
>> If I bring the --maxItemsPerLabel (PrepEmailVectorsDriver) down to about
>> 1000, then I get better results, but still not better than guessing. The
>> main issue is that many of the mail archives have a ton of entries, but then
>> a few only have less than 1000. On the flip side, 1000 is not really
>> enough training wise. If I restrict down the input to mailing lists that
>> have at least 10K items, then I get much better results. Of course, this is
>> expected. The main issue is I don't understand why it would be picking the
>> labels with the least amount of data.
>>
>>>
>>> On Mon, Oct 10, 2011 at 11:20 PM, Grant Ingersoll <gsing...@apache.org>
>>> wrote:
>>>
>>>> I was trying the Naive Bayes classifier via the build-asf-email.sh file I
>>>> committed the other day on a data set that had a fairly significant
>>>> variation in the number of messages per training label and am noticing
>>>> (still need to validate more) that the label with the least number of
>>>> examples is often dominating the results. This seems counterintuitive to
>>>> me. I would have expected the largest set would have dominated the results.
>>>> If I even out the number of items per label, than I get reasonable results.
>>>> Any thoughts on what I am seeing? If you are interested, I can share the
>>>> details of the runs.
>>>>
>>>> -Grant
>>>>
>>
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>
>>

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com