oddity?

Loek Cleophas Thu, 18 Feb 2010 02:45:53 -0800

Thank you Robin. The stack trace I got:

Exception in thread "main" java.lang.NullPointerException
        at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:100)
        at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:117)
        at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:122)
        at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:88)
        at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:63)
        at org.apache.mahout.classifier.bayes.TestClassifier.classifySequential(TestClassifier.java:289)
        at org.apache.mahout.classifier.bayes.TestClassifier.main(TestClassifier.java:204)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Command line was: bin/hadoop jar ~/Downloads/mahout-0.2/examples/target/mahout-examples-0.2.job org.apache.mahout.classifier.bayes.TestClassifier -m docs-klg-n3-wordLevel-complementary -d ~/Code/klg/indextrainingvalidation/docs-klg-mahout-validate -ng 3 -type cbayes -source hdfs -method sequential

It did read the model in correctly - and when I substitute a non-existing input directory for the one with the non-category-named .txt file, it indeed runs normally (classifying 0 instances).

I presume it should be easy to reproduce - if not, let me know and I can see whether I can give you our small test data set or some small subset of it that I can reproduce it with.


Regards,
Loek

On Feb 18, 2010, at 11:25, Robin Anil wrote:

I will look into this.
On Thu, Feb 18, 2010 at 3:42 PM, Loek Cleophas <[email protected]> wrote:
Hi
While playing around some more with the 20newsgroups example code for the
Bayes classifiers, I ran into an oddity and a presumable bug:
instead of using (parts of) the 20 newsgroups data set, which was split
nicely into one file per newsgroup, with the 'category, tab, tokens' line
format, I generated such a file out of our company data set. What I did
though was generate 1 file to train, and 1 to test with - so both files
could have different lines having different categories, e.g.

cars    Ferrari red ....
animals cow cat dog ....
In training, this works fine. In testing, it crashes TestClassifier with a
null pointer exception. I presume that is because either the filename does
not match category.txt for some category name, or because there's multiple
categories being used inside the single file - but I also presume that
neither should crash the thing :) It also brings up the question: if the
line format in the data files has the category in there, then why are the
file names relevant at all? Seems like redundancy to me. Shouldn't
TestClassifier merely take all .txt files in the input data directory and
process their contents?
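
To make the suggestion concrete, here is a rough sketch of what I mean by taking the category from each line rather than from the file name. This is not Mahout code: the LabeledLineParser class, its Instance type, and the parseLine/parseAll methods are names made up for illustration, and the sketch assumes each line is "category, tab, whitespace-separated tokens". Malformed lines are skipped instead of crashing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, NOT part of Mahout: reads the category label from
// each line instead of deriving it from the file name.
final class LabeledLineParser {

    /** One parsed instance: its category label and its tokens. */
    static final class Instance {
        final String label;
        final List<String> tokens;

        Instance(String label, List<String> tokens) {
            this.label = label;
            this.tokens = tokens;
        }
    }

    /** Parses one "category<TAB>tokens" line; returns null if malformed. */
    static Instance parseLine(String line) {
        int tab = line.indexOf('\t');
        if (tab <= 0) {
            return null; // no tab at all, or nothing before the tab
        }
        String label = line.substring(0, tab).trim();
        if (label.isEmpty()) {
            return null; // category was only whitespace
        }
        String rest = line.substring(tab + 1).trim();
        List<String> tokens = rest.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(rest.split("\\s+"));
        return new Instance(label, tokens);
    }

    /** Parses all lines, skipping malformed ones rather than failing. */
    static List<Instance> parseAll(List<String> lines) {
        List<Instance> result = new ArrayList<Instance>();
        for (String line : lines) {
            Instance instance = parseLine(line);
            if (instance != null) {
                result.add(instance);
            }
        }
        return result;
    }
}
```

With something along these lines, a single test file mixing "cars" and "animals" lines would yield instances of both categories, whatever the file is called.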

Regards,
Loek

Re: 20newsgroups example/TestClassifier code - bug/oddity?
