Yeah, it definitely shouldn't be. I will post a fix soon (I am at work right now). Meanwhile, you can look at the TestClassifier code and run the classifier programmatically. It's as easy as setting the params, instantiating a classifier context, and sending it files one by one.
Robin

On Thu, Feb 18, 2010 at 4:15 PM, Loek Cleophas <[email protected]> wrote:

> Thank you Robin. The stack trace I got:
>
> Exception in thread "main" java.lang.NullPointerException
>     at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:100)
>     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:117)
>     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:122)
>     at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:88)
>     at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:63)
>     at org.apache.mahout.classifier.bayes.TestClassifier.classifySequential(TestClassifier.java:289)
>     at org.apache.mahout.classifier.bayes.TestClassifier.main(TestClassifier.java:204)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> Command line was: bin/hadoop jar
> ~/Downloads/mahout-0.2/examples/target/mahout-examples-0.2.job
> org.apache.mahout.classifier.bayes.TestClassifier -m
> docs-klg-n3-wordLevel-complementary -d
> ~/Code/klg/indextrainingvalidation/docs-klg-mahout-validate -ng 3 -type
> cbayes -source hdfs -method sequential
>
> It did read the model in correctly - and when I substitute a non-existing
> input directory for the one with the non-category-named .txt file, it indeed
> runs normally (classifying 0 instances).
>
> I presume it should be easy to reproduce - if not, let me know and I can
> see whether I can give you our small test data set or some small subset of
> it that I can reproduce it with.
>
> Regards,
> Loek
>
> On Feb 18, 2010, at 11:25, Robin Anil wrote:
>
>> I will look into this.
>>
>> On Thu, Feb 18, 2010 at 3:42 PM, Loek Cleophas <[email protected]> wrote:
>>
>>> Hi
>>>
>>> While playing around some more with the 20newsgroups example code for the
>>> Bayes classifiers, I ran into an oddity and a presumable bug:
>>>
>>> instead of using (parts of) the 20 newsgroups data set, which was split
>>> nicely into one file per newsgroup, with the 'category, tab, tokens' line
>>> format, I generated such a file out of our company data set. What I did
>>> though was generate 1 file to train, and 1 to test with - so both files
>>> could have different lines having different categories, e.g.
>>>
>>> cars      Ferrari red ....
>>> animals   cow cat dog ....
>>>
>>> In training, this works fine. In testing, it crashes TestClassifier with a
>>> null pointer exception. I presume that is because either the file name does
>>> not match category.txt for some category name, or because there's multiple
>>> categories being used inside the single file - but I also presume that
>>> neither should crash the thing :) It also brings up the question: if the
>>> line format in the data files has the category in there, then why are the
>>> file names relevant at all? Seems like redundancy to me. Shouldn't
>>> TestClassifier merely take all .txt files in the input data directory and
>>> process their contents?
>>>
>>> Regards,
>>> Loek
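For reference, the stack trace ends in ConfusionMatrix.getCount, and one common way such a NullPointerException arises in Java is a label-to-index Map lookup that returns null for a label the matrix was never told about, followed by auto-unboxing of that null. Whether that is what happens in Mahout 0.2 is an assumption here, not something this thread confirms. The sketch below is a hypothetical stand-in class (LabelConfusionMatrix is not Mahout's ConfusionMatrix, and its method names are made up for illustration) showing the failure mode and one possible guard: counting unknown labels under a default label instead of crashing.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT Mahout's ConfusionMatrix.
final class LabelConfusionMatrix {
    private final Map<String, Integer> labelIndex = new LinkedHashMap<>();
    private final String defaultLabel;
    private final int[][] counts;

    LabelConfusionMatrix(Collection<String> labels, String defaultLabel) {
        for (String label : labels) {
            labelIndex.putIfAbsent(label, labelIndex.size());
        }
        // Reserve one extra row/column for labels that were never registered.
        labelIndex.putIfAbsent(defaultLabel, labelIndex.size());
        this.defaultLabel = defaultLabel;
        int n = labelIndex.size();
        this.counts = new int[n][n];
    }

    // An unguarded `int i = labelIndex.get(label);` throws a
    // NullPointerException when the label is missing, because the null
    // Integer is auto-unboxed. Here a missing label maps to the default.
    private int indexOf(String label) {
        Integer idx = labelIndex.get(label);
        return idx != null ? idx : labelIndex.get(defaultLabel);
    }

    void addInstance(String correctLabel, String classifiedLabel) {
        counts[indexOf(correctLabel)][indexOf(classifiedLabel)]++;
    }

    int getCount(String correctLabel, String classifiedLabel) {
        return counts[indexOf(correctLabel)][indexOf(classifiedLabel)];
    }
}
```

With a guard like this, a test file whose name (or whose per-line category) is not a known label would be counted under the default label rather than aborting the whole run. A different design would be to reject unknown labels with a descriptive exception; either way the caller gets something more useful than a bare NullPointerException.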
