Thank you Robin. The stack trace I got:
Exception in thread "main" java.lang.NullPointerException
    at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:100)
    at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:117)
    at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:122)
    at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:88)
    at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:63)
    at org.apache.mahout.classifier.bayes.TestClassifier.classifySequential(TestClassifier.java:289)
    at org.apache.mahout.classifier.bayes.TestClassifier.main(TestClassifier.java:204)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Command line was:
bin/hadoop jar ~/Downloads/mahout-0.2/examples/target/mahout-examples-0.2.job org.apache.mahout.classifier.bayes.TestClassifier -m docs-klg-n3-wordLevel-complementary -d ~/Code/klg/indextrainingvalidation/docs-klg-mahout-validate -ng 3 -type cbayes -source hdfs -method sequential
It did read the model in correctly, and when I substitute a non-existing input directory for the one containing the non-category-named .txt file, it runs normally (classifying 0 instances).
I presume it should be easy to reproduce. If not, let me know and I can see whether I can give you our small test data set, or some small subset of it that reproduces the problem.
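A guess at the failure mode, which I have not verified against the Mahout source: the trace ends in ConfusionMatrix.getCount, so a lookup for a label that was never registered may be returning null and then being dereferenced. The Java sketch below shows that idea with a lookup that returns 0 for unseen labels instead. The class SafeConfusionMatrix and its methods are hypothetical names made up for illustration; this is not Mahout's ConfusionMatrix.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch, NOT Mahout's ConfusionMatrix.
 * Counts (correct label, classified label) pairs and tolerates
 * labels that were never seen before.
 */
final class SafeConfusionMatrix {
    // Outer key: correct label. Inner key: label the classifier assigned.
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Returns the count for the pair, or 0 if either label is unknown. */
    int getCount(String correctLabel, String classifiedLabel) {
        Map<String, Integer> row = counts.get(correctLabel);
        if (row == null) {
            return 0; // unknown correct label: no NullPointerException
        }
        return row.getOrDefault(classifiedLabel, 0);
    }

    /** Records one classified instance, creating rows and cells on demand. */
    void addInstance(String correctLabel, String classifiedLabel) {
        counts.computeIfAbsent(correctLabel, k -> new HashMap<>())
              .merge(classifiedLabel, 1, Integer::sum);
    }
}
```

With something along these lines, a test file whose name does not match any category would still produce counts for the labels found on its lines.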
Regards,
Loek
On Feb 18, 2010, at 11:25, Robin Anil wrote:
I will look into this.
On Thu, Feb 18, 2010 at 3:42 PM, Loek Cleophas <[email protected]> wrote:
Hi
While playing around some more with the 20newsgroups example code for the Bayes classifiers, I ran into an oddity and a presumable bug. Instead of using (parts of) the 20 newsgroups data set, which was split nicely into one file per newsgroup with the 'category, tab, tokens' line format, I generated such a file out of our company data set. What I did, though, was generate one file to train with and one to test with, so different lines within each file can have different categories, e.g.
cars Ferrari red ....
animals cow cat dog ....
In training, this works fine. In testing, it crashes TestClassifier with a null pointer exception. I presume that is because either the file name does not match category.txt for some category name, or because multiple categories are used inside the single file, but I also presume that neither should crash the thing :)
It also brings up the question: if the line format in the data files has the category in there, then why are the file names relevant at all? Seems like redundancy to me. Shouldn't TestClassifier merely take all .txt files in the input data directory and process their contents?
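To make the suggestion concrete, here is a minimal Java sketch of reading 'category, tab, tokens' lines and grouping them by the category found on each line, without consulting the file name at all. The names LabeledLineReader and groupByCategory are hypothetical and chosen for illustration; they are not part of Mahout, and splitting the tokens on whitespace is my assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Mahout. */
final class LabeledLineReader {
    /**
     * Groups token lists by the category at the start of each line.
     * Expected line format: category, a tab, then whitespace-separated tokens.
     * Lines without a tab or with an empty category are skipped.
     */
    static Map<String, List<List<String>>> groupByCategory(List<String> lines) {
        Map<String, List<List<String>>> byCategory = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                continue; // no tab, or nothing before it: no usable category
            }
            String category = line.substring(0, tab);
            String rest = line.substring(tab + 1).trim();
            List<String> tokens = rest.isEmpty()
                    ? new ArrayList<>()
                    : Arrays.asList(rest.split("\\s+"));
            byCategory.computeIfAbsent(category, k -> new ArrayList<>()).add(tokens);
        }
        return byCategory;
    }
}
```

Because the category comes from the line itself, a single file can mix categories freely, which is the situation I have in my train and test files.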
Regards,
Loek