Hi,
I've recently started working with Mahout. At first, I tried the
trunk, which I got to compile (both from within Eclipse with a Maven
plugin and from the command line), but which apparently is in a state
of flux regarding building and running the examples (?).
I tried running the Twentynewsgroups classification example, after
copying the relevant Maven file to the examples directory, as
suggested on the mailing list some time ago. I could get the example's
data set from wikipedia, process it into input data on the
single-node/local HDFS, and train a model and write it to that HDFS.
However, the example class TestClassifier, which tests with the
trained model, didn't work for me in either MapReduce or sequential
mode. In the MapReduce case, even with quite high JVM maximum heap
sizes (I tried 2048), I get heap space out-of-memory errors / object
configuration errors. In the sequential case, I seemingly get 0 items
classified; see the output below. (Note that I reduced the data set to
just 8 instead of 20 newsgroups, thinking the data size might have
something to do with the problem.)
I also tried release 0.2, which I got to compile and for which I got
the example running more easily, but I still get the same errors when
testing with the trained model. Any ideas what might be going wrong,
or what I might be doing wrong?
Kind regards,
Loek Cleophas
Output of TestClassifier:
bin/hadoop jar ~/Downloads/mahout-0.2/examples/target/mahout-examples-0.2.job \
  org.apache.mahout.classifier.bayes.TestClassifier \
  -m 8newsmodel-0.2 -d 8newsInput -ng 3 -type bayes -source hdfs \
  -method sequential
<... reading all the feature weights ...>
10/01/13 10:22:08 INFO io.SequenceFileModelReader: Read 1950000 feature weights
10/01/13 10:22:11 INFO io.SequenceFileModelReader: hdfs://localhost:9000/user/loekcleophas/8newsmodel-0.2/trainer-weights/Sigma_k/part-00000
10/01/13 10:22:11 INFO io.SequenceFileModelReader: hdfs://localhost:9000/user/loekcleophas/8newsmodel-0.2/trainer-weights/Sigma_kSigma_j/part-00000
10/01/13 10:22:11 INFO io.SequenceFileModelReader: 420716.6056712613
10/01/13 10:22:11 INFO io.SequenceFileModelReader: hdfs://localhost:9000/user/loekcleophas/8newsmodel-0.2/trainer-thetaNormalizer/part-00000
10/01/13 10:22:11 INFO io.SequenceFileModelReader: hdfs://localhost:9000/user/loekcleophas/8newsmodel-0.2/trainer-tfIdf/trainer-tfIdf/part-00000
comp.windows.x -4443829.798557077 7727496.583973498 -0.5750671967650419
comp.graphics -3252365.124498224 7727496.583973498 -0.4208821174044246
soc.religion.christian -5106741.34456479 7727496.583973498 -0.6608532645819548
alt.atheism -3447983.6168798 7727496.583973498 -0.44619671835646907
misc.forsale -2276588.3662840202 7727496.583973498 -0.2946087832643716
comp.sys.mac.hardware -2445489.855812473 7727496.583973498 -0.31646598988918556
comp.os.ms-windows.misc -7727496.583973498 7727496.583973498 -1.0
comp.sys.ibm.pc.hardware -2687646.590023761 7727496.583973498 -0.3478030123750332
10/01/13 10:23:17 INFO bayes.TestClassifier:
nCalls = 0;
sumTime = 0.0s;
minTime = 0.0ms;
maxTime = 0.0ms;
meanTime = 0.0ms;
stdDevTime = 0.0ms;
10/01/13 10:23:18 INFO bayes.TestClassifier:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 0 ?%
Incorrectly Classified Instances : 0 ?%
Total Classified Instances : 0
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h <--Classified as
0 0 0 0 0 0 0 0 | 0 a = comp.windows.x
0 0 0 0 0 0 0 0 | 0 b = comp.graphics
0 0 0 0 0 0 0 0 | 0 c = soc.religion.christian
0 0 0 0 0 0 0 0 | 0 d = alt.atheism
0 0 0 0 0 0 0 0 | 0 e = misc.forsale
0 0 0 0 0 0 0 0 | 0 f = comp.sys.mac.hardware
0 0 0 0 0 0 0 0 | 0 g = comp.os.ms-windows.misc
0 0 0 0 0 0 0 0 | 0 h = comp.sys.ibm.pc.hardware
Default Category: unknown: 8