The model size is much smaller with unigrams. :-)
I'm not quite sure what constitutes "good" just yet, but I can report
the following. I used the same commands I reported earlier, except
that I am now using unigrams.
I have two categories: History and Science.
0. Splitter:
org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
--dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml
--outputDir /PATH/wikipedia/chunks -c 64
Then prep:
org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
--input PATH/wikipedia/test-chunks/
--output PATH/wikipedia/subjects/test
--categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
(also do this for the training set)
1. Train set:
ls ../chunks
chunk-0001.xml chunk-0005.xml chunk-0009.xml chunk-0013.xml
chunk-0017.xml chunk-0021.xml chunk-0025.xml chunk-0029.xml
chunk-0033.xml chunk-0037.xml
chunk-0002.xml chunk-0006.xml chunk-0010.xml chunk-0014.xml
chunk-0018.xml chunk-0022.xml chunk-0026.xml chunk-0030.xml
chunk-0034.xml chunk-0038.xml
chunk-0003.xml chunk-0007.xml chunk-0011.xml chunk-0015.xml
chunk-0019.xml chunk-0023.xml chunk-0027.xml chunk-0031.xml
chunk-0035.xml chunk-0039.xml
chunk-0004.xml chunk-0008.xml chunk-0012.xml chunk-0016.xml
chunk-0020.xml chunk-0024.xml chunk-0028.xml chunk-0032.xml
chunk-0036.xml
2. Test Set:
ls
chunk-0101.xml chunk-0103.xml chunk-0105.xml chunk-0108.xml
chunk-0130.xml chunk-0132.xml chunk-0134.xml chunk-0137.xml
chunk-0102.xml chunk-0104.xml chunk-0107.xml chunk-0109.xml
chunk-0131.xml chunk-0133.xml chunk-0135.xml chunk-0139.xml
3. Run the Trainer on the train set:
--input PATH/wikipedia/subjects/out
--output PATH/wikipedia/subjects/model
--gramSize 1 --classifierType bayes
4. Run the TestClassifier.
--model PATH/wikipedia/subjects/model
--testDir PATH/wikipedia/subjects/test
--gramSize 1 --classifierType bayes
Output is:
<snip>
9/07/22 15:55:09 INFO bayes.TestClassifier:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances      :   4143    74.0615%
Incorrectly Classified Instances    :   1451    25.9385%
Total Classified Instances          :   5594
=======================================================
Confusion Matrix
-------------------------------------------------------
a       b       <--Classified as
3910    186      |  4096    a = history
1265    233      |  1498    b = science
Default Category: unknown: 2
</snip>
At least it's better than 50%, which is presumably a good thing ;-) I
have no clue what the state of the art is these days, but it doesn't
seem _horrendous_ either.
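One way to put the 74% in context is to compare it with the share of the larger category, since the test set is not balanced. The sketch below is not part of the Mahout run. It only recomputes the summary from the four confusion-matrix counts above, assuming rows are the actual category and columns the predicted one (the row totals 4096 and 1498 are consistent with that reading). The names `confusion` and `summarize` are made up for illustration.

```python
# Counts copied from the confusion matrix above.
# Assumption: outer key = actual category, inner key = predicted category.
confusion = {
    "history": {"history": 3910, "science": 186},
    "science": {"history": 1265, "science": 233},
}


def summarize(confusion):
    """Return (accuracy, majority-class baseline, per-category recall)."""
    row_totals = {label: sum(row.values()) for label, row in confusion.items()}
    total = sum(row_totals.values())
    correct = sum(confusion[label][label] for label in confusion)
    accuracy = correct / total
    # Accuracy obtained by always predicting the most frequent actual category.
    baseline = max(row_totals.values()) / total
    recall = {
        label: confusion[label][label] / row_totals[label] for label in confusion
    }
    return accuracy, baseline, recall


accuracy, baseline, recall = summarize(confusion)
print(f"accuracy          : {accuracy:.4%}")
print(f"majority baseline : {baseline:.4%}")
for label, value in recall.items():
    print(f"recall({label}) : {value:.4%}")
```

On these counts the accuracy matches the 74.0615% reported by TestClassifier, the always-guess-history baseline comes out near 73.2%, and recall on science is about 15.6%, so most of the accuracy comes from the history category.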
I'd love to see someone validate what I have done. Let me know if you
need more details. I'd also like to know how I can improve it.
On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:
Indeed. I hadn't snapped to the fact you were using trigrams.
30 million features is quite plausible for that. To effectively use long
n-grams as features in classification of documents you really need to have
the following:
a) good statistical methods for resolving what is useful and what is not.
Everybody here knows that my preference for a first hack is sparsification
with log-likelihood ratios.
b) some kind of smoothing using smaller n-grams
c) some kind of smoothing over variants of n-grams.
AFAIK, mahout doesn't have many or any of these in place. You are likely to
do better with unigrams as a result.
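For readers who have not seen the log-likelihood ratio sparsification mentioned in (a), here is a minimal sketch of the idea, not Mahout code. It scores a term with the G^2 statistic on a 2x2 table of document counts (term present or absent, by category) and keeps only terms that score above a cut-off. The function names, the toy counts, and the threshold are all assumptions made up for illustration.

```python
import math


def _x_log_x(x):
    # Convention: 0 * log(0) is taken as 0.
    return 0.0 if x == 0 else x * math.log(x)


def _entropy(*counts):
    # Unnormalized entropy: N*log(N) - sum(k*log(k)).
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table of counts."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


# Made-up counts for illustration only:
# term -> (documents containing it in history, documents containing it in science)
N_HISTORY, N_SCIENCE = 1000, 1000
doc_counts = {"empire": (300, 20), "the": (990, 985), "electron": (5, 250)}
THRESHOLD = 10.0  # hypothetical cut-off; a real one would need tuning

kept = []
for term, (in_hist, in_sci) in doc_counts.items():
    score = llr(in_hist, in_sci, N_HISTORY - in_hist, N_SCIENCE - in_sci)
    if score >= THRESHOLD:
        kept.append(term)
print(kept)
```

A term that is spread evenly across both categories scores close to zero and is dropped, while a term concentrated in one category scores high and is kept. The same test applies to n-grams of any length, which is what makes it a candidate for pruning a trigram feature set.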
On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll
<[email protected]> wrote:
I suspect the explosion in the number of features, Ted, is due to the use
of n-grams producing a lot of unique terms. I can try w/ gramSize = 1, that
will likely reduce the feature set quite a bit.
--
Ted Dunning, CTO
DeepDyve