The model size is much smaller with unigrams.  :-)

I'm not quite sure what constitutes good just yet, but I can report the following using the commands I reported earlier, w/ the exception that I am using unigrams:

I have two categories:  History and Science

0. Splitter:
org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
--dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml --outputDir PATH/wikipedia/chunks -c 64

Then prep:
org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
--input PATH/wikipedia/test-chunks/ --output PATH/wikipedia/subjects/test --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
(also do this for the training set)

1. Train set:
ls ../chunks
chunk-0001.xml chunk-0005.xml chunk-0009.xml chunk-0013.xml chunk-0017.xml chunk-0021.xml chunk-0025.xml chunk-0029.xml chunk-0033.xml chunk-0037.xml
chunk-0002.xml chunk-0006.xml chunk-0010.xml chunk-0014.xml chunk-0018.xml chunk-0022.xml chunk-0026.xml chunk-0030.xml chunk-0034.xml chunk-0038.xml
chunk-0003.xml chunk-0007.xml chunk-0011.xml chunk-0015.xml chunk-0019.xml chunk-0023.xml chunk-0027.xml chunk-0031.xml chunk-0035.xml chunk-0039.xml
chunk-0004.xml chunk-0008.xml chunk-0012.xml chunk-0016.xml chunk-0020.xml chunk-0024.xml chunk-0028.xml chunk-0032.xml chunk-0036.xml

2. Test Set:
 ls
chunk-0101.xml chunk-0103.xml chunk-0105.xml chunk-0108.xml chunk-0130.xml chunk-0132.xml chunk-0134.xml chunk-0137.xml
chunk-0102.xml chunk-0104.xml chunk-0107.xml chunk-0109.xml chunk-0131.xml chunk-0133.xml chunk-0135.xml chunk-0139.xml

3. Run the Trainer on the train set:
--input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model --gramSize 1 --classifierType bayes

4. Run the TestClassifier.

--model PATH/wikipedia/subjects/model --testDir PATH/wikipedia/subjects/test --gramSize 1 --classifierType bayes
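For context before the output: the scoring behind classifierType "bayes" is essentially multinomial naive Bayes. Here's a minimal from-scratch sketch with Laplace smoothing (the training data is made up for illustration; Mahout's implementation additionally applies TF-IDF-style feature weighting, which this toy omits):

```python
import math
from collections import Counter, defaultdict

# Toy multinomial naive Bayes with Laplace smoothing. Labels and
# documents below are invented purely to show the mechanics.
train = [
    ("history", "war of the roses dynastic war".split()),
    ("history", "the roman empire fell".split()),
    ("science", "the atom and the electron".split()),
    ("science", "quantum theory of the atom".split()),
]

class_docs = Counter(label for label, _ in train)
word_counts = defaultdict(Counter)
for label, words in train:
    word_counts[label].update(words)
vocab = {w for _, words in train for w in words}

def classify(words):
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total = sum(word_counts[label].values())
        # log prior + Laplace-smoothed log likelihood of each word
        score = math.log(class_docs[label] / len(train))
        for w in words:
            score += math.log((word_counts[label][w] + 1) /
                              (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("the atom".split()))   # science
print(classify("roman war".split()))  # history
```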

Output is:

<snip>
09/07/22 15:55:09 INFO bayes.TestClassifier: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       4143       74.0615%
Incorrectly Classified Instances        :       1451       25.9385%
Total Classified Instances              :       5594

=======================================================
Confusion Matrix
-------------------------------------------------------
a       b       <--Classified as
3910    186      |  4096        a     = history
1265    233      |  1498        b     = science
Default Category: unknown: 2
</snip>

At least it's better than 50%, which is presumably a good thing ;-) I have no clue what the state of the art is these days, but it doesn't seem _horrendous_ either.
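One caveat worth checking: with these class sizes, always guessing "history" would already beat 50% by a lot, so the majority-class baseline is the fairer comparison. Re-deriving the numbers from the confusion matrix above (plain Python, just redoing the arithmetic):

```python
# Rows of the confusion matrix reported above.
history = {"history": 3910, "science": 186}   # true history docs
science = {"history": 1265, "science": 233}   # true science docs

correct = history["history"] + science["science"]
total = sum(history.values()) + sum(science.values())

accuracy = correct / total                          # 4143 / 5594
majority_baseline = sum(history.values()) / total   # always guess "history"
science_recall = science["science"] / sum(science.values())

print(f"accuracy          {accuracy:.4%}")           # 74.0615%
print(f"majority baseline {majority_baseline:.4%}")  # 73.2213%
print(f"science recall    {science_recall:.4%}")     # 15.5541%
```

So 74.06% is only about one point over the majority baseline, and almost all science documents are being labeled history.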

I'd love to see someone validate what I have done. Let me know if you need more details. I'd also like to know how I can improve it.

On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:

Indeed.  I hadn't snapped to the fact you were using trigrams.

30 million features is quite plausible for that. To effectively use long n-grams as features in classification of documents you really need to have
the following:

a) good statistical methods for resolving what is useful and what is not. Everybody here knows that my preference for a first hack is sparsification
with log-likelihood ratios.

b) some kind of smoothing using smaller n-grams

c) some kind of smoothing over variants of n-grams.
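Point (a) can be made concrete. The log-likelihood ratio (G^2) statistic for a 2x2 table of (feature present/absent) x (class A/B) counts looks like this — a from-scratch sketch, not Mahout code:

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table, e.g.
    rows = feature present/absent, columns = class A/B counts.
    Large values mean the feature is strongly associated with one
    class; values near zero mean it is uninformative."""
    def h(*counts):
        # sum of k * log(k / N), skipping empty cells
        n = sum(counts)
        return sum(k * math.log(k / n) for k in counts if k > 0)

    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)    # row totals
                - h(k11 + k21, k12 + k22))   # column totals

# A feature spread evenly across classes scores ~0 ...
print(llr_2x2(50, 50, 50, 50))
# ... while a strongly class-associated feature scores high.
print(llr_2x2(100, 1, 1, 100))
```

Sparsification then just means keeping only features whose score clears some threshold.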

AFAIK, Mahout doesn't have many or any of these in place. You are likely to
do better with unigrams as a result.
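Point (b) — smoothing with smaller n-grams — is commonly done by interpolating the long n-gram estimate with a shorter one, so an unseen trigram doesn't zero out a document's score. A toy sketch (the corpus and the mixing weight lam are made up):

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ate".split()

tri_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
uni_counts = Counter(tokens)
tri_total = sum(tri_counts.values())
uni_total = sum(uni_counts.values())

def blended_prob(trigram, lam=0.7):
    """Interpolate the trigram ML estimate with the unigram estimate
    of its last word; lam is an illustrative mixing weight."""
    p_tri = tri_counts[trigram] / tri_total   # 0 for unseen trigrams
    p_uni = uni_counts[trigram[-1]] / uni_total
    return lam * p_tri + (1 - lam) * p_uni

seen = ("the", "cat", "sat")     # occurs in the corpus
unseen = ("on", "the", "cat")    # never occurs, but "cat" does

print(blended_prob(seen))    # > 0, boosted by both estimates
print(blended_prob(unseen))  # still > 0 thanks to the unigram term
```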

On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <[email protected]> wrote:

I suspect the explosion in the number of features, Ted, is due to the use of n-grams producing a lot of unique terms. I can try w/ gramSize = 1, which
will likely reduce the feature set quite a bit.
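The explosion is easy to see in miniature: in natural text individual words repeat constantly but trigrams almost never do, so the trigram vocabulary outgrows the unigram vocabulary even on short passages. A quick count of distinct n-grams:

```python
tokens = ("to be or not to be that is the question whether tis nobler "
          "in the mind to suffer the slings and arrows").split()

def distinct_ngrams(tokens, n):
    # Set of all n-token windows over the sequence
    return set(zip(*(tokens[i:] for i in range(n))))

unigrams = distinct_ngrams(tokens, 1)
trigrams = distinct_ngrams(tokens, 3)

# Even in 22 tokens, every trigram is unique while words repeat:
print(len(unigrams), len(trigrams))  # 17 20
```

Scaled up to a Wikipedia dump, that gap is what produces tens of millions of trigram features.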




--
Ted Dunning, CTO
DeepDyve

