Re: Getting Started with Classification

Grant Ingersoll Wed, 22 Jul 2009 13:05:52 -0700

The model size is much smaller with unigrams.  :-)

I'm not quite sure what constitutes good just yet, but I can report the following using the commands I reported earlier, with the exception that I am using unigrams:


I have two categories:  History and Science

0. Splitter:
org.apache.mahout.classifier.bayes.WikipediaXmlSplitter

--dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml --outputDir /PATH/wikipedia/chunks -c 64


Then prep:
org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver

--input PATH/wikipedia/test-chunks/ --output PATH/wikipedia/subjects/test --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt

(also do this for the training set)

1. Train set:
ls ../chunks

chunk-0001.xml chunk-0005.xml chunk-0009.xml chunk-0013.xml chunk-0017.xml chunk-0021.xml chunk-0025.xml chunk-0029.xml chunk-0033.xml chunk-0037.xml
chunk-0002.xml chunk-0006.xml chunk-0010.xml chunk-0014.xml chunk-0018.xml chunk-0022.xml chunk-0026.xml chunk-0030.xml chunk-0034.xml chunk-0038.xml
chunk-0003.xml chunk-0007.xml chunk-0011.xml chunk-0015.xml chunk-0019.xml chunk-0023.xml chunk-0027.xml chunk-0031.xml chunk-0035.xml chunk-0039.xml
chunk-0004.xml chunk-0008.xml chunk-0012.xml chunk-0016.xml chunk-0020.xml chunk-0024.xml chunk-0028.xml chunk-0032.xml chunk-0036.xml


2. Test Set:
 ls

chunk-0101.xml chunk-0103.xml chunk-0105.xml chunk-0108.xml chunk-0130.xml chunk-0132.xml chunk-0134.xml chunk-0137.xml
chunk-0102.xml chunk-0104.xml chunk-0107.xml chunk-0109.xml chunk-0131.xml chunk-0133.xml chunk-0135.xml chunk-0139.xml


3. Run the Trainer on the train set:

--input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model --gramSize 1 --classifierType bayes


4. Run the TestClassifier.

--model PATH/wikipedia/subjects/model --testDir PATH/wikipedia/subjects/test --gramSize 1 --classifierType bayes


Output is:

<snip>

9/07/22 15:55:09 INFO bayes.TestClassifier:=======================================================

Summary
-------------------------------------------------------
Correctly Classified Instances          :       4143       74.0615%
Incorrectly Classified Instances        :       1451       25.9385%
Total Classified Instances              :       5594

=======================================================
Confusion Matrix
-------------------------------------------------------
a       b       <--Classified as
3910    186      |  4096        a     = history
1265    233      |  1498        b     = science
Default Category: unknown: 2
</snip>

At least it's better than 50%, which is presumably a good thing ;-) I have no clue what the state of the art is these days, but it doesn't seem _horrendous_ either.
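
For anyone who wants to check the numbers, here is a minimal Python sketch that recomputes accuracy, per-class precision and recall, and a majority-class baseline from the confusion matrix above. The counts are copied from the TestClassifier output; the function names are my own and are not part of Mahout, and the "Default Category: unknown" line is not accounted for.

```python
# Counts copied from the confusion matrix in the TestClassifier output above.
# Outer keys are the actual labels, inner keys are the predicted labels.
confusion = {
    "history": {"history": 3910, "science": 186},
    "science": {"history": 1265, "science": 233},
}


def accuracy(cm):
    """Fraction of all instances that landed on the diagonal."""
    correct = sum(cm[label][label] for label in cm)
    total = sum(sum(row.values()) for row in cm.values())
    return correct / total


def recall(cm, label):
    """Of the instances that really are `label`, the fraction classified as such."""
    return cm[label][label] / sum(cm[label].values())


def precision(cm, label):
    """Of the instances classified as `label`, the fraction that really are."""
    predicted = sum(row[label] for row in cm.values())
    return cm[label][label] / predicted


def majority_baseline(cm):
    """Accuracy of always guessing the most frequent actual label."""
    totals = [sum(row.values()) for row in cm.values()]
    return max(totals) / sum(totals)


if __name__ == "__main__":
    print(f"accuracy          : {accuracy(confusion):.4%}")
    print(f"majority baseline : {majority_baseline(confusion):.4%}")
    for label in confusion:
        print(f"{label:8s} precision={precision(confusion, label):.4f} "
              f"recall={recall(confusion, label):.4f}")
```

Read this way, always answering "history" would already score about 73% on this test set, and recall on science is only about 16%, so the more useful comparison is probably against that baseline and not against 50%.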

I'd love to see someone validate what I have done. Let me know if you need more details. I'd also like to know how I can improve it.


On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:

Indeed.  I hadn't snapped to the fact you were using trigrams.
30 million features is quite plausible for that. To effectively use long n-grams as features in classification of documents you really need to have the following:

a) good statistical methods for resolving what is useful and what is not. Everybody here knows that my preference for a first hack is sparsification with log-likelihood ratios.

b) some kind of smoothing using smaller n-grams

c) some kind of smoothing over variants of n-grams.

AFAIK, Mahout doesn't have many or any of these in place. You are likely to do better with unigrams as a result.
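
To make point (a) concrete, here is a minimal Python sketch of a log-likelihood ratio (G^2) score for a 2x2 contingency table, of the kind that could be used to decide which n-grams to keep. It is an illustration under assumptions and not a description of anything in the Mahout classifier: the function names, the example counts, and the threshold are all hypothetical.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of counts.

    k11: documents in the category that contain the feature
    k12: documents in the category that lack the feature
    k21: documents outside the category that contain the feature
    k22: documents outside the category that lack the feature
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def sparsify(feature_tables, threshold):
    """Keep only features whose score reaches the (hypothetical) threshold."""
    return {f for f, table in feature_tables.items() if llr(*table) >= threshold}


if __name__ == "__main__":
    # Hypothetical counts, made up purely for illustration.
    tables = {
        "evenly spread": (10, 10, 10, 10),  # independent of the category
        "category only": (10, 0, 0, 10),    # perfectly associated
    }
    for feature, table in tables.items():
        print(feature, llr(*table))
    print(sparsify(tables, threshold=10.0))
```

A feature that is spread evenly across categories scores zero and is dropped, while one concentrated in a single category scores high and is kept. Choosing the threshold is a tuning decision that this sketch does not address.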

On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <[email protected]> wrote:

I suspect the explosion in the number of features, Ted, is due to the use of n-grams producing a lot of unique terms. I can try w/ gramSize = 1, that will likely reduce the feature set quite a bit.

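
As a toy illustration of why larger gram sizes inflate the feature set, the following Python sketch counts distinct unigrams and trigrams in one short sentence. The sentence and the helper name are made up for illustration, and the sketch only counts n-grams of exactly length n; it makes no claim about how the Mahout trainer actually builds features for a given gramSize.

```python
def ngrams(tokens, n):
    """Return every contiguous run of exactly n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    # Hypothetical sample text, whitespace-tokenized.
    tokens = "the cat sat on the mat and the cat ate the rat".split()
    for n in (1, 3):
        distinct = set(ngrams(tokens, n))
        print(f"n={n}: {len(distinct)} distinct features")
```

Even in twelve tokens the repeated words collapse into fewer unigrams, while every trigram is distinct. On a large corpus the gap grows, because most long n-grams occur only a handful of times.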
--
Ted Dunning, CTO
DeepDyve
