Insights to Naive Bayes classifier example - 20news groups

Jakub Stransky Mon, 01 Dec 2014 08:11:34 -0800

Hello Mahout experts,

I am trying to follow some examples provided with Mahout and some features
are not clear to me. It would be great if someone could clarify a bit more.


To prepare the data (train and test), the following sequence of steps is
performed (taken from the Mahout cookbook):

All input is merged into a single directory:
*cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all*

The data is converted to Hadoop sequence files and then vectorized:
*./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors*
*-lnorm -nv -wt tfidf*

The vectors are divided into test and training data:
*./mahout split*
*-i ${WORK_DIR}/20news-vectors/tfidf-vectors*
*--trainingOutput ${WORK_DIR}/20news-train-vectors*
*--testOutput ${WORK_DIR}/20news-test-vectors*
*--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential*
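
As far as I understand it, --randomSelectionPct 40 holds out roughly 40% of
the vectors as test data and keeps the rest for training. A minimal Python
sketch of that idea follows. It is not Mahout's actual implementation, and
split_keys and the doc-N keys are made up for illustration:

```python
import random

def split_keys(keys, test_pct=40, seed=42):
    """Randomly hold out about test_pct percent of the keys as test data."""
    rng = random.Random(seed)
    train, test = [], []
    for key in keys:
        # Each document is assigned independently of the others,
        # so the test share is only approximately test_pct.
        if rng.random() * 100 < test_pct:
            test.append(key)
        else:
            train.append(key)
    return train, test

# Hypothetical document keys; the real ones come from the sequence files.
keys = [f"doc-{i}" for i in range(1000)]
train, test = split_keys(keys)
print(len(train), len(test))
```

The point is only that the split is random per document, so both sets should
contain documents from every newsgroup.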

The model is trained:
*./mahout trainnb*
*-i ${WORK_DIR}/20news-train-vectors -el*
*-o ${WORK_DIR}/model*
*-li ${WORK_DIR}/labelindex*
*-ow*


What I am missing here, and what my question is about: where is the
category assigned to the data used to train and test the classifier? I
would expect some vector or field saying that a given document belongs to
a particular category. It seems to me that this was erased in the first
step, where we merged all the data to create our corpus. I would still
expect this information to be retained somewhere. Instead, the messages
look as follows:

From: y...@a.cs.okstate.edu (YEO YEK CHONG)
Subject: Re: Is "Kermit" available for Windows 3.0/3.1?
Organization: Oklahoma State University
Lines: 7

>From article <a4fm3b1w1...@vicuna.ocunix.on.ca>, by Steve Frampton <
framp...@vicuna.ocunix.on.ca>:
> I was wondering, is the "Kermit" package (the actual package, not a

Yes!  In the usual ftp sites.

Yek CHong


There is no indication of which group this text belongs to. What's the hack!

Could someone please clarify what is going on? When cross-validation is
performed, the confusion matrix does take those categories into account.
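
To make the question concrete: a confusion matrix needs the actual category
of every test document, so the label has to survive somewhere. One guess,
which I have not verified against this run, is that it is kept in the
sequence-file key, for example as /comp.graphics/38464. The Python sketch
below assumes exactly that key layout; label_from_key, confusion_matrix, the
keys and the predictions are all hypothetical:

```python
from collections import Counter

def label_from_key(key):
    """Return the category, assuming keys look like '/<category>/<doc id>'."""
    return key.strip("/").split("/")[0]

def confusion_matrix(keys, predicted):
    """Count (actual, predicted) pairs; actual labels are read from the keys."""
    return Counter((label_from_key(k), p) for k, p in zip(keys, predicted))

# Hypothetical keys and classifier output, for illustration only.
keys = ["/comp.graphics/38464", "/comp.graphics/38465", "/sci.space/60151"]
predicted = ["comp.graphics", "sci.space", "sci.space"]

matrix = confusion_matrix(keys, predicted)
for (actual, guess), count in sorted(matrix.items()):
    print(actual, guess, count)
```

If the keys do not carry the category, then something like this cannot work,
and I do not see where else the label would come from.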

Thanks a lot for helping me out
Jakub
