Hi Jakub,

The step that you are missing is `$mahout seqdir ...`. In this step, each file in each directory (where the directory name is the category) is converted into a sequence-file record of the form <Text, Text>, where the Text key is /Category/doc_id and the Text value is the document text.
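To make the key format concrete, here is a minimal Python sketch (an illustration only, not Mahout's Java code) of how a file's location under the input directory can map to a /Category/doc_id key, and how the category can be read back from such a key. The function names `key_for` and `label_from_key` and the document id 9141 are hypothetical, chosen just for this example.

```python
from pathlib import PurePosixPath


def key_for(input_dir: str, file_path: str) -> str:
    """Build a '/Category/doc_id' style key from a path relative to input_dir."""
    rel = PurePosixPath(file_path).relative_to(PurePosixPath(input_dir))
    return "/" + rel.as_posix()


def label_from_key(key: str) -> str:
    """Return the first path component after the leading slash (the category)."""
    return key.split("/")[1]


# Hypothetical document: the category directory is kept in the key.
key = key_for("20news-all", "20news-all/comp.os.ms-windows.misc/9141")
print(key)                  # /comp.os.ms-windows.misc/9141
print(label_from_key(key))  # comp.os.ms-windows.misc
```

The point is that the category never needs to be stored inside the document text or the vector: it travels with the record as part of the key.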
`$mahout seq2sparse ...` vectorizes the output of `$mahout seqdir ...` into a sequence file of the form <Text, VectorWritable>, leaving the keys unchanged. `$mahout trainnb ... -el ...` then extracts the label from the keys of the training data, i.e. the "Category" from /Category/doc_id.

Please see http://mahout.apache.org/users/classification/twenty-newsgroups.html and http://mahout.apache.org/users/classification/bayesian.html for more information.

> Date: Mon, 1 Dec 2014 17:09:55 +0100
> Subject: Insights to Naive Bayes classifier example - 20news groups
> From: stransky...@gmail.com
> To: user@mahout.apache.org
>
> Hello Mahout experts,
>
> I am trying to follow some examples provided with Mahout and some features
> are not clear to me. It would be great if someone could clarify a bit more.
>
> To prepare the data (train and test) the following sequence of steps is
> performed (taken from mahout cookbook):
>
> All input is merged into a single dir:
> cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all
>
> Converted to hadoop sequence file and then vectorized:
> ./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors
> -lnorm -nv -wt tfidf
>
> Divided into test and train data:
> ./mahout split
> -i ${WORK_DIR}/20news-vectors/tfidf-vectors
> --trainingOutput ${WORK_DIR}/20news-train-vectors
> --testOutput ${WORK_DIR}/20news-test-vectors
> --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
>
> Model is trained:
> ./mahout trainnb
> -i ${WORK_DIR}/20news-train-vectors -el
> -o ${WORK_DIR}/model
> -li ${WORK_DIR}/labelindex
> -ow
>
>
> What I am missing here, and the subject of my question, is: Where is the
> category assigned to the testing data to train the categorization? What I
> would expect is that there will be a vector which says that this document
> belongs to a particular category. This seems to me to have been erased by
> the first step, where we mixed all the data to create our corpus. I would still
> expect that this information will be retained somewhere. Instead the
> messages look as follows:
>
> From: y...@a.cs.okstate.edu (YEO YEK CHONG)
> Subject: Re: Is "Kermit" available for Windows 3.0/3.1?
> Organization: Oklahoma State University
> Lines: 7
>
> From article <a4fm3b1w1...@vicuna.ocunix.on.ca>, by Steve Frampton <
> framp...@vicuna.ocunix.on.ca>:
> > I was wondering, is the "Kermit" package (the actual package, not a
>
> Yes! In the usual ftp sites.
>
> Yek CHong
>
>
> There is no indication of which group this text belongs to. What the heck!
>
> Could someone please clarify a bit what's going on, as when cross-validation
> is performed the confusion matrix takes those categories into consideration.
>
> Thanks a lot for helping me out
> Jakub