Hi Jakub,

The step you are missing is `$mahout seqdir ...`. In this step each file
in each directory (where the directory name is the Category) becomes one
<Text,Text> record in a sequence file, where the Text key is /Category/doc_id.
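
As a rough illustration of that key layout (a sketch only, not Mahout's own
code), the shell function below walks a directory tree laid out like
20news-all and prints the key each file would get under the scheme described
above. The function name `list_keys` is made up for this example, and it
assumes documents sit exactly one level below their category directory.

```shell
#!/bin/sh
# Hypothetical helper: print the /Category/doc_id key for every file that
# sits directly inside a category directory under the given root.
list_keys() {
    root=${1%/}                      # tolerate one trailing slash on the root
    for path in "$root"/*/*; do
        [ -f "$path" ] || continue   # skip sub-directories and unmatched globs
        # Strip the root prefix so only /Category/doc_id remains.
        printf '%s\n' "${path#"$root"}"
    done
}
```

Calling `list_keys "${WORK_DIR}/20news-all"` would list one key per document,
with the newsgroup name as the first path component.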

`$mahout seq2sparse ...` vectorizes the output of `$mahout seqdir ...` into a 
sequence file of form <Text, VectorWritable> leaving the Keys unchanged.  

`$mahout trainnb ... -el ...` then extracts the label from the Keys of the
training data, i.e. the "Category" from /Category/doc_id.
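
To make the label extraction concrete, here is a minimal shell sketch of
pulling the category out of such a key. It only mirrors the behaviour
described above and is not taken from Mahout's source; the function name
`label_from_key` and the sample key are invented for the example.

```shell
#!/bin/sh
# Hypothetical helper: return the first path component of a
# /Category/doc_id key, which is the label used for training.
label_from_key() {
    key=${1#/}                   # drop the leading slash
    printf '%s\n' "${key%%/*}"   # keep everything before the next slash
}

label_from_key /alt.atheism/53261   # prints "alt.atheism"
```

Because the label travels inside the key, merging all documents into one
corpus does not lose the category as long as the per-category directories are
kept.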

Please see http://mahout.apache.org/users/classification/twenty-newsgroups.html
and http://mahout.apache.org/users/classification/bayesian.html
for more information.

> Date: Mon, 1 Dec 2014 17:09:55 +0100
> Subject: Insights to Naive Bayes classifier example - 20news groups
> From: stransky...@gmail.com
> To: user@mahout.apache.org
> 
> Hello Mahout experts,
> 
> I am trying to follow some examples provided with Mahout and some features
> are not clear to me. It would be great if someone could clarify a bit more.
> 
> To prepare the data (train and test), the following sequence of steps is
> performed (taken from the Mahout cookbook):
> 
> All input is merged into single dir:
> *cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all*
> 
> Converted to hadoop sequence file and then vectorized:
> *./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors
> -lnorm -nv -wt tfidf*
> 
> Divided into test and train data:
> *./mahout split*
> *-i ${WORK_DIR}/20news-vectors/tfidf-vectors*
> *--trainingOutput ${WORK_DIR}/20news-train-vectors*
> *--testOutput ${WORK_DIR}/20news-test-vectors*
> *--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential*
> 
> Model is trained:
> *./mahout trainnb*
> *-i ${WORK_DIR}/20news-train-vectors -el*
> *-o ${WORK_DIR}/model*
> *-li ${WORK_DIR}/labelindex*
> *-ow*
> 
> 
> What I am missing here and that is subject of my question is: Where is the
> category assigned to the testing data to train the categorization? What I
> would expect is that there will be vector which says that this document
> belongs to a particular category. This seems to me to have been erased by
> the first step, where we mixed all the data to create our corpus. I would
> still expect that this information is retained somewhere. Instead, the
> messages look as follows:
> 
> From: y...@a.cs.okstate.edu (YEO YEK CHONG)
> Subject: Re: Is "Kermit" available for Windows 3.0/3.1?
> Organization: Oklahoma State University
> Lines: 7
> 
> From article <a4fm3b1w1...@vicuna.ocunix.on.ca>, by Steve Frampton <
> framp...@vicuna.ocunix.on.ca>:
> > I was wondering, is the "Kermit" package (the actual package, not a
> 
> Yes!  In the usual ftp sites.
> 
> Yek CHong
> 
> 
> There is no indication of which group this text belongs to. What the heck!
> 
> Could someone please clarify a bit what's going on, since when
> cross-validation is performed the confusion matrix takes those categories
> into consideration.
> 
> Thanks a lot for helping me out
> Jakub
                                          
