Hello Mahout experts,

I am trying to follow some of the examples provided with Mahout, and some features are not clear to me. It would be great if someone could clarify a bit more.
To prepare the data (train and test), the following sequence of steps is performed (taken from the Mahout cookbook).

All input is merged into a single directory:

    cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all

It is converted to a Hadoop sequence file and then vectorized:

    ./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf

It is split into test and training data:

    ./mahout split \
        -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
        --trainingOutput ${WORK_DIR}/20news-train-vectors \
        --testOutput ${WORK_DIR}/20news-test-vectors \
        --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

The model is trained:

    ./mahout trainnb \
        -i ${WORK_DIR}/20news-train-vectors -el \
        -o ${WORK_DIR}/model \
        -li ${WORK_DIR}/labelindex \
        -ow

What I am missing here, and what my question is about: where is the category assigned to the documents used to train the classifier? I would expect a vector somewhere that says which category each document belongs to. It seems to me that this information was erased in the first step, where we mixed all the data together to create our corpus. I would still expect it to be retained somewhere. Instead, the messages look as follows:

    From: y...@a.cs.okstate.edu (YEO YEK CHONG)
    Subject: Re: Is "Kermit" available for Windows 3.0/3.1?
    Organization: Oklahoma State University
    Lines: 7

    >From article <a4fm3b1w1...@vicuna.ocunix.on.ca>, by Steve Frampton <framp...@vicuna.ocunix.on.ca>:
    > I was wondering, is the "Kermit" package (the actual package, not a

    Yes! In the usual ftp sites.

    Yek CHong

There is no indication of which group this text belongs to. What the heck! Could someone please clarify what is going on? When cross-validation is performed, the confusion matrix does take those categories into account, so the labels must come from somewhere.

Thanks a lot for helping me out,
Jakub