Hi Jakub,

To label the training data for Bayesian classification in Mahout, you simply place your text training files into folders whose names are the desired labels. For example, in the case of the 20 Newsgroups data set, you can place your text files into the following folders:

[hadoop@localhost 20news-all]$ ls
alt.atheism               comp.sys.ibm.pc.hardware  misc.forsale     rec.sport.baseball  sci.electronics  soc.religion.christian  talk.politics.misc
comp.graphics             comp.sys.mac.hardware     rec.autos        rec.sport.hockey    sci.med          talk.politics.guns      talk.religion.misc
comp.os.ms-windows.misc   comp.windows.x            rec.motorcycles  sci.crypt           sci.space        talk.politics.mideast
[hadoop@localhost 20news-all]$

Mahout takes the folder (directory) names as the training labels and assigns each label to the documents under that folder.

Put all of these into HDFS and convert them into SequenceFiles:

[hadoop@localhost 20news-all]$ $HADOOP_HOME/bin/hadoop dfs -put * 20News-All
[hadoop@localhost 20news-all]$ $MAHOUT_HOME/bin/mahout seqdirectory -i 20News-All -o 20News-Seq

Generate term vectors:

[hadoop@localhost 20news-all]$ $MAHOUT_HOME/bin/mahout seq2sparse -i 20News-Seq -o 20News-Vectors -lnorm -nv -wt tfidf

Split the original labeled data into training data and test data (30% for testing):

[hadoop@localhost 20news-all]$ $MAHOUT_HOME/bin/mahout split -i 20News-Vectors/tfidf-vectors --trainingOutput 20News-Train-Vectors --testOutput 20News-Test-Vectors --randomSelectionPct 30 --overwrite --sequenceFiles --method sequential

You will now have these on your HDFS:

[hadoop@localhost 20news-all]$ $HADOOP_HOME/bin/hadoop dfs -ls
Found 11 items
drwxr-xr-x   - hadoop supergroup          0 2013-10-18 04:29 /user/hadoop/20News-All
drwxr-xr-x   - hadoop supergroup          0 2013-10-18 04:31 /user/hadoop/20News-Seq
drwxr-xr-x   - hadoop supergroup          0 2013-10-18 05:03 /user/hadoop/20News-Test-Vectors
drwxr-xr-x   - hadoop supergroup          0 2013-10-18 05:03 /user/hadoop/20News-Train-Vectors
drwxr-xr-x   - hadoop supergroup          0 2013-10-18 04:46 /user/hadoop/20News-Vectors

Train your model with the remaining 70% of the data:

[hadoop@localhost 20news-all]$ $MAHOUT_HOME/bin/mahout trainnb -i 20News-Train-Vectors -el -o 20News-NBModel -li 20News-LabelIndex -ow
[hadoop@localhost 20news-all]$

Test your model and check the confusion matrix:

[hadoop@localhost 20news-all]$ $MAHOUT_HOME/bin/mahout testnb -i 20News-Test-Vectors -m 20News-NBModel -l 20News-LabelIndex -ow -o 20News-NB-Testing
[hadoop@localhost 20news-all]$

You will see output like this:

13/10/18 05:23:33 INFO test.TestNaiveBayesDriver: Standard NB Results: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       5172       91.4912%
Incorrectly Classified Instances        :        481        8.5088%
Total Classified Instances              :       5653

=======================================================
Confusion Matrix
-------------------------------------------------------
   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   <--Classified as
 234   0   0   1   0   0   0   0   0   0   0   1   0   1   1   0   0   0   9   1 | 248  a = alt.atheism
   0 256   6  12  10   8   3   0   0   0   0   1   2   1   1   0   0   0   1   0 | 301  b = comp.graphics
   1  15 236  30   5  10   3   0   0   0   0   2   0   0   1   0   0   0   0   1 | 304  c = comp.os.ms-windows.misc
   0   3   8 263   8   3   6   0   0   0   0   0   6   0   0   0   0   0   0   0 | 297  d = comp.sys.ibm.pc.hardware
   1   5   3   8 251   2   3   1   1   0   0   0   2   0   0   0   0   0   0   0 | 277  e = comp.sys.mac.hardware
   0  13   1   2   4 277   2   0   0   0   0   0   1   0   2   0   0   0   0   0 | 302  f = comp.windows.x
   0   2   3  15   3   1 233   6   2   2   0   1   9   1   2   0   0   1   0   1 | 282  g = misc.forsale
   0   2   1   1   3   0   8 255   3   0   0   0   4   1   0   0   0   0   0   0 | 278  h = rec.autos
   0   0   0   0   0   0   0   6 270   0   0   0   0   0   0   0   0   0   0   0 | 276  i = rec.motorcycles
   0   0   0   2   1   0   1   1   1 269   2   0   1   0   1   0   0   0   0   0 | 279  j = rec.sport.baseball
   0   1   0   0   2   0   1   0   1   3 276   0   0   0   0   1   0   0   0   0 | 285  k = rec.sport.hockey
   0   1   1   0   0   2   0   0   0   0   0 323   1   2   0   0   0   3   1   0 | 334  l = sci.crypt
   0   3   0   9   7   2   3   4   0   0   1   3 260   1   2   0   0   0   0   0 | 295  m = sci.electronics
   0   1   0   0   0   0   2   1   0   0   0   0   5 299   1   2   0   1   0   1 | 313  n = sci.med
   0   0   0   0   2   1   0   0   0   0   0   0   1   1 291   0   0   0   0   2 | 298  o = sci.space
   1   2   0   0   1   1   1   0   0   0   0   0   0   4   0 281   3   0   4   1 | 299  p = soc.religion.christian
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 295   1   0   2 | 298  q = talk.politics.mideast
   0   0   0   0   0   0   0   0   0   0   0   1   0   0   1   0   0 253   1  11 | 267  r = talk.politics.guns
  16   1   0   0   0   1   0   1   0   0   0   0   0   0   2  12   2   6 142   4 | 187  s = talk.religion.misc
   1   1   0   1   0   0   0   0   0   0   0   1   0   3   3   0   0  13   2 208 | 233  t = talk.politics.misc

13/10/18 05:23:33 INFO driver.MahoutDriver: Program took 35037 ms (Minutes: 0.584)
[hadoop@localhost 20news-all]$

I believe I did this on Mahout 0.7 or 0.8. (I have not tried it on 0.9 yet.)

Regards,
Y.Mandai

2014-12-01 22:09 GMT+09:00 Jakub Stransky <stransky...@gmail.com>:

> Hello experienced mahout users,
>
> I am new to mahout and I am trying to run the naive bayes classification
> example with the 20news groups categories. I do not understand one thing
> which I am unable to spot. To train categorization I need labeled data. I
> don't see how the label of a particular document is passed to training
> the model.
> I think that I understand TF and IDF etc. but simply don't see how the
> label is passed.
>
> Could someone provide some insight into this?
>
> Thx
> Jakub
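
P.S. If your own documents are not yet sorted by label, the folder convention described at the top of this reply could be set up with something like the sketch below. It is only an illustration: the flat raw directory, the <label>__<id>.txt file naming scheme, and the my-corpus output name are all made up here, and the three sample documents are dummies. The only thing taken from the steps above is that each document ends up under a folder named after its label.

```shell
#!/bin/sh
set -eu

# Scratch area so the sketch can be run anywhere without touching real data.
work=$(mktemp -d)
mkdir -p "$work/raw" "$work/my-corpus"

# Hypothetical input: one flat directory, label encoded in the file name
# as <label>__<id>.txt (an assumption for this example only).
printf 'goalie saves the puck\n' > "$work/raw/rec.sport.hockey__001.txt"
printf 'rocket reaches orbit\n'  > "$work/raw/sci.space__002.txt"
printf 'another orbit story\n'   > "$work/raw/sci.space__003.txt"

# Copy each document into a folder named after its label.
for f in "$work"/raw/*.txt; do
  base=$(basename "$f")
  label=${base%%__*}      # text before the first "__"
  name=${base#*__}        # text after the first "__"
  mkdir -p "$work/my-corpus/$label"
  cp "$f" "$work/my-corpus/$label/$name"
done

# One folder per label; this is the directory you would then put into HDFS.
ls "$work/my-corpus"
```

From inside the resulting my-corpus directory, the hadoop dfs -put and seqdirectory steps shown earlier apply unchanged.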