Thanks a lot for your interest and time. I'm computer-less for the coming week, but I'll run a few more experiments and post the data as soon as I'm back home.
Thanks.

Benjamin

On 17 Sept 2011, at 00:24, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Benjamin,
>
> Can you post your actual training data on Dropbox or some other place so
> that we can replicate the problem?
>
> On Fri, Sep 16, 2011 at 3:38 PM, Benjamin Rey <benjamin....@c-optimal.com> wrote:
>
>> Unfortunately CNB gives me the same 66% accuracy.
>>
>> I paste the commands for Mahout and Weka below.
>>
>> I also tried removing the biggest class; it helps, but then it's the 2nd
>> biggest class that is overwhelmingly predicted. Mahout bayes seems to
>> favor the biggest class a lot (more than its prior), contrary to Weka's
>> implementation. Is there any choice of parameters, or in the way the
>> weights are computed, that could be causing this?
>>
>> Thanks.
>>
>> Benjamin
>>
>> Here are the commands.
>>
>> On Mahout:
>>
>> # training set: the usual prepare20newsgroups, followed by subsampling
>> # to keep just a few classes
>> bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
>>     -p examples/bin/work/20news-bydate/20news-bydate-train \
>>     -o examples/bin/work/20news-bydate/bayes-train-input \
>>     -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
>> mkdir examples/bin/work/20news_ss/bayes-train-input/
>> head -400 examples/bin/work/20news-bydate/bayes-train-input/alt.atheism.txt \
>>     > examples/bin/work/20news_ss/bayes-train-input/alt.atheism.txt
>> head -200 examples/bin/work/20news-bydate/bayes-train-input/comp.graphics.txt \
>>     > examples/bin/work/20news_ss/bayes-train-input/comp.graphics.txt
>> head -100 examples/bin/work/20news-bydate/bayes-train-input/rec.autos.txt \
>>     > examples/bin/work/20news_ss/bayes-train-input/rec.autos.txt
>> head -100 examples/bin/work/20news-bydate/bayes-train-input/rec.sport.hockey.txt \
>>     > examples/bin/work/20news_ss/bayes-train-input/rec.sport.hockey.txt
>> head -30 examples/bin/work/20news-bydate/bayes-train-input/sci.med.txt \
>>     > examples/bin/work/20news_ss/bayes-train-input/sci.med.txt
>> hdput examples/bin/work/20news_ss/bayes-train-input \
>>     examples/bin/work/20news_ss/bayes-train-input
>>
>> Then the same exact thing for testing.
>>
>> # actual training:
>> bin/mahout trainclassifier \
>>     -i examples/bin/work/20news_ss/bayes-train-input \
>>     -o examples/bin/work/20news-bydate/cbayes-model_ss \
>>     -type cbayes -ng 1 -source hdfs
>>
>> # testing:
>> bin/mahout testclassifier \
>>     -d examples/bin/work/20news_ss/bayes-test-input \
>>     -m examples/bin/work/20news-bydate/cbayes-model_ss \
>>     -type cbayes -ng 1 -source hdfs
>>
>> => 66% accuracy
>>
>> And for Weka:
>>
>> # create the .arff files from the 20news_ss train and test sets;
>> # start each file with the appropriate header:
>> -----
>> @relation _home_benjamin_Data_BY_weka
>>
>> @attribute text string
>> @attribute class {alt.atheism,comp.graphics,rec.autos,rec.sport.hockey,sci.med}
>>
>> @data
>> -----
>> # then paste the data:
>> cat ~/workspace/mahout-0.5/examples/bin/work/20news_ss/bayes-test-input/* \
>>     | perl mh2arff.pl >> 20news_ss_test.arff
>> # with mh2arff.pl:
>> -----
>> use strict;
>> while (<STDIN>) {
>>     chomp;
>>     $_ =~ s/\'/\\\'/g;
>>     $_ =~ s/ $//;
>>     my ($c, $t) = split("\t", $_);
>>     print "'$t',$c\n";
>> }
>> -----
>> # and the train/test command:
>> java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff \
>>     -T 20news_ss_test.arff \
>>     -F "weka.filters.unsupervised.attribute.StringToWordVector -S" \
>>     -W weka.classifiers.bayes.NaiveBayesMultinomial
>>
>> => 92% accuracy
>>
>> 2011/9/16 Robin Anil <robin.a...@gmail.com>
>>
>>> Did you try complementary naive Bayes (CNB)? I am guessing the
>>> multinomial naive Bayes mentioned here is a CNB-like implementation and
>>> not NB.
>>>
>>> On Fri, Sep 16, 2011 at 5:30 PM, Benjamin Rey <benjamin....@c-optimal.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm giving a try to different classifiers for a classical problem of
>>>> text classification, very close to the 20newsgroups one.
>>>> I end up with much better results with Weka NaiveBayesMultinomial than
>>>> with Mahout bayes.
>>>> The main problem comes from the fact that my data is unbalanced. I know
>>>> bayes has difficulties with that, yet I'm surprised by the difference
>>>> between Weka and Mahout.
>>>>
>>>> I went back to the 20newsgroups example, picked 5 classes only and
>>>> subsampled those to get 5 classes with 400, 200, 100, 100 and 30
>>>> examples, and pretty much the same for the test set.
>>>> On Mahout with bayes 1-gram, I'm getting 66% correctly classified (see
>>>> below for the confusion matrix).
>>>> On Weka, on the same exact data, without any tuning, I'm getting 92%
>>>> correctly classified.
>>>>
>>>> Would anyone know where the difference comes from, and if there are
>>>> ways I could tune Mahout to get better results? My data is small enough
>>>> for Weka for now, but this won't last.
>>>>
>>>> Many thanks,
>>>>
>>>> Benjamin.
>>>>
>>>> MAHOUT:
>>>> -------------------------------------------------------
>>>> Correctly Classified Instances   : 491    65.5541%
>>>> Incorrectly Classified Instances : 258    34.4459%
>>>> Total Classified Instances       : 749
>>>>
>>>> =======================================================
>>>> Confusion Matrix
>>>> -------------------------------------------------------
>>>>   a    b    c    d    e    f   <--Classified as
>>>>  14   82    0    4    0    0 | 100  a = rec.sport.hockey
>>>>   0  319    0    0    0    0 | 319  b = alt.atheism
>>>>   0   88    3    9    0    0 | 100  c = rec.autos
>>>>   0   45    0  155    0    0 | 200  d = comp.graphics
>>>>   0   25    0    5    0    0 |  30  e = sci.med
>>>>   0    0    0    0    0    0 |   0  f = unknown
>>>> Default Category: unknown: 5
>>>>
>>>> WEKA:
>>>> java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff \
>>>>     -T 20news_ss_test.arff \
>>>>     -F "weka.filters.unsupervised.attribute.StringToWordVector -S" \
>>>>     -W weka.classifiers.bayes.NaiveBayesMultinomial
>>>>
>>>> === Error on test data ===
>>>>
>>>> Correctly Classified Instances      688    91.8558 %
>>>> Incorrectly Classified Instances     61     8.1442 %
>>>> Kappa statistic                       0.8836
>>>> Mean absolute error                   0.0334
>>>> Root mean squared error               0.1706
>>>> Relative absolute error             11.9863 %
>>>> Root relative squared error         45.151  %
>>>> Total Number of Instances           749
>>>>
>>>> === Confusion Matrix ===
>>>>
>>>>   a   b   c   d   e   <-- classified as
>>>> 308   9   2   0   0 |  a = alt.atheism
>>>>   5 195   0   0   0 |  b = comp.graphics
>>>>   3  11  84   2   0 |  c = rec.autos
>>>>   3   3   0  94   0 |  d = rec.sport.hockey
>>>>   6  11   6   0   7 |  e = sci.med
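
[Editorial note] The majority-class bias discussed in this thread can be reproduced with a toy multinomial naive Bayes. This is a minimal sketch under stated assumptions: it is not Mahout's or Weka's actual code, and the tiny corpus, the class names "big"/"small", and the words "car"/"med" are all made up for illustration. It shows how, on imbalanced data, the class prior plus smoothing can pull a mixed document toward the majority class even when the word evidence alone favors the minority class:

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Multinomial naive Bayes with Laplace smoothing (toy implementation)."""
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d})
    # Log-priors reflect the class imbalance directly.
    log_prior = {c: math.log(labels.count(c) / len(docs)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d)
    log_lik = {}
    for c in classes:
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[c][w] + alpha) / denom) for w in vocab}
    return classes, log_prior, log_lik

def predict(doc, classes, log_prior, log_lik, use_prior=True):
    """Pick the class with the highest (optionally prior-free) log score."""
    def score(c):
        s = log_prior[c] if use_prior else 0.0
        return s + sum(log_lik[c][w] for w in doc if w in log_lik[c])
    return max(classes, key=score)

# Made-up imbalanced corpus: 8 docs of class "big", 2 of class "small".
docs = [["car", "car"]] * 7 + [["med"]] + [["med", "med"]] * 2
labels = ["big"] * 8 + ["small"] * 2
model = train_mnb(docs, labels)

# A mixed document: one majority-class word plus one minority-class word.
doc = ["car", "med"]
print(predict(doc, *model))                   # -> "big": the prior wins
print(predict(doc, *model, use_prior=False))  # -> "small": evidence alone favors the minority
```

With the thread's 400-vs-30 split, the log-prior gap alone is ln(400/30) ≈ 2.6, which the minority class's word evidence must overcome; implementations that additionally smooth or weight term counts differently (as Mahout's bayes/cbayes and Weka's NaiveBayesMultinomial plausibly do) can therefore diverge sharply on the same imbalanced data. Complement naive Bayes was proposed precisely to dampen this kind of bias by estimating each class's parameters from the other classes' counts.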