On Mon, Sep 13, 2010 at 3:41 AM, Gangadhar Nittala <[email protected]> wrote:
> All,
>
> I am following the details given in the Mahout wiki to run the Bayes
> example [https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html]
> with the 0.4 trunk code. I had to make a few modifications to the
> commands to match the 0.4 snapshot, but when I run Step 6 - training
> the classifier (I was able to get everything up to Step 5 right) with
>
>   $HADOOP_HOME/bin/hadoop jar \
>     $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
>     org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3 \
>     --input wikipediainput10 --output wikipediamodel10 \
>     --classifierType bayes --dataSource hdfs
>
> the machine runs out of disk space.
>
> I did not run this over the complete enwiki-latest-pages-articles.xml,
> but only over a part of it, enwiki-latest-pages-articles10.xml
> [http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml.bz2].
> Even with this, HDFS fills up my 50 GB disk. Is this normal? Does
> training the classifier consume so much space? Or is this something
> that can be controlled via Hadoop settings? I ask because, when I
> terminated the classifier process, stopped Hadoop (ran
> $HADOOP_HOME/bin/stop-all.sh) and checked the disk space, it was back
> to what it had been (around 43 GB free).

Yes, for now the classifier doesn't delete its intermediate files; the final
model is much smaller (< 1 GB). If you need to reclaim the space after an
aborted run, see the commands sketched at the end of this mail.

> If the space usage is normal, is there a smaller set over which I can
> run the classifier? I want to see the classifier's output before I try
> to understand the code (the intent is also for me to understand how to
> run Mahout algorithms and write example code). Should I be asking this
> sort of question on the mahout-users list?

Try using the WikipediaDatasetCreator to select articles from a given
category list; see the code for more details. A rough sketch of the
invocation is at the end of this mail.

> Thank you
> Gangadhar
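For checking where the space goes and for cleaning up after a killed run, the
stock Hadoop shell commands are enough. Part of what filled up may also have
been MapReduce's local intermediate data (under mapred.local.dir), which is
removed when the job dies - that would explain why the space came back after
you stopped Hadoop. A small sketch, assuming the directory names from the
command above (adjust to whatever --input/--output you actually used):

  # overall DFS capacity and per-datanode usage
  $HADOOP_HOME/bin/hadoop dfsadmin -report

  # per-directory usage under your HDFS home directory
  $HADOOP_HOME/bin/hadoop fs -du .

  # remove the partially written model output to free the space
  $HADOOP_HOME/bin/hadoop fs -rmr wikipediamodel10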
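On the WikipediaDatasetCreator suggestion: the idea is to re-run the
dataset-creation step (Step 5 on the wiki page) with a small categories file,
so that only articles matching those categories end up in the training input.
The class name and option names below are from memory and may not match the
0.4 trunk exactly - treat this as a sketch and check the driver's usage output
(run it with no arguments) or the source before relying on it:

  # hypothetical invocation - verify the class name and flags against your checkout
  $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
    org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver \
    -i <your-chunked-dump-dir> -o wikipediainput10 -c categories.txt

Here categories.txt is a plain text file with one category name per line; if I
recall correctly, the wiki example uses a list of country names for this, and
you can trim such a list down to keep the dataset small.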
