On Mon, Jan 11, 2010 at 2:00 PM, Chad Hinton <[email protected]> wrote:
> I saw two comments related to an actual distributed run of the LDA example
> but no answer to this question. A previous message in the list confirms that
> at least one other person has experienced this issue. I am submitting a
> MapReduce job to a 20 node Hadoop cluster as follows:
>
> hadoop jar /root/mahout-core-0.2.job
> org.apache.mahout.clustering.lda.LDADriver -i
> hdfs://master/lda/input/vectors -o hdfs://master/lda/output -k 20 -v 10000
> --maxIter 40
>
> where lda/input/vectors is the vectors file generated from the standalone
> build-reuters.sh example. I can only get a single map task to execute while
> approx. 57 task slots are available. Has anyone actually run distributed LDA
> successfully? This will help me figure out if I have a Hadoop config issue
> or if there is an actual algorithm implementation problem. The Hadoop
> examples run successfully in distributed mode utilizing all available map
> tasks. I'm not sure if there is an issue with the InputSplit for the
> SequenceFile or something else... Any help is appreciated.

I haven't actually run LDA distributed myself (though I've spoken with someone who has). The Reuters example is pretty simplistic: it doesn't set any input splits for the single vectors file, so it's only going to run on one machine. If you shard the vectors into several files, it should just work. I can brush up on my Hadoop-fu to figure out how to have Hadoop split up a single file, if you want.

-- David
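
To make the single-map-task behaviour concrete, here is a rough sketch of the split arithmetic, modelled on my understanding of how Hadoop's FileInputFormat sizes splits (split size = max(minSize, min(maxSize, blockSize)), with about 10% slack on the last split). Nothing here calls Hadoop itself. The class name SplitEstimate, the 20 MB file size, and the 64 MB block size are made-up values for illustration, and whether the LDADriver in Mahout 0.2 goes through exactly this code path is an assumption.

```java
public class SplitEstimate {
    // Assumed slack factor: the last split may run up to 10% over the target size.
    static final double SPLIT_SLOP = 1.1;

    // Target split size, clamped between the configured minimum and maximum.
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Estimated number of splits (and hence map tasks) for one non-empty file.
    static int countSplits(long fileLength, long blockSize, long minSize, long maxSize) {
        if (fileLength <= 0) {
            throw new IllegalArgumentException("fileLength must be positive");
        }
        long size = splitSize(blockSize, minSize, maxSize);
        long remaining = fileLength;
        int splits = 0;
        while ((double) remaining / size > SPLIT_SLOP) {
            splits++;
            remaining -= size;
        }
        if (remaining != 0) {
            splits++; // whatever is left becomes the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long block = 64 * mb;   // hypothetical HDFS block size
        long vectors = 20 * mb; // hypothetical size of the single vectors file

        // One file smaller than a block, no maximum split size set.
        System.out.println(countSplits(vectors, block, 1L, Long.MAX_VALUE));

        // Same file with the maximum split size capped at 1 MB.
        System.out.println(countSplits(vectors, block, 1L, 1 * mb));

        // Same data sharded into 20 files of 1 MB each: one split per shard.
        int total = 0;
        for (int i = 0; i < 20; i++) {
            total += countSplits(1 * mb, block, 1L, Long.MAX_VALUE);
        }
        System.out.println(total);
    }
}
```

Under these assumed numbers the single file yields one split, while either capping the split size or sharding the vectors into 20 files yields 20, which is why sharding should be enough to spread the work across the cluster.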
