Try using a small value for the Hadoop parameter "mapred.max.split.size". For a file size of 8.8 MB (~9000 KB), if you want 10 mappers you should use a max split size of 9000/10 = 900.
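As a sketch, the arithmetic above can be expressed as a small shell computation. The 8.8 MB / 9000 KB figure comes from the thread below; note, as a hedge, that stock Hadoop interprets mapred.max.split.size in bytes, so the KB result may need to be scaled by 1024:

```shell
#!/bin/sh
# Sketch of the split-size arithmetic from the thread.
FILE_SIZE_KB=9000     # ~8.8 MB vectors file, expressed in KB
TARGET_MAPPERS=10     # desired number of map tasks
MAX_SPLIT_KB=$((FILE_SIZE_KB / TARGET_MAPPERS))
echo "max split size: ${MAX_SPLIT_KB} KB"       # 900 KB
# mapred.max.split.size is normally given in bytes:
echo "in bytes: $((MAX_SPLIT_KB * 1024))"       # 921600
```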
I don't know if LDADriver implements the Hadoop Tool interface, but if it does you can pass the desired value on the command line as follows:

hadoop jar /root/mahout-core-0.2.job org.apache.mahout.clustering.lda.LDADriver -Dmapred.max.split.size=900 -i hdfs://master/lda/input/vectors -o hdfs://master/lda/output -k 20 -v 10000 --maxIter 40

Please note that it won't work if LDADriver is using a fancy InputFormat other than FileInputFormat. The easiest way to know is just to try it!

--- On Tue, 12 Jan 2010, Chad Hinton <[email protected]> wrote:

> From: Chad Hinton <[email protected]>
> Subject: Re: LDA only executes a single map task per iteration when running in
> actual distributed mode?
> To: "mahout-user" <[email protected]>
> Date: Tuesday, 12 January 2010, 17:13
>
> Ted, David - thanks for your replies.
> I thought Hadoop would automatically split the file, but it is not.
> The vectors file generated from build-reuters.sh (by using
> org.apache.mahout.utils.vectors.lucene.Driver over the Lucene index)
> comes out to around 8.8 MB. Perhaps that is too small and won't be
> split if it's below the HDFS block size. I'm using the default 64 MB
> for HDFS. Perhaps a custom InputSplit/RecordReader is needed to
> split the sequence file. I'll investigate further. If anyone has
> further pointers or more info, please chime in.
>
> Thanks,
> Chad
>
> > It should just happen if the file is large enough and the program is
> > configured for more than one mapper task and the file type is correct.
> >
> > If you are reading an uncompressed sequence file you should be set.
> >
> > On Mon, Jan 11, 2010 at 9:53 PM, David Hall <[email protected]> wrote:
> >
> >> I can brush up on my hadoop foo to figure out how to have
> >> hadoop split up a single file, if you want.
> >>
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
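Chad's hypothesis in the quoted message (a file smaller than one HDFS block yields a single split, hence a single map task) can be sketched with the numbers from the thread; this is an illustration of the default FileInputFormat behavior, not a definitive statement about LDADriver:

```shell
#!/bin/sh
# Sketch: a file smaller than one HDFS block gives roughly one split.
FILE_MB=9        # ~8.8 MB vectors file, rounded up
BLOCK_MB=64      # default HDFS block size mentioned in the thread
# ceil(FILE_MB / BLOCK_MB) approximates the number of splits
SPLITS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))
echo "approx splits: ${SPLITS}"   # 1 -> only one mapper per iteration
```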
