Oops, sorry - the size should be specified in bytes, not KB. So 8.8 MB ~ 9227468 bytes; to get 10 mappers, use mapred.max.split.size=922747.
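To double-check the arithmetic (a minimal sketch, not from the thread - the variable names are made up), the byte-based split size can be computed by rounding up, so that exactly 10 splits cover the whole file:

```shell
#!/bin/sh
# Input size in bytes (8.8 MB, as discussed in the thread) and desired mapper count.
FILE_BYTES=9227468
MAPPERS=10

# Round up: ceil(FILE_BYTES / MAPPERS), so MAPPERS splits cover the whole file.
SPLIT=$(( (FILE_BYTES + MAPPERS - 1) / MAPPERS ))
echo "mapred.max.split.size=$SPLIT"   # prints mapred.max.split.size=922747
```

The result would then be passed on the hadoop jar command line as -Dmapred.max.split.size=$SPLIT, which (as noted below) only works if the driver implements the Hadoop Tool interface.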
--- On Tue, 1/12/10, deneche abdelhakim <[email protected]> wrote:

> From: deneche abdelhakim <[email protected]>
> Subject: Re: LDA only executes a single map task per iteration when running in actual distributed mode?
> To: [email protected]
> Date: Tuesday, January 12, 2010, 17:43
>
> Try using a small value for the Hadoop parameter "mapred.max.split.size".
> For a file size of 8.8 MB (~9000 KB), if you want 10 mappers you should
> use a max split size of 9000/10 = 900.
>
> I don't know if LDADriver implements the Hadoop Tool interface, but if it
> does you can pass the desired value on the command line as follows:
>
> hadoop jar /root/mahout-core-0.2.job \
>   org.apache.mahout.clustering.lda.LDADriver \
>   -Dmapred.max.split.size=900 \
>   -i hdfs://master/lda/input/vectors -o hdfs://master/lda/output \
>   -k 20 -v 10000 --maxIter 40
>
> Please note that it won't work if LDADriver is using a fancy InputFormat
> other than FileInputFormat. The easiest way to know is just to try it!
>
> --- On Tue, 1/12/10, Chad Hinton <[email protected]> wrote:
>
> > From: Chad Hinton <[email protected]>
> > Subject: Re: LDA only executes a single map task per iteration when running in actual distributed mode?
> > To: "mahout-user" <[email protected]>
> > Date: Tuesday, January 12, 2010, 17:13
> >
> > Ted, David - thanks for your replies. I thought Hadoop would
> > automatically split the file, but it is not. The vectors file generated
> > from build-reuters.sh (by using
> > org.apache.mahout.utils.vectors.lucene.Driver over the Lucene index)
> > comes out to around 8.8 MB. Perhaps that is too small and won't be
> > split if it's below the HDFS block size; I'm using the default 64 MB
> > for HDFS. Perhaps a custom InputSplit/RecordReader is needed to split
> > the sequence file. I'll investigate further. If anyone has further
> > pointers or more info, please chime in.
> >
> > Thanks,
> > Chad
> >
> > > It should just happen if the file is large enough and the program is
> > > configured for more than one mapper task and the file type is correct.
> > >
> > > If you are reading an uncompressed sequence file you should be set.
> > >
> > > On Mon, Jan 11, 2010 at 9:53 PM, David Hall <[email protected]> wrote:
> > >
> > > > I can brush up on my hadoop foo to figure out how to have
> > > > hadoop split up a single file, if you want.
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
