Oops, sorry - the size should be specified in bytes, not KB. So 8.8 MB ~ 9227468 bytes; to get 10 mappers, use mapred.max.split.size=922747.
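To double-check the arithmetic (a minimal sketch, not from the thread - the variable names are made up), the byte-based split size can be computed by rounding up, so that exactly 10 splits cover the whole file:

```shell
#!/bin/sh
# Input size in bytes (8.8 MB, as discussed in the thread) and desired mapper count.
FILE_BYTES=9227468
MAPPERS=10

# Round up: ceil(FILE_BYTES / MAPPERS), so MAPPERS splits cover the whole file.
SPLIT=$(( (FILE_BYTES + MAPPERS - 1) / MAPPERS ))
echo "mapred.max.split.size=$SPLIT"   # prints mapred.max.split.size=922747
```

The result would then be passed on the hadoop jar command line as -Dmapred.max.split.size=$SPLIT, which (as noted below) only works if the driver implements the Hadoop Tool interface.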
--- On Tue, 1/12/10, deneche abdelhakim <[email protected]> wrote:

> From: deneche abdelhakim <[email protected]>
> Subject: Re: LDA only executes a single map task per iteration when running in actual distributed mode?
> To: [email protected]
> Date: Tuesday, January 12, 2010, 17:43
>
> Try using a small value for the Hadoop parameter "mapred.max.split.size".
> For a file size of 8.8 MB (~9000 KB), if you want 10 mappers you should
> use a max split size of 9000/10 = 900.
>
> I don't know if LDADriver implements the Hadoop Tool interface, but if it
> does you can pass the desired value on the command line as follows:
>
> hadoop jar /root/mahout-core-0.2.job \
>   org.apache.mahout.clustering.lda.LDADriver \
>   -Dmapred.max.split.size=900 \
>   -i hdfs://master/lda/input/vectors -o hdfs://master/lda/output \
>   -k 20 -v 10000 --maxIter 40
>
> Please note that it won't work if LDADriver is using a fancy InputFormat
> other than FileInputFormat. The easiest way to know is just to try it!
>
> --- On Tue, 1/12/10, Chad Hinton <[email protected]> wrote:
>
> > From: Chad Hinton <[email protected]>
> > Subject: Re: LDA only executes a single map task per iteration when running in actual distributed mode?
> > To: "mahout-user" <[email protected]>
> > Date: Tuesday, January 12, 2010, 17:13
> >
> > Ted, David - thanks for your replies. I thought Hadoop would
> > automatically split the file, but it is not. The vectors file generated
> > from build-reuters.sh (by using
> > org.apache.mahout.utils.vectors.lucene.Driver over the Lucene index)
> > comes out to around 8.8 MB. Perhaps that is too small and won't be
> > split if it's below the HDFS block size; I'm using the default 64 MB
> > for HDFS. Perhaps a custom InputSplit/RecordReader is needed to split
> > the sequence file. I'll investigate further. If anyone has further
> > pointers or more info, please chime in.
> >
> > Thanks,
> > Chad
> >
> > > It should just happen if the file is large enough and the program is
> > > configured for more than one mapper task and the file type is correct.
> > >
> > > If you are reading an uncompressed sequence file you should be set.
> > >
> > > On Mon, Jan 11, 2010 at 9:53 PM, David Hall <[email protected]> wrote:
> > >
> > > > I can brush up on my hadoop foo to figure out how to have
> > > > hadoop split up a single file, if you want.
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
