Haven't tried that approach yet, but it may have the same effect.
Seq2sparse already has an option to set the number of reducers (-nr); it
was just not propagating to the last stage, vector generation. Would the
-D option override that option? Would it apply to all the Hadoop jobs
spawned by seq2sparse? If so, then they are probably equivalent.
On 5/19/10 7:10 PM, Drew Farris wrote:
Jeff,
Just curious, have you tried:
    ./bin/mahout seq2sparse -Dmapred.reduce.tasks=2 \
      -i ./examples/bin/work/reuters-out-seqdir/ \
      -o ./examples/bin/work/reuters-out-seqdir-sparse -wt tf -seq
The mahout script (MahoutDriver) allows arbitrary hadoop properties to be
specified via -D arguments which are handled iirc by the
GenericOptionsParser. Admittedly, I'm divided between 'add an explicit
argument/job parameter to handle it' vs. 'use hadoop built in properties'.
The former is nice because it explicitly acknowledges the ability to set
the number of reducers and shows up in the -h output, but the latter
results in less code to maintain. In this vein, -Dmapred.min.split.size can
be tinkered with. Of course if GenericOptionsParser isn't involved in the
job setup this is all a moot point.
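
To make the two styles easy to compare side by side, here is a minimal sketch that only prints the two command lines and runs nothing. The helper name seq2sparse_cmd is hypothetical and not part of Mahout; the flags and paths are the ones quoted in this thread. Whether the -D form reaches every job that seq2sparse spawns still depends on GenericOptionsParser being involved in the job setup, as noted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mahout): print the seq2sparse command line
# for a given reducer count, using either the generic Hadoop property or the
# explicit -nr flag discussed in this thread. It only echoes; nothing is run.
seq2sparse_cmd() {
  style=$1       # "generic" (-Dmapred.reduce.tasks) or "explicit" (-nr)
  reducers=$2
  input=./examples/bin/work/reuters-out-seqdir/
  output=./examples/bin/work/reuters-out-seqdir-sparse
  case "$style" in
    generic)
      echo "./bin/mahout seq2sparse -Dmapred.reduce.tasks=$reducers -i $input -o $output -wt tf -seq" ;;
    explicit)
      echo "./bin/mahout seq2sparse -i $input -o $output -wt tf -seq -nr $reducers" ;;
    *)
      echo "unknown style: $style" >&2
      return 1 ;;
  esac
}

# Print both variants for two reducers.
seq2sparse_cmd generic 2
seq2sparse_cmd explicit 2
```

Running each printed command on the same cluster and comparing the resulting reducer counts would be one way to check whether the two are in fact equivalent.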
Also worth pointing out that either of these -D arguments could be specified
in the lda.props (lda-reuters.props?) file too, via something like 'DmyProp =
provalue'
Of course, this doesn't really address the root problem: why LDA on
Reuters is slow. How long is it taking to run?
Drew
On Wed, May 19, 2010 at 7:08 PM, Jeff Eastman wrote:
On 5/19/10 3:19 PM, Jeff Eastman wrote:
I tried propagating numReducers into its makePartialVectors driver, but a
single reducer is still all I get. I need to figure out how to tickle the
elephant to give me more.
Note to self: Use a real elephant. Running Hadoop in Eclipse is great for
debugging, but it does not launch multiple mappers or reducers. Running on a
single-host Hadoop cluster, however, does, and the elephant is now dancing
nicely.
    ./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ \
      -o ./examples/bin/work/reuters-out-seqdir-sparse -wt tf -seq -nr 2
now produces two input vector files for LDA to munch on. Now to try it on a
real cluster...
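
A quick way to confirm the reducer count took effect is to count the part files in the job's output directory, since Hadoop conventionally writes one part file per reducer. The sketch below assumes the output sits on (or has been copied to) the local filesystem; the function name count_part_files is hypothetical, and the directory to inspect is left to the caller because the exact subdirectory seq2sparse writes vectors to is not stated in this thread.

```shell
#!/bin/sh
# Hypothetical check: count the part-* files in a job output directory on the
# local filesystem. The caller supplies the directory to inspect.
count_part_files() {
  dir=$1
  count=0
  for f in "$dir"/part-*; do
    # The glob stays literal when nothing matches, so test that the file exists.
    [ -e "$f" ] && count=$((count + 1))
  done
  echo "$count"
}
```

Calling count_part_files with the vector output directory should print 2 after the -nr 2 run above. For output that lives on HDFS, listing the directory with hadoop fs -ls gives the same information.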