Re: Tuning LDA on Reuters

Drew Farris Wed, 19 May 2010 19:10:48 -0700

Jeff,

Just curious, have you tried:

./bin/mahout seq2sparse -Dmapred.reduce.tasks=2 -i
./examples/bin/work/reuters-out-seqdir/ -o
./examples/bin/work/reuters-out-seqdir-sparse -wt tf -seq

The mahout script (MahoutDriver) allows arbitrary Hadoop properties to be
specified via -D arguments, which are handled, iirc, by the
GenericOptionsParser. Admittedly, I'm divided between 'add an explicit
argument/job parameter to handle it' and 'use Hadoop's built-in properties'.
The former is nice because it explicitly acknowledges the ability to set the
number of reducers and shows up in the -h output, but the latter results in
less code to maintain. In the same vein, -Dmapred.min.split.size can be
tinkered with. Of course, if GenericOptionsParser isn't involved in the job
setup, this is all a moot point.
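
For what it's worth, here is a rough sketch of passing both properties in
one invocation. The helper name seq2sparse_cmd is made up for illustration
(it is not part of Mahout), and the 64 MB split size is an arbitrary example
value, not a recommendation. The function only prints the command line so it
can be inspected before running it for real. Note the -D arguments come
before the job's own options, which is where the GenericOptionsParser
expects generic options.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mahout): prints a seq2sparse command line
# with Hadoop properties passed as -D arguments ahead of the job options.
#   $1 = value for mapred.reduce.tasks
#   $2 = value for mapred.min.split.size, in bytes
seq2sparse_cmd() {
  echo "./bin/mahout seq2sparse -Dmapred.reduce.tasks=$1 -Dmapred.min.split.size=$2" \
       "-i ./examples/bin/work/reuters-out-seqdir/" \
       "-o ./examples/bin/work/reuters-out-seqdir-sparse -wt tf -seq"
}

# 2 reducers, 64 MB minimum split (arbitrary example value)
seq2sparse_cmd 2 67108864
```

Whether either property has any effect still depends on the driver actually
going through GenericOptionsParser, as noted above.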

Also worth pointing out that either of these -D arguments could be specified
in the lda.props (lda-reuters.props?) file too, via something like 'DmyProp =
provalue'.

Of course, this doesn't really address the root problem: why LDA on Reuters
is slow. How long is it taking to run?

Drew

On Wed, May 19, 2010 at 7:08 PM, Jeff Eastman <[email protected]> wrote:

> On 5/19/10 3:19 PM, Jeff Eastman wrote:
>
>>  I tried propagating numReducers into its makePartialVectors driver;
>> however, a single reducer is still all I get. I need to figure out how
>> to tickle the elephant to give me more.
>>
> Note to self: Use a real elephant. Running Hadoop in Eclipse is great for
> debugging, but it does not launch multiple mappers or reducers. Running on a
> single-host Hadoop cluster, however, does, and the elephant is now dancing
> nicely.
>
> ./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o
> ./examples/bin/work/reuters-out-seqdir-sparse -wt tf -seq -nr 2
>
> now produces two input vector files for LDA to munch on. Now to try it on a
> real cluster...
>
