Jeff,

Just curious, have you tried:

./bin/mahout seq2sparse -Dmapred.reduce.tasks=2 -i ./examples/bin/work/reuters-out-seqdir/ -o ./examples/bin/work/reuters-out-seqdir-sparse -wt tf -seq

The mahout script (MahoutDriver) allows arbitrary Hadoop properties to be specified via -D arguments, which are handled, iirc, by the GenericOptionsParser.

Admittedly, I'm divided between 'add an explicit argument/job parameter to handle it' and 'use Hadoop's built-in properties'. The former is nice because it explicitly acknowledges the ability to set the number of reducers and shows up in the -h output, but the latter results in less code to maintain. In this vein, -Dmapred.min.split.size can be tinkered with too. Of course, if GenericOptionsParser isn't involved in the job setup, this is all a moot point.

Also worth pointing out that either of these -D arguments could be specified in the lda.props (lda-reuters.props?) file too, via something like 'DmyProp = provalue'.

Of course, this doesn't really address the root problem -- why LDA on reuters is slow. How long is it taking to run?

Drew

On Wed, May 19, 2010 at 7:08 PM, Jeff Eastman <[email protected]> wrote:

> On 5/19/10 3:19 PM, Jeff Eastman wrote:
>
>> I tried propagating numReducers into its makePartialVectors driver;
>> however, but a single reducer is still all I get. I need to figure out how
>> to tickle the elephant to give me more.
>>
> Note to self: Use a real elephant. Running Hadoop in Eclipse is great for
> debugging but it does not launch multiple mappers or reducers. Running on a
> single-host Hadoop cluster; however, does and the elephant is now dancing
> nicely.
>
> ./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o
> ./examples/bin/work/reuters-out-seqdir-sparse -wt tf -seq -nr 2
>
> now produces two input vector files for LDA to munch on. Now to try it on a
> real cluster...
>
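
In case it helps to picture the -D route discussed above, here is a rough, Hadoop-free sketch of the idea: generic -Dkey=value pairs are peeled off the command line into a property map, and whatever is left goes to the job's own option handling. This is only a stand-in to illustrate the behaviour, not the actual GenericOptionsParser code; the class name DashDSketch and its parse method are made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, NOT Hadoop code: mimics the general idea of
// stripping -D properties from the argument list before the job sees it.
class DashDSketch {

    // Moves "-Dkey=value" and "-D key=value" pairs into props and returns
    // the remaining arguments in their original order. Pairs without '='
    // are silently dropped in this sketch.
    static List<String> parse(String[] args, Map<String, String> props) {
        List<String> rest = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];            // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);     // form: -Dkey=value
            }
            if (pair == null) {
                rest.add(arg);               // not a -D argument, pass through
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq > 0) {
                props.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return rest;
    }

    public static void main(String[] argv) {
        String[] args = {
            "seq2sparse", "-Dmapred.reduce.tasks=2",
            "-i", "./examples/bin/work/reuters-out-seqdir/",
            "-o", "./examples/bin/work/reuters-out-seqdir-sparse",
            "-wt", "tf", "-seq"
        };
        Map<String, String> props = new LinkedHashMap<>();
        List<String> rest = parse(args, props);
        System.out.println("properties: " + props);
        System.out.println("job args:   " + rest);
    }
}
```

The point of the sketch is the trade-off mentioned above: with this route the job code never sees the reducer count as one of its own options, so there is nothing extra to maintain, but nothing for -h to list either.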
