Yes. Digging into the slow performance, I noticed that the script that recently replaced build-reuters.sh was computing tf-idf weights. Changing that to plain TF and using sequential-access sparse vectors gave me an immediate 2x speedup, even with the single mapper. The rest of the patch gave me about 3x more using two data nodes, and it would surely scale better than any single-mapper solution.
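For reference, the change amounts to something like this in the script (option names as I remember them from the seq2sparse driver, and the directory names are placeholders, so treat this as a sketch rather than the exact invocation):

    # weight with plain TF instead of tf-idf, and emit
    # sequential-access sparse vectors
    bin/mahout seq2sparse \
      -i reuters-seqfiles \
      -o reuters-vectors \
      -wt tf \
      -seq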

I think the problem lies more in the text preprocessing steps, which currently output only a single vector file. Increasing the number of reducers there would give more parallelism in producing the vectors, and also more vector files to feed to the final processing steps, whether LDA, k-Means or anything else.

Tweaking the input split sizes in the final steps is a way to work around the single-vector-file issue without changing the preprocessing to emit more files. The only thing I'm uncertain about is whether the patch introduces any unintended consequences if the dictionary gets big enough to be sharded.
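Concretely, the split-size tweak would look something like the following (option names from memory, and it assumes the final job reads the new-API mapred.max.split.size property and passes generic -D options through; an old-API job would need the mapred.map.tasks hint instead):

    # cap each input split at ~16MB so a single large vector file
    # still yields several map tasks per iteration
    bin/mahout lda \
      -Dmapred.max.split.size=16777216 \
      -i reuters-vectors \
      -o reuters-lda \
      -k 20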

On 5/19/10 10:10 PM, Grant Ingersoll wrote:
You might find http://www.lucidimagination.com/search/document/39b53fbf4b525f2f/lda_only_executes_a_single_map_task_per_iteration_when_running_in_actual_distributed_mode#311eb323a8208e28 informative.

(BTW, LDA is only meant to run w/ TF)

-Grant

On May 19, 2010, at 9:49 PM, Jeff Eastman wrote:

I ran the Reuters dataset against LDA yesterday on a 2-node cluster and it took a really long time to converge (100 iterations * 10 min each) extracting 20 topics. I was able to reduce the iteration time by 50% by using just TF weighting and SequentialAccessSparseVectors, but it was still only using a single mapper, and that was where most of the time was spent. Digging backwards, I found that only a single vector file is produced by seq2sparse, and also by seqdirectory, so that made sense.

I tried adding a '-chunk 5' param to seqdirectory, but internally that got boosted up to 64, so I removed the boost code and am now able to get 3 part files in tokenized-documents.
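(For the record, the run was along these lines; the directory names are placeholders for my local paths, and the chunk size is in MB:)

    # ask for ~5MB sequence-file chunks instead of the 64MB floor
    bin/mahout seqdirectory \
      -i reuters-extracted \
      -o reuters-seqfiles \
      -chunk 5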

I've tried a similar trick with seq2sparse, but its chunk argument only affects the dictionary-file chunking. I also tried running it with 4 reducers, but I still get only a single part file in vectors. (It does seem that seq2sparse would produce multiple partial-vector files if the dictionary were chunked, but the code then recombines those vectors into a single file.)
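(Roughly what I was running, for the record; the reducer count is passed as a generic Hadoop property here, which only takes effect if the driver goes through ToolRunner, and the paths are again placeholders:)

    # 4 reducers do produce 4 partial-vector files, but the final
    # pass merges them back into a single part file under vectors/
    bin/mahout seq2sparse \
      -Dmapred.reduce.tasks=4 \
      -i reuters-seqfiles \
      -o reuters-vectors \
      -wt tf \
      -chunk 5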

I cannot imagine how one could ever get LDA to scale if it is always limited to a single input vector file. Is there a way to get multiple output vector files from seq2sparse?

