Yes, digging into the slow performance, I noticed that the submission
that recently replaced build-reuters.sh was computing tfidf weights.
Changing that to tf and using sequential-access vectors gave me an
immediate 2x speedup even with the single mapper. The rest of the patch
gave me about another 3x using two data nodes and would surely scale
better than any single-mapper solution.
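For concreteness, the weighting change would look roughly like the following invocation of the vectorizer. This is only a sketch: the option names (-wt, -seq) are from memory and may differ between Mahout versions, so check the driver's --help output before relying on them.

```shell
# Hypothetical sketch of the change described above: switch the
# vectorizer from tfidf to plain tf weighting and ask for
# sequential-access sparse vectors. Option names are illustrative.
bin/mahout seq2sparse \
  -i reuters-seqfiles \
  -o reuters-vectors \
  -wt tf \
  -seq
```

With -wt tf the tfidf pass over the data is skipped entirely, and sequential-access vectors make the per-document dot products in each LDA iteration cheaper than random-access ones.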
I think the problem lies more in the text preprocessing steps, which
currently output only a single vector file. By increasing the number of
reducers, one gets more parallelism in producing the vectors and also
more vector files to feed to the final processing steps, whether LDA,
k-Means, or anything else.
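Since each reducer writes its own part file, the reducer count directly controls how many vector files come out of the vectorization step. A sketch of forcing that through the generic Hadoop property (the pre-0.21 name; whether the vectorizer job honors it is exactly the open question in this thread):

```shell
# Hypothetical: ask the vectorization job for 4 reducers so it emits
# 4 part files instead of 1. mapred.reduce.tasks is the old-style
# Hadoop property; newer releases use mapreduce.job.reduces.
bin/mahout seq2sparse \
  -Dmapred.reduce.tasks=4 \
  -i reuters-tokenized \
  -o reuters-vectors \
  -wt tf \
  -seq
# If honored, the output would be vectors/part-00000 .. part-00003,
# giving the downstream LDA/k-Means job 4 input splits to map over.
```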
Tweaking the input split sizes in the final steps is a way to address
the single-vector-file issue without fixing the preprocessing to produce
more files. The only thing I'm uncertain about is whether the patch
introduces any unintended consequences once the dictionary gets big
enough to be sharded.
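The split-size workaround rests on the fact that SequenceFiles are splittable: shrinking the maximum split size makes Hadoop hand slices of the one big vector file to several mappers. A sketch, with the caveat that the LDA driver's flags here are illustrative placeholders, not verified against any particular Mahout release:

```shell
# Hypothetical: cap splits at 16 MB so the single vector file is read
# by multiple mappers. mapred.max.split.size is the old-style Hadoop
# property; the -k/-x flags for topic count and iterations are assumed.
bin/mahout lda \
  -Dmapred.max.split.size=16777216 \
  -i reuters-vectors \
  -o reuters-lda \
  -k 20 \
  -x 100
```

This buys mapper parallelism per iteration without touching the preprocessing, at the cost of tuning a byte threshold per dataset rather than just setting a reducer count once.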
On 5/19/10 10:10 PM, Grant Ingersoll wrote:
You might find
http://www.lucidimagination.com/search/document/39b53fbf4b525f2f/lda_only_executes_a_single_map_task_per_iteration_when_running_in_actual_distributed_mode#311eb323a8208e28
informative.
(BTW, LDA is only meant to run w/ TF)
-Grant
On May 19, 2010, at 9:49 PM, Jeff Eastman wrote:
I ran the Reuters dataset against LDA yesterday on a 2-node cluster, and it took
a really long time to converge (100 iterations * 10 min each) extracting 20
topics. I was able to reduce the iteration time by 50% by using just TF and
SeqAccSparseVectors, but it was still only using a single mapper, and that was
where most of the time was spent. Digging backwards, I found that there is only
a single vector file produced by seqtosparse and also by seqdirectory, so that
made sense.
I tried adding a '-chunk 5' param to seqdirectory, but internally that got
boosted up to 64, so I removed the boost code and am now able to get 3 part
files in tokenized-documents.
I've tried a similar trick with seqtosparse, but its chunk argument only
affects the dictionary.file chunking. I also tried running it with 4 reducers,
but I still get only a single part file in vectors. (It does seem that
seqtosparse would produce multiple partial vector files if the dictionary were
chunked, but the code then recombines those vectors into a single file.)
I cannot imagine how one could ever get LDA to scale if it is always limited to
a single input vector file. Is there a way to get multiple output vector files
from seqtosparse?