Yes, digging into the slow performance, I noticed that the submission
that recently replaced build-reuters.sh was computing tfidf weights.
Changing that to tf and using sequential-access vectors gave me an
immediate 2x speedup even with the single mapper. The rest of the patch
gave me about another 3x using two data nodes and would surely scale
better than any single-mapper solution.
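For concreteness, the weighting change would look roughly like the following invocation of the vectorizer. This is only a sketch: the option names (-wt, -seq) are from memory and may differ between Mahout versions, so check the driver's --help output before relying on them.

```shell
# Hypothetical sketch of the change described above: switch the
# vectorizer from tfidf to plain tf weighting and ask for
# sequential-access sparse vectors. Option names are illustrative.
bin/mahout seq2sparse \
  -i reuters-seqfiles \
  -o reuters-vectors \
  -wt tf \
  -seq
```

With -wt tf the tfidf pass over the data is skipped entirely, and sequential-access vectors make the per-document dot products in each LDA iteration cheaper than random-access ones.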
I think the problem lies more in the text preprocessing steps, which
currently output only a single vector file. By increasing the number of
reducers, one gets more parallelism in producing the vectors and also
more vector files to feed to the final processing steps, whether LDA,
k-Means, or anything else.
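Since each reducer writes its own part file, the reducer count directly controls how many vector files come out of the vectorization step. A sketch of forcing that through the generic Hadoop property (the pre-0.21 name; whether the vectorizer job honors it is exactly the open question in this thread):

```shell
# Hypothetical: ask the vectorization job for 4 reducers so it emits
# 4 part files instead of 1. mapred.reduce.tasks is the old-style
# Hadoop property; newer releases use mapreduce.job.reduces.
bin/mahout seq2sparse \
  -Dmapred.reduce.tasks=4 \
  -i reuters-tokenized \
  -o reuters-vectors \
  -wt tf \
  -seq
# If honored, the output would be vectors/part-00000 .. part-00003,
# giving the downstream LDA/k-Means job 4 input splits to map over.
```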
Tweaking the input split sizes in the final steps is a way to address
the single-vector-file issue without fixing the preprocessing to produce
more files. The only thing I'm uncertain about is whether the patch
introduces any unintended consequences once the dictionary gets big
enough to be sharded.
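The split-size workaround rests on the fact that SequenceFiles are splittable: shrinking the maximum split size makes Hadoop hand slices of the one big vector file to several mappers. A sketch, with the caveat that the LDA driver's flags here are illustrative placeholders, not verified against any particular Mahout release:

```shell
# Hypothetical: cap splits at 16 MB so the single vector file is read
# by multiple mappers. mapred.max.split.size is the old-style Hadoop
# property; the -k/-x flags for topic count and iterations are assumed.
bin/mahout lda \
  -Dmapred.max.split.size=16777216 \
  -i reuters-vectors \
  -o reuters-lda \
  -k 20 \
  -x 100
```

This buys mapper parallelism per iteration without touching the preprocessing, at the cost of tuning a byte threshold per dataset rather than just setting a reducer count once.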
On 5/19/10 10:10 PM, Grant Ingersoll wrote:
You might find
http://www.lucidimagination.com/search/document/39b53fbf4b525f2f/lda_only_executes_a_single_map_task_per_iteration_when_running_in_actual_distributed_mode#311eb323a8208e28
informative.
(BTW, LDA is only meant to run w/ TF)
-Grant
On May 19, 2010, at 9:49 PM, Jeff Eastman wrote:
I ran the Reuters dataset against LDA yesterday on a 2-node cluster, and it took
a really long time to converge (100 iterations * 10 min each) extracting 20
topics. I was able to reduce the iteration time by 50% by using just TF and
SeqAccSparseVectors, but it was still only using a single mapper, and that was
where most of the time was spent. Digging backwards, I found that there is only
a single vector file produced by seqtosparse and also by seqdirectory, so that
made sense.
I tried adding a '-chunk 5' param to seqdirectory, but internally that got
boosted up to 64, so I removed the boost code and am now able to get 3 part
files in tokenized-documents.
I've tried a similar trick with seqtosparse, but its chunk argument only
affects the dictionary.file chunking. I also tried running it with 4 reducers,
but I still get only a single part file in vectors. (It does seem that
seqtosparse would produce multiple partial vector files if the dictionary were
chunked, but the code then recombines those vectors into a single file.)
I cannot imagine how one could ever get LDA to scale if it is always limited to
a single input vector file. Is there a way to get multiple output vector files
from seqtosparse?