I ran the Reuters dataset against LDA yesterday on a 2-node cluster and
it took a long time to converge: 100 iterations at roughly 10 minutes
each, extracting 20 topics. I was able to cut the per-iteration time in
half by using plain TF weighting and SequentialAccessSparseVectors, but
the job still used only a single mapper, and that is where most of the
time went. Digging backward through the pipeline, I found that there is
only a single vector file produced by seqtosparse, and likewise by
seqdirectory, which explains the single mapper.
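A simplified model of why a single vector file is a bottleneck: Hadoop assigns mappers per input split, and a lone SequenceFile smaller than the block size yields exactly one split, so one mapper, no matter how many nodes are available. The sketch below is my own illustration with hypothetical sizes, not Hadoop or Mahout code (real split counting also honors min/max split-size settings):

```python
import math

def num_splits(file_sizes_bytes, block_size_bytes=64 * 1024 * 1024):
    """Rough model of Hadoop input-split counting: each file contributes
    ceil(size / block_size) splits, and a split never spans two files."""
    return sum(max(1, math.ceil(s / block_size_bytes))
               for s in file_sizes_bytes)

# One 20 MB vector file -> one split -> one mapper, regardless of cluster size.
print(num_splits([20 * 1024 * 1024]))      # 1
# The same 20 MB spread across 4 part files -> up to 4 concurrent mappers.
print(num_splits([5 * 1024 * 1024] * 4))   # 4
```

This is why splitting the vectors into several part files matters more than adding nodes here.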
I tried adding a '-chunk 5' parameter to seqdirectory, but internally
that value was being boosted up to 64 (MB). After removing the boost
code I now get 3 part files in tokenized-documents.
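The effect of the chunk setting can be pictured as a size-capped rolling writer: once the current part file reaches the chunk size, a new part is started. This is my own sketch of that behavior, not the seqdirectory implementation:

```python
def partition_by_chunk(doc_sizes_mb, chunk_mb):
    """Roll over to a new part file once the running part reaches
    chunk_mb. Returns the sizes (in MB) of the resulting part files."""
    parts, current = [], 0
    for size in doc_sizes_mb:
        if current >= chunk_mb:   # current part is full; start a new one
            parts.append(current)
            current = 0
        current += size
    if current:
        parts.append(current)
    return parts

# ~14 MB of documents with a 5 MB chunk -> 3 part files,
# consistent with the 3 parts observed in tokenized-documents.
print(len(partition_by_chunk([1] * 14, 5)))   # 3
# The boosted 64 MB chunk swallows everything into a single part.
print(len(partition_by_chunk([1] * 14, 64)))  # 1
```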
I tried a similar trick with seqtosparse, but its chunk argument only
affects the chunking of dictionary.file. I also tried running it with 4
reducers, yet I still get only a single part file in vectors. (It does
appear that seqtosparse would produce multiple partial vector files if
the dictionary were chunked, but the code then recombines those partials
into a single file.)
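The recombination step described above can be sketched as a plain merge: partial vectors produced per dictionary chunk are folded into one collection keyed by document, which is where the file-level parallelism disappears. All names here are illustrative, not Mahout's:

```python
def merge_partials(partial_files):
    """Fold per-dictionary-chunk partial vectors into one map, combining
    the term entries for documents that appear in several partials."""
    merged = {}
    for partial in partial_files:
        for doc_id, vec in partial.items():
            acc = merged.setdefault(doc_id, {})
            for term, weight in vec.items():
                acc[term] = acc.get(term, 0.0) + weight
    return merged

# Two partials (one per dictionary chunk) collapse into a single output,
# so downstream LDA sees one vector file however many chunks existed.
p1 = {"doc1": {0: 1.0}, "doc2": {1: 2.0}}
p2 = {"doc1": {5: 3.0}}
print(merge_partials([p1, p2]))
```

If the partials were instead left as separate part files, each could feed its own mapper.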
I cannot imagine how one could ever get LDA to scale if it is always
limited to a single input vector file. Is there a way to get multiple
output vector files from seqtosparse?
- Jeff Eastman, "Tuning LDA on Reuters"