On 5/19/10 1:49 PM, Drew Farris wrote:
On Wed, May 19, 2010 at 3:49 PM, Jeff Eastman <[email protected]> wrote:


I cannot imagine how one could ever get LDA to scale if it is always
limited to a single input vector file. Is there a way to get multiple output
vector files from seqtosparse?

I don't know offhand, but is the default input split (mapred.min.split.size)
size too large for this particular use case? (if it is 0/unspecified it
defaults to the block size, which is 64MB). I wonder if setting that smaller
will allow more mappers to spawn.

The single input file for Reuters is only 18MB, well below the default min.split.size, so yes, 64MB is large for this use case. The preprocessing steps produce only a single vector file for *any* set of input documents, however, and that seems inherently limiting. I was hoping to induce Hadoop to spawn more mappers by creating more input files, but lowering the split size would be another approach. I'd hate to hardwire a smaller size into the LDA job, though, because that would affect all applications of the algorithm. Another job parameter might resolve that problem, but it seems pretty low level for a user to specify.
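For what it's worth, my reading of the old-API FileInputFormat is that the split size comes out as max(minSize, min(goalSize, blockSize)), where goalSize is totalSize divided by the requested number of map tasks. A little standalone sketch of that arithmetic (constants are illustrative, and I'm ignoring the ~10% slop Hadoop allows on the final split) suggests why an 18MB file yields a single mapper at the defaults, and that asking for more map tasks is another lever besides the min split size:

```java
// Sketch of the split-size arithmetic in org.apache.hadoop.mapred.FileInputFormat
// (computeSplitSize); constants below are illustrative, not from an actual run.
public class SplitMath {
    // splitSize = max(minSize, min(goalSize, blockSize))
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // goalSize = totalSize / numMapTasks; ceiling division for the split count,
    // ignoring the slop factor on the last split.
    static long numSplits(long totalSize, long numMapTasks, long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(1, numMapTasks);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        return (totalSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        // 18MB file, 1 requested map task, min split 1 byte, 64MB block:
        // goalSize = 18MB, splitSize = 18MB -> a single split, a single mapper.
        System.out.println(numSplits(18 * mb, 1, 1, 64 * mb)); // 1
        // Requesting 8 map tasks drops goalSize to ~2.25MB -> 8 splits.
        System.out.println(numSplits(18 * mb, 8, 1, 64 * mb)); // 8
    }
}
```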

I'd be happier if the existing numReducers parameter to the SparseVectorsFromSeq job would impact the number of vector output files. It calls the DictionaryVectorizer to convert the tokenized input <docId, {token}> seqFiles into <docId, Vector> files and I discovered that driver currently only launches a single reducer (aha?). I tried propagating numReducers into its makePartialVectors driver; however, but a single reducer is still all I get. I need to figure out how to tickle the elephant to give me more.
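In case it helps anyone reproduce this: the standard old-API knob is JobConf.setNumReduceTasks, and since each reducer writes its own part-NNNNN file, n reducers should mean n output vector files. A hypothetical config fragment of what I'd expect makePartialVectors to do with a forwarded numReducers (names here are assumptions about the wiring, not the actual Mahout code):

```java
// Hypothetical sketch only: forwarding a numReducers parameter to the JobConf.
// The default is 1, which is why everything collapses into one part file.
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();
conf.setNumReduceTasks(numReducers); // n reducers -> n part-NNNNN output files
```

If the driver rebuilds its own JobConf downstream without carrying this value along, the setting would be silently lost, which would match the behavior I'm seeing.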

Of course, another avenue is to look at why Reuters - a tiny dataset by Hadoop standards - doesn't give same-day service even with a single mapper. That's an exercise in profiling that begs to be explored.

Kind of a long-winded reply to your suggestion, but it helps me to lay it all out on the table,
Jeff
