On 5/19/10 1:49 PM, Drew Farris wrote:
On Wed, May 19, 2010 at 3:49 PM, Jeff Eastman <[email protected]> wrote:
I cannot imagine how one could ever get LDA to scale if it is always
limited to a single input vector file. Is there a way to get multiple output
vector files from seqtosparse?
I don't know offhand, but is the default input split (mapred.min.split.size)
size too large for this particular use case? (if it is 0/unspecified it
defaults to the block size, which is 64MB). I wonder if setting that smaller
will allow more mappers to spawn.
The single input file for Reuters is only 18MB, well below the
default min.split.size, so yes, 64MB is large for this use case. The
preprocessing steps produce only a single vector file for *any*
set of input documents, however, and that seems inherently
limiting. I was hoping to induce Hadoop to spawn more mappers by
creating more input files, but lowering the split size would be another
approach. I'd hate to hardwire a smaller size into the LDA job, though,
because that would affect all applications of the algorithm. Another job
parameter might resolve that, but it seems a pretty low-level thing for a
user to have to specify.
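To make the split-size discussion concrete, here is a back-of-the-envelope sketch in plain Java. It assumes the standard FileInputFormat rule splitSize = max(minSplitSize, min(maxSplitSize, blockSize)); the 18MB file and 64MB block figures are the ones from this thread, and the 4MB cap is just an invented example value:

```java
// Estimate how many mappers Hadoop will spawn for a single input file,
// assuming the FileInputFormat rule:
//   splitSize = max(minSplitSize, min(maxSplitSize, blockSize))
// Numbers below: 18MB Reuters vector file, 64MB default HDFS block.
public class SplitEstimate {
    static long splitSize(long minSplit, long maxSplit, long blockSize) {
        return Math.max(minSplit, Math.min(maxSplit, blockSize));
    }

    static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 18 * mb;   // the single Reuters vector file
        long blockSize = 64 * mb;  // default HDFS block size

        // Defaults: split size collapses to the block size, so the whole
        // 18MB file fits in one split -> one mapper.
        long s1 = splitSize(0, Long.MAX_VALUE, blockSize);
        System.out.println(numSplits(fileSize, s1) + " mapper(s) at defaults");

        // Capping the split size at 4MB (an arbitrary example) would
        // instead give ceil(18/4) = 5 splits -> 5 mappers.
        long s2 = splitSize(0, 4 * mb, blockSize);
        System.out.println(numSplits(fileSize, s2) + " mapper(s) with 4MB cap");
    }
}
```

This also shows why *lowering* the split size only helps up to a point: the per-mapper startup overhead starts to dominate once splits get small.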
I'd be happier if the existing numReducers parameter to the
SparseVectorsFromSeq job actually affected the number of vector output
files. It calls the DictionaryVectorizer to convert the tokenized input
<docId, {token}> seqFiles into <docId, Vector> files, and I discovered
that driver currently only launches a single reducer (aha?). I tried
propagating numReducers into its makePartialVectors driver, but a
single reducer is still all I get. I need to figure out how to tickle
the elephant into giving me more.
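The mechanics of why more reducers should mean more vector files can be sketched in plain Java (this is not Mahout code, and the docIds are invented). Hadoop routes each map output key through a partitioner, by default the same arithmetic as HashPartitioner below, and each reducer writes its own part-* output file:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of why R reduce tasks should yield R partial vector files:
// each key is assigned a partition, each partition feeds one reducer,
// and each reducer writes one part-* file. Illustration only.
public class PartitionSketch {
    // Same arithmetic as Hadoop's default HashPartitioner.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 4; // what numReducers ought to propagate down to
        Map<Integer, Integer> perReducer = new TreeMap<>();
        for (int doc = 0; doc < 1000; doc++) {
            // Hypothetical docIds standing in for the <docId, Vector> keys.
            perReducer.merge(partition("doc-" + doc, numReducers), 1, Integer::sum);
        }
        // Each populated partition corresponds to one part-* output file.
        perReducer.forEach((p, n) ->
            System.out.println("part-" + p + ": " + n + " vectors"));
    }
}
```

So if makePartialVectors is hardwired to one reduce task somewhere, every docId lands in partition 0 and a single output file is all you can ever get, no matter what the caller requests.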
Of course, another avenue is to look at why Reuters - a tiny dataset by
Hadoop standards - doesn't get same-day service even with a single
mapper. That's an exercise in profiling that begs to be explored.
Kind of a long-winded reply to your suggestion, but it helps me to lay
it all out on the table,
Jeff