On 5/19/10 1:49 PM, Drew Farris wrote:
On Wed, May 19, 2010 at 3:49 PM, Jeff Eastman <[email protected]> wrote:
I cannot imagine how one could ever get LDA to scale if it is always
limited to a single input vector file. Is there a way to get multiple output
vector files from seqtosparse?
I don't know offhand, but is the default input split (mapred.min.split.size)
size too large for this particular use case? (if it is 0/unspecified it
defaults to the block size, which is 64MB). I wonder if setting that smaller
will allow more mappers to spawn.
The single input file for Reuters is only 18MB, well below the
default min.split.size, so yes, 64MB is large for this use case. The
preprocessing steps produce only a single vector file for *any*
set of input documents, however, and that seems inherently
limiting. I was hoping to induce Hadoop to spawn more mappers by
creating more input files, but lowering the split size would be another
approach. I'd hate to hardwire a smaller size into the LDA job, though,
because that would affect all applications of the algorithm. Another job
parameter might resolve that, but it seems a pretty low-level thing for a
user to have to specify.
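To make the split-size discussion concrete, here is a back-of-the-envelope sketch in plain Java. It assumes the standard FileInputFormat rule splitSize = max(minSplitSize, min(maxSplitSize, blockSize)); the 18MB file and 64MB block figures are the ones from this thread, and the 4MB cap is just an invented example value:

```java
// Estimate how many mappers Hadoop will spawn for a single input file,
// assuming the FileInputFormat rule:
//   splitSize = max(minSplitSize, min(maxSplitSize, blockSize))
// Numbers below: 18MB Reuters vector file, 64MB default HDFS block.
public class SplitEstimate {
    static long splitSize(long minSplit, long maxSplit, long blockSize) {
        return Math.max(minSplit, Math.min(maxSplit, blockSize));
    }

    static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 18 * mb;   // the single Reuters vector file
        long blockSize = 64 * mb;  // default HDFS block size

        // Defaults: split size collapses to the block size, so the whole
        // 18MB file fits in one split -> one mapper.
        long s1 = splitSize(0, Long.MAX_VALUE, blockSize);
        System.out.println(numSplits(fileSize, s1) + " mapper(s) at defaults");

        // Capping the split size at 4MB (an arbitrary example) would
        // instead give ceil(18/4) = 5 splits -> 5 mappers.
        long s2 = splitSize(0, 4 * mb, blockSize);
        System.out.println(numSplits(fileSize, s2) + " mapper(s) with 4MB cap");
    }
}
```

This also shows why *lowering* the split size only helps up to a point: the per-mapper startup overhead starts to dominate once splits get small.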
I'd be happier if the existing numReducers parameter to the
SparseVectorsFromSeq job actually affected the number of vector output
files. It calls the DictionaryVectorizer to convert the tokenized input
<docId, {token}> seqFiles into <docId, Vector> files, and I discovered
that driver currently only launches a single reducer (aha?). I tried
propagating numReducers into its makePartialVectors driver, but a
single reducer is still all I get. I need to figure out how to tickle
the elephant into giving me more.
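The mechanics of why more reducers should mean more vector files can be sketched in plain Java (this is not Mahout code, and the docIds are invented). Hadoop routes each map output key through a partitioner, by default the same arithmetic as HashPartitioner below, and each reducer writes its own part-* output file:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of why R reduce tasks should yield R partial vector files:
// each key is assigned a partition, each partition feeds one reducer,
// and each reducer writes one part-* file. Illustration only.
public class PartitionSketch {
    // Same arithmetic as Hadoop's default HashPartitioner.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 4; // what numReducers ought to propagate down to
        Map<Integer, Integer> perReducer = new TreeMap<>();
        for (int doc = 0; doc < 1000; doc++) {
            // Hypothetical docIds standing in for the <docId, Vector> keys.
            perReducer.merge(partition("doc-" + doc, numReducers), 1, Integer::sum);
        }
        // Each populated partition corresponds to one part-* output file.
        perReducer.forEach((p, n) ->
            System.out.println("part-" + p + ": " + n + " vectors"));
    }
}
```

So if makePartialVectors is hardwired to one reduce task somewhere, every docId lands in partition 0 and a single output file is all you can ever get, no matter what the caller requests.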
Of course, another avenue is to look at why Reuters - a tiny dataset by
Hadoop standards - doesn't get same-day service even with a single
mapper. That's an exercise in profiling that begs to be explored.
Kind of a long-winded reply to your suggestion, but it helps me to lay
it all out on the table,
Jeff