Hi Vishnu,
You can reduce the split size by setting the mapred.max.split.size
configuration parameter of Hadoop.
The number of map tasks will then be equal to the number of splits
(input size / split size).
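For example (the jar name, driver class, and paths below are placeholders; the split size value is illustrative), the parameter can be passed on the command line through Hadoop's generic -D option:

```shell
# Cap each input split at 16 MB (16777216 bytes) so the same input
# is divided across more map tasks. The default split size usually
# matches the HDFS block size (often 64 MB).
# Jar, class, and paths are placeholders for your own job.
hadoop jar myjob.jar MyDriver \
  -Dmapred.max.split.size=16777216 \
  /input /output
```

The same property can also be set in mapred-site.xml or on a Configuration object before submitting the job.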
Best
Sent from my iPhone
On Dec 13, 2013, at 21:08, Vishnu Modi vishnu.modi...@gmail.com wrote:
I was experimenting with Mahout's LDA algorithm. My corpus has around
8 small documents and roughly 45,000 terms. I was getting good
results, but the algorithm takes too long to run: on every iteration the
mapper takes around an hour, so with 10 iterations the job takes a little
over 10 hours. I notice that even though I'm running on a large HDFS
cluster, each map stage runs in only a single mapper. The reduce
stage runs on a large number of reducers, but even on a single reducer
it finishes in under a minute, so in my case that part doesn't need to
scale.
I'm running LDA through the CVB0Driver class. My parameters:
numTopics = 50
numTerms = number of unique terms seen across all documents
alpha = 1 (originally I tried the default of 0.0001 for both alpha and eta)
eta = 1
For everything else I'm just using the defaults. Is it possible to get the
job to run faster (other than lowering the number of topics or terms)?
Would the algorithm not work if it used more than one mapper?
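For context, these settings correspond roughly to a cvb invocation like the following (a sketch only: flag names are as I understand Mahout's cvb job, input/output paths are placeholders, and numTerms is taken from the dictionary):

```shell
# Hypothetical invocation matching the parameters above
# (paths are placeholders; -x is the iteration count).
mahout cvb \
  -i /path/to/matrix \
  -o /path/to/topics \
  -k 50 \
  -x 10 \
  -a 1.0 \
  -e 1.0
```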
Thanks for any help!
Vishnu