Horizontally scaling / speeding up Mahout's LDA

2013-12-13 Thread Vishnu Modi
I was experimenting with using Mahout's LDA algorithm. My corpus has around
8 small documents and roughly 45,000 terms. I was getting good
results, but the algorithm takes too long to run: on every iteration the
map phase takes around an hour, so with 10 iterations the job takes a little
over 10 hours. I notice that even though I'm running on a large HDFS
cluster, each map phase runs in only a single mapper. The reduce
phase runs on a large number of reducers, but even with a single
reducer it finishes in under a minute, so in my case that part doesn't need
to scale.

I'm running LDA through the CVB0Driver class. My parameters:

numTopics = 50
numTerms = number of unique terms seen across all documents
alpha = 1 (originally I tried the default of 0.0001 for both alpha and eta)
eta = 1

For everything else I'm just using the defaults. Is it possible somehow to
get the job to run faster (other than lowering the number of topics or
terms)? Would the algorithm not work if it used more than 1 mapper?
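For reference, parameters like these map onto the `cvb` command-line entry
point for CVB0Driver roughly as follows. This is only a sketch: all paths are
placeholders, and the short flags are from the Mahout 0.x `cvb` job, so check
`mahout cvb --help` against your version.

```shell
# Sketch of a CVB0 LDA run; every path below is a placeholder.
# -i: document-term vectors (SequenceFile), -dict: term dictionary,
# -o: topic-term output, -dt: doc-topic output,
# -k: numTopics, -a/-e: alpha and eta smoothing, -x: iterations
mahout cvb \
  -i /path/to/matrix \
  -dict /path/to/dictionary \
  -o /path/to/topic-term \
  -dt /path/to/doc-topic \
  -k 50 -a 1 -e 1 -x 10
```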

Thanks for any help!
Vishnu


Re: Horizontally scaling / speeding up Mahout's LDA

2013-12-13 Thread Gokhan Capan
Hi Vishnu,

You may reduce the split size by setting Hadoop's mapred.max.split.size
configuration parameter.

The number of map tasks will then be equal to the number of splits (input
size / split size).
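The arithmetic above can be sketched as follows, with illustrative numbers
(a 2 GB input and a 128 MB max split size are assumptions, not values from
the thread):

```shell
# Map tasks ~= ceil(input size / split size); illustrative numbers only.
INPUT_BYTES=$((2 * 1024 * 1024 * 1024))   # assume 2 GB of input
SPLIT_BYTES=$((128 * 1024 * 1024))        # assume mapred.max.split.size = 128 MB
NUM_SPLITS=$(( (INPUT_BYTES + SPLIT_BYTES - 1) / SPLIT_BYTES ))  # ceiling division
echo "$NUM_SPLITS"   # prints 16, i.e. up to 16 concurrent map tasks
```

Lowering mapred.max.split.size raises the split count and hence the number of
mappers that can run in parallel, which is what shortens the per-iteration
wall time.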

Best
