Dear All,

Our team is trying to implement a parallelized LDA with Gibbs sampling. We
are using the algorithm described by plda, http://code.google.com/p/plda/

The problem is that, with the MapReduce method described in the paper, we
need to run one MapReduce job per Gibbs sampling iteration, and in our tests
our data normally needs 1000 to 2000 iterations to converge. As we know,
there is a cost to creating the mappers/reducers and cleaning them up again
in every iteration. In our tests this takes about 40 seconds per iteration
on our cluster, so 1000 iterations means about 11 hours.
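
For a rough sense of scale, the arithmetic can be sketched as below. This
assumes the 40 seconds is a fixed cost paid once per iteration; the function
name is only illustrative. It works out to a little over 11 hours for 1000
iterations and about twice that for 2000.

```python
# Rough estimate only: assumes a fixed per-iteration cost of about 40 seconds,
# as measured in our tests. total_overhead_hours is a hypothetical helper name.
def total_overhead_hours(seconds_per_iteration: float, iterations: int) -> float:
    """Return the accumulated per-iteration cost in hours."""
    return seconds_per_iteration * iterations / 3600.0


low = total_overhead_hours(40, 1000)   # lower end of the iteration range
high = total_overhead_hours(40, 2000)  # upper end of the iteration range
print(f"{low:.1f} h to {high:.1f} h")
```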

I am wondering if there is a way to reduce the cost of mapper/reducer setup
and cleanup. I would prefer to have every mapper read the same local data in
each iteration and update that local data within the mapper process. The only
other updates a mapper needs come from the reducer, and that is a very small
amount of data compared to the whole dataset.

Is there any approach I could try (including changing part of Hadoop's
source code)?


Best wishes,
Stanley Xu
