Dear All,

Our team is trying to implement a parallelized LDA with Gibbs sampling, using the algorithm described by plda: http://code.google.com/p/plda/
The problem is that with the MapReduce method the paper describes, we need to run a MapReduce job for every Gibbs sampling iteration, and in our tests it normally takes 1000-2000 iterations on our data to converge. As we know, there is a cost to creating and cleaning up the mappers/reducers in every iteration; in our tests it is about 40 seconds per iteration on our cluster, so 1000 iterations means almost 12 hours of overhead alone.

I am wondering if there is a way to reduce this mapper/reducer setup/cleanup cost. Ideally, I would like all the mappers to keep reading and updating the same local data across iterations, since the only other update a mapper needs comes from the reducer, and that data is quite small compared to the whole dataset. Is there any approach I could try (including changing part of Hadoop's source code)?

Best wishes,
Stanley Xu
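P.S. A quick back-of-the-envelope check of where the "almost 12 hours" figure comes from (the 40 s per-job setup/cleanup cost and the 1000-iteration count are the numbers from our tests; the script itself is only illustrative):

```python
# Illustrative only: estimate the cumulative MapReduce job setup/cleanup
# overhead when one job is launched per Gibbs sampling iteration.
SETUP_COST_S = 40      # observed per-job setup/cleanup cost on our cluster
ITERATIONS = 1000      # typical iteration count to converge on our data

total_overhead_s = SETUP_COST_S * ITERATIONS
total_overhead_h = total_overhead_s / 3600.0
print(f"{total_overhead_s} s = {total_overhead_h:.1f} h of pure job-launch overhead")
# 40000 s is roughly 11.1 hours, i.e. almost 12 hours before any sampling work
```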