Thanks a lot, Ted. Checking HaLoop and Plume now. I can always count on you for the answer. :-)
On Thu, May 5, 2011 at 10:42 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Stanley,
>
> The short answer is that this is a real problem.
>
> Try this:
>
> *Spark: Cluster Computing with Working Sets.* Matei Zaharia, Mosharaf
> Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, in
> HotCloud 2010, June 2010.
>
> Or this: http://www.iterativemapreduce.org/
>
> http://code.google.com/p/haloop/
>
> You may be interested in experimenting with MapReduce 2.0. That allows
> more flexibility in the execution model:
>
> http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/
>
> Systems like FlumeJava (and my open-source, incomplete clone Plume)
> may help with flexibility:
>
> http://www.deepdyve.com/lp/association-for-computing-machinery/flumejava-easy-efficient-data-parallel-pipelines-xtUvap2t1I
>
> https://github.com/tdunning/Plume/commit/a5a10feaa068b33b1d929c332e4614aba50dd39a
>
>
> On Thu, May 5, 2011 at 2:16 AM, Stanley Xu <wenhao...@gmail.com> wrote:
>
>> Dear All,
>>
>> Our team is trying to implement a parallelized LDA with Gibbs
>> sampling, using the algorithm described by plda:
>> http://code.google.com/p/plda/
>>
>> The problem is that with the MapReduce method the paper describes, we
>> need to run a MapReduce job for every Gibbs sampling iteration, and
>> in our tests it takes 1000-2000 iterations on our data to converge.
>> But as we know, there is a cost to set up and clean up the
>> mappers/reducers in every iteration; in our tests this takes about
>> 40 seconds per iteration on our cluster, so 1000 iterations means
>> almost 12 hours.
>>
>> I am wondering if there is a way to reduce the cost of the
>> mapper/reducer setup/cleanup, since I would prefer to have all the
>> mappers read the same local data and update that local data within
>> the mapper process. All the other updates a mapper needs come from
>> the reducer, which is a small amount of data compared to the whole
>> dataset.
>>
>> Is there any approach I could try (including changing part of
>> Hadoop's source code)?
>>
>> Best wishes,
>> Stanley Xu
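
For reference, here is a minimal sketch (not from the thread) of the
per-iteration driver loop Stanley describes: one Hadoop job per Gibbs
sampling iteration, with the small reducer output shipped back to the
next iteration's mappers via the distributed cache. The class names
GibbsDriver, LdaMapper, and LdaReducer and the path layout are
hypothetical; the JVM-reuse property is the Hadoop 0.20/1.x setting and
only reuses task JVMs within a job, so the per-job startup cost the
thread is about still remains.

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class GibbsDriver {

    // Hypothetical stubs; the real classes would resample topic
    // assignments and emit updated word-topic counts.
    public static class LdaMapper
        extends Mapper<Object, Text, Text, IntWritable> {}
    public static class LdaReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {}

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Reuse task JVMs within a single job (Hadoop 0.20/1.x property);
      // this does not avoid the per-job setup/teardown cost.
      conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

      Path input = new Path(args[0]);
      Path modelDir = new Path(args[1]);
      int iterations = Integer.parseInt(args[2]); // e.g. 1000-2000

      for (int i = 0; i < iterations; i++) {
        Job job = new Job(conf, "gibbs-iteration-" + i);
        job.setJarByClass(GibbsDriver.class);
        job.setMapperClass(LdaMapper.class);
        job.setReducerClass(LdaReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, new Path(modelDir, "iter-" + i));

        if (i > 0) {
          // Ship the previous iteration's (small) global counts to every
          // mapper; the bulk corpus stays where it is.
          URI prev = new Path(modelDir, "iter-" + (i - 1) + "/part-r-00000").toUri();
          DistributedCache.addCacheFile(prev, job.getConfiguration());
        }

        if (!job.waitForCompletion(false)) {
          throw new RuntimeException("Iteration " + i + " failed");
        }
        // The ~40 seconds of job setup/cleanup happens once per pass of
        // this loop, which is exactly the overhead that the iterative
        // systems Ted lists (Spark, HaLoop, Twister) avoid by keeping
        // workers alive across iterations.
      }
    }
  }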