Weird timing given our discussion, thanks to @ogrisel in my Tweet inbox: "Revisiting k-means: New Algorithms via Bayesian Nonparametrics" http://arxiv.org/abs/1111.0352
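Skimming it: the paper derives a k-means-like hard-clustering analogue of the DP mixture ("DP-means"), where a point farther than a threshold lambda from every current centroid spawns a new cluster. A rough one-pass sketch of just the assignment rule, in 1-D (class name, toy data, and the single-pass simplification are mine, not from the paper):

```java
import java.util.ArrayList;
import java.util.List;

public class DpMeansSketch {
    // One assignment pass of DP-means on 1-D points: assign each point to
    // its nearest centroid unless even the nearest is farther than lambda,
    // in which case the point seeds a new cluster.
    static List<Double> assign(double[] points, double lambda) {
        List<Double> centroids = new ArrayList<>();
        for (double p : points) {
            double best = Double.MAX_VALUE;
            int bestIdx = -1;
            for (int c = 0; c < centroids.size(); c++) {
                double d = Math.abs(p - centroids.get(c));
                if (d < best) { best = d; bestIdx = c; }
            }
            if (bestIdx < 0 || best > lambda) {
                centroids.add(p);  // too far from everything: new cluster
            }
            // (the real algorithm also records assignments and
            //  re-estimates centroids until convergence)
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] pts = { 0.1, 0.2, 5.0, 5.1, 9.9 };
        // three well-separated groups -> three centroids
        System.out.println(assign(pts, 1.0));
    }
}
```

Lambda plays the role the DP concentration parameter plays in the Bayesian model: smaller lambda, more clusters.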
On Nov 2, 2011, at 6:31 PM, Jeff Eastman wrote:

> Another problem that has been noted before and not fixed is that sampling
> from the posterior of model distributions is done by copying the posterior
> model and not (is it Gibbs?) sampling of its parameters. As I understand it
> this is a maximum likelihood sampling hack that seems to work pretty well,
> but not true DPC. I wish I had a better understanding of this aspect.
>
> -----Original Message-----
> From: Frank Scholten [mailto:fr...@frankscholten.nl]
> Sent: Wednesday, November 02, 2011 3:11 PM
> To: dev@mahout.apache.org
> Subject: Re: Dirichlet
>
> On Wed, Nov 2, 2011 at 11:05 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> I have done some testing and have been unable to demonstrate a big
>> difference in allocating versus re-using. Re-using is, however, *really*
>> error prone.
>>
>> I think that most of the supposed cost of new allocations is actually the
>> cost of copying large data rather than the cost of allocating the
>> container. Here, the largest copy is the new DenseVector.
>>
>> All of these pale behind bad arithmetic and no combiner.
>
> Yeah, makes sense.
>
>> On Wed, Nov 2, 2011 at 2:37 PM, Frank Scholten <fr...@frankscholten.nl> wrote:
>>
>>> Maybe not a major thing, but in the DirichletMapper I see that
>>> Writables are not reused but new-ed.
>>>
>>> Line 44: context.write(new Text(String.valueOf(k)), v);
>>>
>>> and in the for loop in the setup method:
>>>
>>> Line 58: context.write(new Text(Integer.toString(i)), new
>>> VectorWritable(new DenseVector(0)));
>>>
>>> See
>>> http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
>>>
>>> Frank
>>>
>>> On Wed, Nov 2, 2011 at 10:13 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>>>> Tim Potter and I have tried running Dirichlet in the past on the ASF
>>>> email set on EC2 and it didn't seem to scale all that well, so I was
>>>> wondering if people had ideas on improving its speed.
>>>> One question I had is whether we could inject a Combiner into the
>>>> process? Ted also mentioned that there might be faster ways to check
>>>> the models, but I will ask him to elaborate.
>>>>
>>>> Thanks,
>>>> Grant

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
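On Jeff's point above about copying the posterior model rather than sampling its parameters: for a conjugate Normal mean with known variance, the difference is just whether you take the posterior mean (the "hack") or actually draw the parameter from the posterior (a Gibbs step). A toy sketch, not Mahout code; the model, class name, and numbers are made up for illustration:

```java
import java.util.Random;

public class PosteriorDraw {
    // Posterior for a Normal mean with known variance and a Normal prior:
    //   prior       mu  ~ N(mu0, tau0^2)
    //   likelihood  x_i ~ N(mu, sigma^2)
    // Returns { posterior mean, posterior standard deviation }.
    static double[] posterior(double mu0, double tau0, double sigma,
                              double[] x) {
        double n = x.length, sum = 0;
        for (double v : x) sum += v;
        double prec = 1 / (tau0 * tau0) + n / (sigma * sigma);
        double muN = (mu0 / (tau0 * tau0) + sum / (sigma * sigma)) / prec;
        return new double[] { muN, Math.sqrt(1 / prec) };
    }

    public static void main(String[] args) {
        double[] x = { 4.9, 5.1, 5.0, 5.2 };
        double[] post = posterior(0.0, 10.0, 1.0, x);

        // "Copy the posterior" shortcut: just keep the posterior mean.
        double hack = post[0];

        // True Gibbs step: draw the parameter from the posterior.
        Random rng = new Random(42);
        double gibbs = post[0] + post[1] * rng.nextGaussian();

        System.out.println("hack=" + hack + " gibbs=" + gibbs);
    }
}
```

With lots of data per cluster the posterior is tight and the two behave almost identically, which may be why the shortcut "seems to work pretty well"; with sparse clusters the missing posterior variance is exactly what the Gibbs draw would contribute.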