Funny timing given our discussion. This just landed in my Twitter inbox, thanks to @ogrisel:
"Revisiting k-means: New Algorithms via Bayesian Nonparametrics"  
http://arxiv.org/abs/1111.0352


On Nov 2, 2011, at 6:31 PM, Jeff Eastman wrote:

> Another problem that has been noted before and not fixed is that sampling
> from the posterior of model distributions is done by copying the posterior
> model and not by (is it Gibbs?) sampling its parameters. As I understand it,
> this is a maximum likelihood sampling hack that seems to work pretty well
> but is not true DPC. I wish I had a better understanding of this aspect.
> 
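For what it's worth, here is a minimal sketch of the distinction as I read it, not the Mahout code. It assumes a 1-D Gaussian cluster with known noise variance and a Normal prior on the cluster mean, so the posterior over the mean has a closed form. The class and method names (PosteriorSamplingSketch, copyEstimate, sampleEstimate) are made up for illustration. The "copy" shortcut is modeled here as taking the posterior's central value, which is my assumption about what the hack amounts to.

```java
import java.util.Random;

/**
 * Illustration only: a 1-D Gaussian cluster with known noise variance and a
 * Normal prior on its mean. Not Mahout code; all names are hypothetical.
 */
final class PosteriorSamplingSketch {

  /** Posterior over the cluster mean: Normal(mean, variance). */
  static final class Posterior {
    final double mean;
    final double variance;

    Posterior(double mean, double variance) {
      this.mean = mean;
      this.variance = variance;
    }
  }

  /** Conjugate Normal-Normal update for the cluster mean. */
  static Posterior posterior(double[] xs, double priorMean,
                             double priorVar, double noiseVar) {
    double sum = 0.0;
    for (double x : xs) {
      sum += x;
    }
    double precision = 1.0 / priorVar + xs.length / noiseVar;
    double mean = (priorMean / priorVar + sum / noiseVar) / precision;
    return new Posterior(mean, 1.0 / precision);
  }

  /** "Copy" shortcut: use the posterior's central value as the new parameter. */
  static double copyEstimate(Posterior p) {
    return p.mean;
  }

  /** Gibbs-style step: draw the new parameter from the posterior itself. */
  static double sampleEstimate(Posterior p, Random rng) {
    return p.mean + Math.sqrt(p.variance) * rng.nextGaussian();
  }

  public static void main(String[] args) {
    Posterior p = posterior(new double[] {1.0, 2.0, 3.0}, 0.0, 1.0, 1.0);
    Random rng = new Random(42L);
    System.out.println("copied parameter:  " + copyEstimate(p));
    System.out.println("sampled parameter: " + sampleEstimate(p, rng));
  }
}
```

The copy always returns the same value for the same data, while the sampled parameter varies from iteration to iteration by an amount governed by the posterior variance. That variation shrinks as a cluster accumulates points, which may be part of why the shortcut seems to work pretty well in practice.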
> -----Original Message-----
> From: Frank Scholten [mailto:fr...@frankscholten.nl] 
> Sent: Wednesday, November 02, 2011 3:11 PM
> To: dev@mahout.apache.org
> Subject: Re: Dirchlet
> 
> On Wed, Nov 2, 2011 at 11:05 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> I have done some testing and have been unable to demonstrate a big
>> difference in allocating versus re-using.  Re-using is, however, *really*
>> error prone.
>> 
>> I think that most of the supposed cost of new allocations is actually the
>> cost of copying of large data rather than the cost of allocating the
>> container.  Here, the largest copy is the new DenseVector.
>> 
>> All of these pale beside bad arithmetic and the lack of a combiner.
> 
> Yeah, makes sense.
> 
>> 
>> On Wed, Nov 2, 2011 at 2:37 PM, Frank Scholten <fr...@frankscholten.nl> wrote:
>> 
>>> Maybe not a major thing but in the DirichletMapper I see that
>>> Writables are not reused but new-ed
>>> 
>>> Line 44: context.write(new Text(String.valueOf(k)), v);
>>> 
>>> and in the for loop in the setup method
>>> 
>>> Line 58: context.write(new Text(Integer.toString(i)), new VectorWritable(new DenseVector(0)));
>>> 
>>> See
>>> http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
>>> 
>>> Frank
>>> 
>>> On Wed, Nov 2, 2011 at 10:13 PM, Grant Ingersoll <gsing...@apache.org>
>>> wrote:
>>>> Tim Potter and I have tried running Dirichlet in the past on the ASF
>>>> email set on EC2 and it didn't seem to scale all that well, so I was
>>>> wondering if people had ideas on improving its speed.  One question I had
>>>> is whether we could inject a Combiner into the process?  Ted also mentioned
>>>> that there might be faster ways to check the models, but I will ask him to
>>>> elaborate.
>>>> 
>>>> Thanks,
>>>> Grant
>>> 
>> 
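To make the reuse-versus-allocate trade-off concrete, here is a small sketch in plain Java. MutableKey is a hypothetical stand-in for a mutable Writable such as Text, not the Hadoop class, and the method names are made up for illustration. It shows the pattern Frank describes (one instance, reset per record) next to the way it goes wrong, which is the "error prone" part Ted mentions: if anything holds on to the reference, every retained entry ends up pointing at the same object.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only; MutableKey is a hypothetical stand-in, not Hadoop's Text. */
final class WritableReuseSketch {

  /** Minimal mutable holder, playing the role of a reusable Writable. */
  static final class MutableKey {
    private String value = "";

    void set(String v) {
      value = v;
    }

    String get() {
      return value;
    }
  }

  /** Allocates a fresh key per record: more garbage, but no aliasing. */
  static List<MutableKey> allocateEach(int n) {
    List<MutableKey> out = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      MutableKey k = new MutableKey();
      k.set(Integer.toString(i));
      out.add(k);
    }
    return out;
  }

  /** Reuses one key but retains the reference: all entries alias one object. */
  static List<MutableKey> reuseAndRetain(int n) {
    MutableKey k = new MutableKey();
    List<MutableKey> out = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      k.set(Integer.toString(i));
      out.add(k); // bug: stores the shared instance, not a snapshot of it
    }
    return out;
  }

  /** Reuses one key safely by copying the value out before the next set(). */
  static List<String> reuseAndCopy(int n) {
    MutableKey k = new MutableKey();
    List<String> out = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      k.set(Integer.toString(i));
      out.add(k.get()); // the consumer takes its own copy immediately
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println("fresh:    " + allocateEach(3).get(0).get());
    System.out.println("retained: " + reuseAndRetain(3).get(0).get());
    System.out.println("copied:   " + reuseAndCopy(3));
  }
}
```

As I understand it, a mapper's context.write serializes the key and value at call time, which corresponds to the reuseAndCopy case, so reuse there should be safe. The retained-reference case is the one to watch for anywhere objects are buffered in memory.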

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com


