Fabulous! Some details in-line.
On Mon, Jun 8, 2009 at 7:06 AM, Sebastien Bratieres <sb...@cam.ac.uk> wrote:

> - most importantly, the parameter re-estimation step currently is a maximum
> likelihood re-estimation. So the algorithm is not guaranteed to do actual
> training/converge/work at all. In order to do Gibbs sampling, proper
> Bayesian re-estimation is required.

Thanks. I had a suspicion about this, but never had time to check in detail.

> - the non-parametric aspect is lost by requiring the max number of clusters
> and the alpha_0 parameters as inputs to the algorithm; the whole point of
> the DPMM is that the number of clusters is not predefined, but given by the
> data. I'd like to fix this.

This is only partially true. By setting the max number of clusters to a large number, you get a good approximation to the truly non-parametric solution. Setting alpha_0 is more of a hack. It would be better to sample it to provide a full Gibbs sampler, but this is not all that critical in many situations, since we can also do MAP estimation of good values for alpha_0 and use that.

> - the normalization procedures for normalizing a vector of probabilities
> divide each component by the max of the components, instead of their sum

Good point.

> - generalize probability distributions in dimension and priors, e.g. the
> normal seems limited to a 2D µ=(0,0), sigma=1 prior

Also a good idea. I had thought that the normal distribution was only an example. It should definitely be generalized if not. Commons Math has the beginnings of a good abstraction for distributions. I would recommend working with them.

> - change terminology to "standard" machine learning terms, add a
> mathematical description of what's happening in the algorithm

Fabulous.

> What do you think?

Go for it. The best next step is to file several JIRA issues and start adding patches. Mahout moves very fast, so good patches will get constructive criticism or be committed very quickly.

-- 
Ted Dunning, CTO DeepDyve
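
[Editor's sketch for the re-estimation point above: one way a proper Gibbs update differs from maximum-likelihood re-estimation. This assumes, purely for illustration, a univariate Gaussian likelihood with known variance and a conjugate Normal prior on the cluster mean; the class and method names are hypothetical, not Mahout's actual API.]

import java.util.Random;

/**
 * Conjugate Gibbs update for one cluster's mean (univariate Gaussian
 * likelihood with known variance, Normal prior on the mean). A maximum
 * likelihood re-estimation would just set the mean to the sample mean;
 * a Gibbs sampler draws the mean from its full conditional posterior.
 */
public class MeanResampler {
  private final Random rand = new Random();

  /**
   * @param data      points currently assigned to the cluster
   * @param sigma2    known observation variance
   * @param priorMean mean of the Normal prior on the cluster mean
   * @param priorVar  variance of the Normal prior on the cluster mean
   * @return a draw from the posterior of the cluster mean
   */
  public double sampleMean(double[] data, double sigma2,
                           double priorMean, double priorVar) {
    int n = data.length;
    double sum = 0;
    for (double x : data) {
      sum += x;
    }
    // Posterior precision is the sum of the prior and data precisions.
    double posteriorPrecision = 1.0 / priorVar + n / sigma2;
    double posteriorVar = 1.0 / posteriorPrecision;
    // Posterior mean is a precision-weighted blend of prior mean and data.
    double posteriorMean = posteriorVar * (priorMean / priorVar + sum / sigma2);
    return posteriorMean + Math.sqrt(posteriorVar) * rand.nextGaussian();
  }
}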
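
[Editor's note on the truncation point: the usual justification, not spelled out in the thread, is the truncated stick-breaking construction of the DP. With truncation level K the mixture weights are

    v_k \sim \mathrm{Beta}(1, \alpha_0), \quad k = 1, \dots, K-1, \qquad v_K = 1,
    \qquad \pi_k = v_k \prod_{j=1}^{k-1} (1 - v_j),

and as K grows the truncated model converges to the full Dirichlet process mixture, which is why a large max number of clusters approximates the non-parametric solution.]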
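
[Editor's sketch for the normalization point: a minimal example of normalizing a vector of non-negative weights so it sums to one. This is a hypothetical helper, not the actual Mahout method.]

/**
 * Normalize non-negative weights into a probability distribution.
 * Dividing by the maximum (the current behavior) only scales the largest
 * entry to 1.0; dividing by the sum makes the entries sum to 1.0, which is
 * what the cluster-assignment sampling step needs.
 */
public static double[] normalize(double[] weights) {
  double sum = 0;
  for (double w : weights) {
    sum += w;
  }
  double[] p = new double[weights.length];
  for (int i = 0; i < weights.length; i++) {
    p[i] = weights[i] / sum;
  }
  return p;
}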