Dear Mahout developers,

I am planning to contribute to the Dirichlet Process Clustering algorithm
implemented by Jeff (Eastman). I have read through the code in some detail,
and discussed a couple of points with Jeff already in order not to create a
mess. That way I could understand how the code originated and where I could
contribute.

Here I'd like to announce what I would like to do, in order to prompt your
feedback, before I create a JIRA issue and make patches.

Essentially I would like to repair those aspects of the code where the
algorithm is broken:
- most importantly, the parameter re-estimation step is currently a
maximum-likelihood re-estimation, so the algorithm is not guaranteed to
actually train, converge, or work at all. Proper Gibbs sampling requires a
Bayesian re-estimation step.
- the non-parametric aspect is lost by requiring the maximum number of
clusters and the alpha_0 parameter as inputs to the algorithm; the whole
point of the DPMM is that the number of clusters is not predefined but
inferred from the data. I'd like to fix this.
- the normalization procedures divide each component of a probability
vector by the max of the components instead of by their sum
- generalize the probability distributions in dimension and priors, e.g.
the normal distribution seems limited to a 2D µ=(0,0), sigma=1 prior
- change terminology to "standard" machine learning terms, add a
mathematical description of what's happening in the algorithm
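To make the normalization point concrete, here is a minimal sketch of the fix I have in mind (a hypothetical helper, not the actual Mahout code): each component of the probability vector must be divided by the sum of all components, not by their maximum, so that the result is a proper distribution summing to 1.

```java
import java.util.Arrays;

public class NormalizeSketch {

    // Normalize a vector of unnormalized probabilities into a proper
    // distribution: divide each component by the sum of all components
    // (the current code divides by the max instead, so the result does
    // not sum to 1).
    static double[] normalize(double[] pi) {
        double sum = 0.0;
        for (double p : pi) {
            sum += p;
        }
        double[] out = new double[pi.length];
        for (int i = 0; i < pi.length; i++) {
            out[i] = pi[i] / sum; // by the sum, not by max(pi)
        }
        return out;
    }

    public static void main(String[] args) {
        double[] p = normalize(new double[] {1.0, 3.0, 4.0});
        // components now sum to 1
        System.out.println(Arrays.toString(p)); // [0.125, 0.375, 0.5]
    }
}
```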

What do you think?

Thanks
Sebastien
