Dear Mahout developers, I am planning to contribute to the Dirichlet Process Clustering algorithm implemented by Jeff (Eastman). I have read through the code in some detail, and discussed a couple of points with Jeff already in order not to create a mess. That way I could understand how the code originated and where I could contribute.
Here I'd like to announce what I would like to do, in order to prompt your feedback before I create a JIRA issue and submit patches. Essentially, I would like to repair the aspects of the code where the algorithm is broken:

- Most importantly, the parameter re-estimation step is currently a maximum-likelihood re-estimation, so the algorithm is not guaranteed to actually train, converge, or work at all. In order to do Gibbs sampling, proper Bayesian re-estimation is required.
- The non-parametric aspect is lost by requiring the maximum number of clusters and the alpha_0 parameter as inputs to the algorithm; the whole point of the DPMM is that the number of clusters is not predefined but given by the data. I'd like to fix this.
- The procedure for normalizing a vector of probabilities divides each component by the max of the components instead of their sum.
- Generalize the probability distributions in dimension and priors; e.g., the normal distribution seems limited to a 2D mu = (0, 0), sigma = 1 prior.
- Change the terminology to "standard" machine learning terms, and add a mathematical description of what happens in the algorithm.

What do you think?

Thanks,
Sebastien
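P.S. To make the normalization point concrete, here is a minimal sketch (hypothetical code, not the actual Mahout implementation): dividing by the max leaves the components summing to more than 1, whereas dividing by the sum yields a proper probability distribution.

```java
// Hypothetical illustration of the normalization fix; not Mahout code.
public class Normalize {

  // Normalize unnormalized cluster probabilities so they sum to 1.
  static double[] normalize(double[] p) {
    double sum = 0.0;
    for (double v : p) {
      sum += v;              // total mass, NOT the max component
    }
    double[] out = new double[p.length];
    for (int i = 0; i < p.length; i++) {
      out[i] = p[i] / sum;   // components now sum to 1
    }
    return out;
  }

  public static void main(String[] args) {
    double[] p = normalize(new double[] {2.0, 1.0, 1.0});
    // Dividing by the max (2.0) would give {1.0, 0.5, 0.5}, which sums to 2.
    System.out.println(p[0] + " " + p[1] + " " + p[2]); // 0.5 0.25 0.25
  }
}
```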