Fabulous! Some details in-line.
On Mon, Jun 8, 2009 at 7:06 AM, Sebastien Bratieres <sb...@cam.ac.uk> wrote:

> - most importantly, the parameter re-estimation step currently is a maximum
> likelihood re-estimation. So the algorithm is not guaranteed to do actual
> training/converge/work at all. In order to do Gibbs sampling, proper
> Bayesian re-estimation is required.

Thanks. I had a suspicion about this, but never had time to check in detail.

> - the non-parametric aspect is lost by requiring the max number of clusters
> and the alpha_0 parameters as inputs to the algorithm; the whole point of
> the DPMM is that the number of clusters is not predefined, but given by the
> data. I'd like to fix this.

This is only partially true. By setting the max number of clusters to a large number, you get a good approximation to the truly non-parametric solution. Setting alpha_0 is more of a hack. It would be better to sample it to provide a full Gibbs sampler, but this is not all that critical in many situations, since we can also do MAP estimation of good values for alpha_0 and use that.

> - the normalization procedures for normalizing a vector of probabilities
> divide each component by the max of the components, instead of their sum

Good point.

> - generalize probability distributions in dimension and priors, e.g. the
> normal seems limited to a 2D µ=(0,0), sigma=1 prior

Also a good idea. I had thought that the normal distribution was only an example. It should definitely be generalized if not. Commons Math has the beginnings of a good abstraction for distributions. I would recommend working with them.

> - change terminology to "standard" machine learning terms, add a
> mathematical description of what's happening in the algorithm

Fabulous.

> What do you think?

Go for it. The best next step is to file several JIRA issues and start adding patches. Mahout moves very fast, so good patches will get constructive criticism or be committed very quickly.

-- 
Ted Dunning, CTO DeepDyve
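
[Editor's sketch for the re-estimation point above: one way a proper Gibbs update differs from maximum-likelihood re-estimation. This assumes, purely for illustration, a univariate Gaussian likelihood with known variance and a conjugate Normal prior on the cluster mean; the class and method names are hypothetical, not Mahout's actual API.]

import java.util.Random;

/**
 * Conjugate Gibbs update for one cluster's mean (univariate Gaussian
 * likelihood with known variance, Normal prior on the mean). A maximum
 * likelihood re-estimation would just set the mean to the sample mean;
 * a Gibbs sampler draws the mean from its full conditional posterior.
 */
public class MeanResampler {
  private final Random rand = new Random();

  /**
   * @param data      points currently assigned to the cluster
   * @param sigma2    known observation variance
   * @param priorMean mean of the Normal prior on the cluster mean
   * @param priorVar  variance of the Normal prior on the cluster mean
   * @return a draw from the posterior of the cluster mean
   */
  public double sampleMean(double[] data, double sigma2,
                           double priorMean, double priorVar) {
    int n = data.length;
    double sum = 0;
    for (double x : data) {
      sum += x;
    }
    // Posterior precision is the sum of the prior and data precisions.
    double posteriorPrecision = 1.0 / priorVar + n / sigma2;
    double posteriorVar = 1.0 / posteriorPrecision;
    // Posterior mean is a precision-weighted blend of prior mean and data.
    double posteriorMean = posteriorVar * (priorMean / priorVar + sum / sigma2);
    return posteriorMean + Math.sqrt(posteriorVar) * rand.nextGaussian();
  }
}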
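
[Editor's note on the truncation point: the usual justification, not spelled out in the thread, is the truncated stick-breaking construction of the DP. With truncation level K the mixture weights are

    v_k \sim \mathrm{Beta}(1, \alpha_0), \quad k = 1, \dots, K-1, \qquad v_K = 1,
    \qquad \pi_k = v_k \prod_{j=1}^{k-1} (1 - v_j),

and as K grows the truncated model converges to the full Dirichlet process mixture, which is why a large max number of clusters approximates the non-parametric solution.]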
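
[Editor's sketch for the normalization point: a minimal example of normalizing a vector of non-negative weights so it sums to one. This is a hypothetical helper, not the actual Mahout method.]

/**
 * Normalize non-negative weights into a probability distribution.
 * Dividing by the maximum (the current behavior) only scales the largest
 * entry to 1.0; dividing by the sum makes the entries sum to 1.0, which is
 * what the cluster-assignment sampling step needs.
 */
public static double[] normalize(double[] weights) {
  double sum = 0;
  for (double w : weights) {
    sum += w;
  }
  double[] p = new double[weights.length];
  for (int i = 0; i < weights.length; i++) {
    p[i] = weights[i] / sum;
  }
  return p;
}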