Bogdan,

The recent resolution of MAHOUT-251 should allow you to experiment with Dirichlet clustering of text models with arbitrary dimensionality. I suggest starting with the NormalModelDistribution with a large sparse vector as its prototype. The other model distributions create sampled values for all of the prior model's dimensions, which negates any benefit of using sparse vectors for their prototypes.
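To see why sampling every dimension defeats a sparse prototype, here is a small illustrative sketch (plain Java, not Mahout code; the names and the Gaussian prior are assumptions for illustration). Sampling a value for each of 50k dimensions leaves essentially every dimension nonzero, i.e. a dense model:

```java
import java.util.Random;

/**
 * Illustrative sketch: a prior that samples a value for every dimension
 * produces a dense model no matter how sparse the prototype was.
 */
public class DenseSampling {
  public static void main(String[] args) {
    int dims = 50000;                    // e.g. a 50k-term text model
    Random rnd = new Random(42);
    double[] sampled = new double[dims];
    int nonzero = 0;
    for (int i = 0; i < dims; i++) {
      sampled[i] = rnd.nextGaussian();   // one sampled value per dimension
      if (sampled[i] != 0.0) {
        nonzero++;
      }
    }
    // Virtually all 50k dimensions now carry a value.
    System.out.println(nonzero);
  }
}
```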

It may in fact be necessary to introduce a new ModelDistribution and Model so that sparse model elements do not fill up with insignificant values. After the first iteration computes the new posterior model parameters from the observations, many of these values will likely be small, so some heuristic would be needed to preserve model sparseness by removing them altogether. If all of these values are retained, it is probably better to use a dense vector representation: a 50k-dimensional model will be a real compute hog if it is not kept sparse somehow. Maybe sampleFromPosterior() or sample() would be good places to embed such a heuristic.
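One possible shape for that heuristic, sketched here with a plain Map standing in for a sparse vector (the class and method names are hypothetical, not Mahout API): drop any posterior entry whose magnitude falls below a threshold, so the model stays sparse between iterations.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/**
 * Hypothetical sketch of a sparseness-preserving heuristic: entries whose
 * absolute value falls below a threshold are removed from the posterior
 * model in place. A Map is used as a stand-in for a sparse vector.
 */
public class SparsePrune {

  /** Remove entries with |value| < threshold, in place. */
  static void prune(Map<Integer, Double> sparseModel, double threshold) {
    Iterator<Map.Entry<Integer, Double>> it = sparseModel.entrySet().iterator();
    while (it.hasNext()) {
      if (Math.abs(it.next().getValue()) < threshold) {
        it.remove();
      }
    }
  }

  public static void main(String[] args) {
    Map<Integer, Double> model = new HashMap<Integer, Double>();
    model.put(7, 0.90);
    model.put(1234, 1e-6);    // insignificant posterior value
    model.put(40321, 0.25);
    prune(model, 1e-3);
    System.out.println(model.size());            // 2
    System.out.println(model.containsKey(1234)); // false
  }
}
```

Something like this could be called at the end of sampleFromPosterior(), though choosing the threshold (absolute cutoff vs. a fraction of the largest component) would need experimentation.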

I'll begin writing some tests to experiment with these models.
