Bogdan,
The recent resolution of MAHOUT-251 should allow you to experiment with
Dirichlet clustering for text models with arbitrary dimensionality. I
suggest starting with the NormalModelDistribution with a large sparse
vector as its prototype. The other model distributions create sampled
values for all the prior model dimensions, negating any value of using
sparse vectors for their prototypes.
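To make the sparsity concern concrete, here is a hypothetical sketch (not Mahout code; a plain HashMap stands in for a sparse vector, and both method names are my invention) contrasting a distribution that samples a value for every prior dimension with one that samples only at the prototype's non-zero indices:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class DensitySketch {
  static final int CARDINALITY = 50_000;

  // Mimics a distribution that samples every dimension of the prior:
  // the result has CARDINALITY non-zero entries, i.e. it is fully dense.
  static Map<Integer, Double> sampleAllDimensions(Random rnd) {
    Map<Integer, Double> v = new HashMap<>();
    for (int i = 0; i < CARDINALITY; i++) {
      v.put(i, rnd.nextGaussian());
    }
    return v;
  }

  // Mimics a distribution that reuses the prototype's structure:
  // only the prototype's non-zero indices receive sampled values.
  static Map<Integer, Double> samplePrototypeDimensions(
      Map<Integer, Double> prototype, Random rnd) {
    Map<Integer, Double> v = new HashMap<>();
    for (Integer i : prototype.keySet()) {
      v.put(i, rnd.nextGaussian());
    }
    return v;
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    Map<Integer, Double> prototype = new HashMap<>();
    prototype.put(3, 1.0);
    prototype.put(77, 2.0);
    System.out.println(sampleAllDimensions(rnd).size());                  // 50000
    System.out.println(samplePrototypeDimensions(prototype, rnd).size()); // 2
  }
}
```

The second variant is what a sparse-friendly ModelDistribution would need to do: leave the untouched dimensions absent rather than materializing a sample for each one.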
It may in fact be necessary to introduce a new ModelDistribution and
Model so that sparse model elements will not fill up with insignificant
values. After the first iteration computes the new posterior model
parameters from the observations, many of these values will likely be
small, so some heuristic would be needed to preserve model sparseness by
removing them altogether. If all these values are retained, it is
probably better to use a dense vector representation. A 50k-dimensional
model will be a real compute hog if it is not kept sparse somehow. Maybe
sampleFromPosterior() or sample() would be good places to embed this
heuristic.
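As a rough sketch of what such a heuristic might look like (plain Java, with a HashMap standing in for a sparse vector; the prune method and its relative-threshold rule are assumptions of mine, not anything in Mahout), the posterior update could drop entries whose magnitude falls below some fraction of the largest entry:

```java
import java.util.HashMap;
import java.util.Map;

public class SparsePosterior {
  // Drop entries whose magnitude is below relativeThreshold times the
  // largest magnitude in the posterior; everything else is kept as-is.
  static Map<Integer, Double> prune(Map<Integer, Double> posterior,
                                    double relativeThreshold) {
    double max = 0.0;
    for (double v : posterior.values()) {
      max = Math.max(max, Math.abs(v));
    }
    double cutoff = max * relativeThreshold;
    Map<Integer, Double> result = new HashMap<>();
    for (Map.Entry<Integer, Double> e : posterior.entrySet()) {
      if (Math.abs(e.getValue()) >= cutoff) {
        result.put(e.getKey(), e.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<Integer, Double> posterior = new HashMap<>();
    posterior.put(0, 5.0);
    posterior.put(1, 0.001);  // insignificant; should be pruned
    posterior.put(2, 4.2);
    System.out.println(prune(posterior, 0.01).size()); // 2
  }
}
```

Something along these lines, called at the end of sampleFromPosterior() or sample(), would keep a 50k-dimensional model from densifying across iterations; the right threshold would have to be found experimentally.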
I'll begin writing some tests to experiment with these models.