Bogdan,
The recent resolution of MAHOUT-251 should allow you to experiment with
Dirichlet clustering for text models with arbitrary dimensionality. I
suggest starting with the NormalModelDistribution with a large sparse
vector as its prototype. The other model distributions create sampled
values for all the prior model dimensions, negating any value of using
sparse vectors for their prototypes.
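To make the sparsity concern concrete, here is a hypothetical sketch (not Mahout code; a plain HashMap stands in for a sparse vector, and both method names are my invention) contrasting a distribution that samples a value for every prior dimension with one that samples only at the prototype's non-zero indices:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class DensitySketch {
  static final int CARDINALITY = 50_000;

  // Mimics a distribution that samples every dimension of the prior:
  // the result has CARDINALITY non-zero entries, i.e. it is fully dense.
  static Map<Integer, Double> sampleAllDimensions(Random rnd) {
    Map<Integer, Double> v = new HashMap<>();
    for (int i = 0; i < CARDINALITY; i++) {
      v.put(i, rnd.nextGaussian());
    }
    return v;
  }

  // Mimics a distribution that reuses the prototype's structure:
  // only the prototype's non-zero indices receive sampled values.
  static Map<Integer, Double> samplePrototypeDimensions(
      Map<Integer, Double> prototype, Random rnd) {
    Map<Integer, Double> v = new HashMap<>();
    for (Integer i : prototype.keySet()) {
      v.put(i, rnd.nextGaussian());
    }
    return v;
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    Map<Integer, Double> prototype = new HashMap<>();
    prototype.put(3, 1.0);
    prototype.put(77, 2.0);
    System.out.println(sampleAllDimensions(rnd).size());                  // 50000
    System.out.println(samplePrototypeDimensions(prototype, rnd).size()); // 2
  }
}
```

The second variant is what a sparse-friendly ModelDistribution would need to do: leave the untouched dimensions absent rather than materializing a sample for each one.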
It may in fact be necessary to introduce a new ModelDistribution and
Model so that sparse model elements will not fill up with insignificant
values. After the first iteration computes the new posterior model
parameters from the observations, many of these values will likely be
small, so some heuristic would be needed to preserve model sparseness by
removing them altogether. If all these values are retained, it is
probably better to use a dense vector representation. A 50k-dimensional
model will be a real compute hog if it is not kept sparse somehow. Maybe
sampleFromPosterior() or sample() would be good places to embed this
heuristic.
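As a rough sketch of what such a heuristic might look like (plain Java, with a HashMap standing in for a sparse vector; the prune method and its relative-threshold rule are assumptions of mine, not anything in Mahout), the posterior update could drop entries whose magnitude falls below some fraction of the largest entry:

```java
import java.util.HashMap;
import java.util.Map;

public class SparsePosterior {
  // Drop entries whose magnitude is below relativeThreshold times the
  // largest magnitude in the posterior; everything else is kept as-is.
  static Map<Integer, Double> prune(Map<Integer, Double> posterior,
                                    double relativeThreshold) {
    double max = 0.0;
    for (double v : posterior.values()) {
      max = Math.max(max, Math.abs(v));
    }
    double cutoff = max * relativeThreshold;
    Map<Integer, Double> result = new HashMap<>();
    for (Map.Entry<Integer, Double> e : posterior.entrySet()) {
      if (Math.abs(e.getValue()) >= cutoff) {
        result.put(e.getKey(), e.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<Integer, Double> posterior = new HashMap<>();
    posterior.put(0, 5.0);
    posterior.put(1, 0.001);  // insignificant; should be pruned
    posterior.put(2, 4.2);
    System.out.println(prune(posterior, 0.01).size()); // 2
  }
}
```

Something along these lines, called at the end of sampleFromPosterior() or sample(), would keep a 50k-dimensional model from densifying across iterations; the right threshold would have to be found experimentally.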
I'll begin writing some tests to experiment with these models.