Well, dimensions - I am just using a slightly modified version of the LuceneDriver (I added stopword removal and regex-based removal of incoming terms), so I guess it is just a list of one-dimensional vectors of random length. I will try to run the new code tomorrow.
On Mon, Jan 18, 2010 at 10:18 PM, Jeff Eastman <[email protected]> wrote:

> Yes, they're all in trunk. Just do an svn update and mvn install to get
> them.
>
> BTW, what's the dimensionality of your data?
>
> Jeff
>
> Bogdan Vatkov wrote:
>
>> Hi Jeff,
>>
>> I will try with the NormalModelDistribution, but I am wondering how to
>> obtain "MAHOUT-251". Is this a tag in the SVN? How can I get the source
>> containing the changes? Do I simply sync from trunk? I suppose I have
>> to run mvn install after that, right?
>>
>> Best regards,
>> Bogdan
>>
>> On Mon, Jan 18, 2010 at 9:53 PM, Jeff Eastman <[email protected]> wrote:
>>
>>> Bogdan,
>>>
>>> The recent resolution of MAHOUT-251 should allow you to experiment
>>> with Dirichlet clustering for text models of arbitrary dimensionality.
>>> I suggest starting with the NormalModelDistribution with a large
>>> sparse vector as its prototype. The other model distributions create
>>> sampled values for all the prior model dimensions, negating any value
>>> of using sparse vectors for their prototypes.
>>>
>>> It may in fact be necessary to introduce a new ModelDistribution and
>>> Model so that sparse model elements will not fill up with
>>> insignificant values. After the first iteration computes the new
>>> posterior model parameters from the observations, many of these values
>>> will likely be small, so some heuristic would be needed to preserve
>>> model sparseness by removing them altogether. If all these values are
>>> retained, it is probably better to use a dense vector representation.
>>> A 50k-dimensional model will be a real compute hog if it is not kept
>>> sparse somehow. Maybe sampleFromPosterior() or sample() would be good
>>> places to embed this heuristic.
>>>
>>> I'll begin writing some tests to experiment with these models.

--
Best regards,
Bogdan
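For anyone following along, the pruning heuristic Jeff describes above might be sketched roughly as follows. This is a hypothetical, simplified illustration using a plain map as the sparse vector; `SparsePruneSketch`, `prune`, and the threshold value are all made up for the example and are not Mahout's actual API.

```java
import java.util.HashMap;
import java.util.Map;

public class SparsePruneSketch {

  // After an iteration updates the posterior model parameters, drop
  // entries whose magnitude falls below a threshold so the model stays
  // sparse instead of filling up with insignificant values.
  static Map<Integer, Double> prune(Map<Integer, Double> posterior,
                                    double threshold) {
    Map<Integer, Double> pruned = new HashMap<>();
    for (Map.Entry<Integer, Double> e : posterior.entrySet()) {
      if (Math.abs(e.getValue()) >= threshold) {
        pruned.put(e.getKey(), e.getValue());
      }
    }
    return pruned;
  }

  public static void main(String[] args) {
    // A toy posterior: one entry picked up a tiny, insignificant value.
    Map<Integer, Double> posterior = new HashMap<>();
    posterior.put(0, 0.9);
    posterior.put(17, 1e-6);
    posterior.put(42, 0.3);

    Map<Integer, Double> sparse = prune(posterior, 1e-3);
    System.out.println(sparse.size()); // prints 2
  }
}
```

In Mahout terms, something like this could presumably live inside sampleFromPosterior() or sample(), as Jeff suggests, so pruning happens once per iteration rather than on every access.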
