I think you will need to bound your model dimensionality to use Dirichlet. If you are using TF-IDF vectors to represent your documents, I would think these would all have the same maximum cardinality, which you could specify as the modelPrototype size. I just committed a new model distribution (SparseNormalModelDistribution) whose sampleFromPosterior() includes a heuristic that removes small mean element values to preserve model sparseness. It's probably bogus, but it's a place to begin.
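
The gist of that heuristic is just to drop near-zero posterior means instead of storing them. Here is a minimal plain-Java sketch of the idea, using a map to stand in for a sparse vector (the committed code uses the Mahout vector classes and may pick a different cutoff):

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Illustrative only: a Map stands in for a sparse vector, and the cutoff
// value is an arbitrary assumption, not what the committed code uses.
public class SparsePruningSketch {

  private static final double MIN_MEAN = 1.0e-4; // assumed cutoff

  // Remove posterior mean entries too small to matter, keeping the model sparse.
  static void pruneSmallMeans(Map<Integer, Double> posteriorMeans) {
    for (Iterator<Map.Entry<Integer, Double>> it = posteriorMeans.entrySet().iterator();
         it.hasNext();) {
      if (Math.abs(it.next().getValue()) < MIN_MEAN) {
        it.remove();
      }
    }
  }

  public static void main(String[] args) {
    Map<Integer, Double> means = new HashMap<Integer, Double>();
    means.put(3, 0.72);
    means.put(17, 0.00001); // insignificant; will be pruned
    pruneSmallMeans(means);
    System.out.println(means); // prints {3=0.72}
  }
}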

I have also written one new unit test that runs in memory over a small 50-d sparse model and 100 50-d sparse sample vectors. It does not explode.
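
For a feel of the scale involved, the test data amounts to something like this (a plain-Java sketch; the real test builds Mahout sparse vectors, and the choice of three non-zero terms per point here is just an assumption):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative only: generates 100 sparse 50-d points as maps; the actual
// unit test uses the Mahout vector classes.
public class SampleDataSketch {

  static List<Map<Integer, Double>> makeSamples(int count, int dims, Random rng) {
    List<Map<Integer, Double>> samples = new ArrayList<Map<Integer, Double>>();
    for (int i = 0; i < count; i++) {
      Map<Integer, Double> point = new HashMap<Integer, Double>();
      for (int j = 0; j < 3; j++) { // a few non-zero terms per point (assumed)
        point.put(rng.nextInt(dims), rng.nextDouble());
      }
      samples.add(point);
    }
    return samples;
  }

  public static void main(String[] args) {
    List<Map<Integer, Double>> data = makeSamples(100, 50, new Random(42));
    System.out.println(data.size() + " sparse 50-d sample points");
  }
}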

Just do another svn update before you begin, to pick up those changes.


Bogdan Vatkov wrote:
Well, dimensions: I am just using a slightly modified version of LuceneDriver (with stopword removal and regex filtering of incoming terms added), so I guess it is just a list of unidimensional vectors of varying length.
I will try to run the new code tomorrow.

On Mon, Jan 18, 2010 at 10:18 PM, Jeff Eastman <[email protected]> wrote:

Yes, they're all in trunk. Just do an svn update and mvn install to get them.

BTW, what's the dimensionality of your data?

Jeff



Bogdan Vatkov wrote:

Hi Jeff,

I will try with the NormalModelDistribution, but I am wondering how to obtain "MAHOUT-251". Is this a tag in SVN, or what is it? How can I get the source containing the changes? Do I simply sync from trunk? I suppose I have to run mvn install after that, right?

Best regards,
Bogdan

On Mon, Jan 18, 2010 at 9:53 PM, Jeff Eastman <[email protected]> wrote:

Bogdan,

Recent resolution of MAHOUT-251 should allow you to experiment with Dirichlet clustering for text models with arbitrary dimensionality. I suggest starting with the NormalModelDistribution with a large sparse vector as its prototype. The other model distributions create sampled values for all the prior model dimensions, negating any value of using sparse vectors for their prototypes.
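
Roughly, what I have in mind is something along these lines; the class and constructor names here are from memory of the trunk and may not match exactly, so treat it as a sketch rather than working code:

// Sketch only: assumes the post-MAHOUT-251 trunk, where a model distribution
// takes a modelPrototype; exact package and constructor names may differ.
import org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.VectorWritable;

public class PrototypeSketch {
  public static void main(String[] args) {
    // An empty sparse prototype fixes the cardinality (50k here) without
    // allocating 50k doubles.
    RandomAccessSparseVector prototype = new RandomAccessSparseVector(50000);
    NormalModelDistribution dist =
        new NormalModelDistribution(new VectorWritable(prototype));
    // dist would then be handed to the Dirichlet driver/clusterer as usual.
  }
}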

It may in fact be necessary to introduce a new ModelDistribution and Model so that sparse model elements will not fill up with insignificant values. After the first iteration computes the new posterior model parameters from the observations, many of these values will likely be small, so some heuristic would be needed to preserve model sparseness by removing them altogether. If all these values are retained, it is probably better to use a dense vector representation. A 50k-dimensional model will be a real compute hog if it is not kept sparse somehow. Maybe sampleFromPosterior() or sample() would be good places to embed this heuristic.

I'll begin writing some tests to experiment with these models.