I don't think that there is a single best approach.  Here, in rough order,
are the approaches that I would try:

1) carrot2 may be able to handle a few thousand documents.  If so, you
should be in pretty good shape: its clustering is typically quite good, and
it looks even better than it is because carrot2 makes inspectability a
priority.  According to Dawid, though, carrot2 is not designed to scale to
very large collections.

2) k-means on tf-idf weighted words and bigrams, especially with the
k-means++ starting point that isn't available yet.  This should also be
pretty good, but may look a bit worse than it really is because it doesn't
use the presentation tricks that carrot2 uses.
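
For concreteness, here is roughly what (2) looks like in scikit-learn
terms (a sketch only; the parameters and toy documents are mine, and
Mahout's API is of course different):

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # stand-in documents; use your real collection here
    docs = [
        "apache mahout scalable machine learning",
        "mahout clustering with k-means and canopy",
        "lucene full text search and indexing",
        "lucene query parsing and analyzers",
    ]

    # tf-idf over unigrams and bigrams, as in (2)
    X = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True).fit_transform(docs)

    # k-means with the k-means++ seeding
    km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X)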

3) k-means on SVD representations of documents.  Same comments as (2) except
that this option depends on *two* pieces of unreleased code (k-means++ and
SVD) instead of one.  At this size you should be able to avoid the mega-SVD
that Jake just posted, but that means you will need to write your own glue
between the document vectorizer and a normal in-memory SVD.  You may see
some cooler results here in that documents that share no words can still be
recognized as similar, but I expect the overall results to be very similar
to (2).
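
A matching sketch for (3), with scikit-learn's TruncatedSVD standing in
for the unreleased Mahout SVD (again, everything here is illustrative):

    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import Normalizer

    docs = [
        "apache mahout scalable machine learning",
        "mahout clustering with k-means and canopy",
        "lucene full text search and indexing",
        "lucene query parsing and analyzers",
    ]

    X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

    # project into a low-rank concept space; something like 100-300
    # dimensions would be more typical on real data
    svd = TruncatedSVD(n_components=2, random_state=0)
    X_lsa = Normalizer(copy=False).fit_transform(svd.fit_transform(X))

    labels = KMeans(n_clusters=2, init="k-means++", n_init=10,
                    random_state=0).fit_predict(X_lsa)

The SVD step is what lets documents that share no words end up near each
other in the reduced space.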

4) if you aren't happy by now, invoke plan B and come up with new ideas.
These might include raw LDA, k-means on LDA document vectors, and k-means
with norms other than L_2 and L_1.
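
If you get to plan B, the LDA-then-k-means variant would look something
like this (scikit-learn's LatentDirichletAllocation as a stand-in; this is
only meant to show the shape of the pipeline, not Mahout's LDA):

    from sklearn.cluster import KMeans
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "apache mahout scalable machine learning",
        "mahout clustering with k-means and canopy",
        "lucene full text search and indexing",
        "lucene query parsing and analyzers",
    ]

    # LDA wants raw term counts, not tf-idf
    counts = CountVectorizer().fit_transform(docs)

    # per-document topic mixtures serve as the "LDA document vectors"
    theta = LatentDirichletAllocation(n_components=2,
                                      random_state=0).fit_transform(counts)

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(theta)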

All of this changes a bit if you have some labeled documents.  K-means
should be pretty easy to extend to handle that, and it can dramatically
improve results.
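
One easy way to use the labels is to seed the k-means centroids from the
labeled documents, something like the sketch below (a full seeded k-means
would also pin the labeled points to their clusters; this just shows the
seeding):

    import numpy as np
    from sklearn.cluster import KMeans

    def seeded_kmeans(X, y, n_clusters):
        # X: dense (n_docs, n_features) array; y: class labels, -1 if unlabeled.
        # Start each centroid at the mean of the documents labeled with that
        # class instead of at random points.
        seeds = np.vstack([X[y == c].mean(axis=0) for c in range(n_clusters)])
        km = KMeans(n_clusters=n_clusters, init=seeds, n_init=1)
        return km.fit_predict(X)

    # tiny demo: two labeled points steer two obvious clusters
    X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [4.9, 5.1]])
    y = np.array([0, -1, 1, -1])
    print(seeded_kmeans(X, y, 2))  # -> [0 0 1 1]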


On Wed, Feb 10, 2010 at 6:04 PM, Ken Krugler <[email protected]> wrote:

> Given the code currently in Mahout (+ Lucene), is there a generally accepted
> best approach for clustering of documents?




-- 
Ted Dunning, CTO
DeepDyve
