I don't think there is a single best approach. Here, in rough order, are the approaches I would try:
1) carrot2 may be able to handle a few thousand documents. If so, you should be pretty well off, because its clustering is typically pretty good, and it shows even better than it is because it makes inspectability a priority. carrot2 is not, however, designed to scale to very large collections, according to Dawid.

2) k-means on tf-idf weighted words and bigrams, especially with the k-means++ starting point that isn't available yet. This should also be pretty good, but may show a bit worse than it really is because it doesn't use the tricks carrot uses. (A rough sketch of this option appears at the end of this message.)

3) k-means on SVD representations of documents. Same comments as (2), except that this option depends on *two* pieces of unreleased code (k-means++ and SVD) instead of one. At this size, you should be able to avoid using the mega-SVD that Jake just posted, but that means you will need to write your own glue between the document vectorizer and a normal in-memory SVD. You may get some cooler results here, in that documents that share no words might be seen as similar, but I expect the overall results to be very similar to those for (2).

4) If you aren't happy by now, invoke plan B and come up with new ideas. These might include raw LDA, k-means on LDA document vectors, and k-means with norms other than L_2 and L_1.

All of this changes a bit if you have some labeled documents. K-means should be pretty easy to extend to deal with that, and it can dramatically improve results.

On Wed, Feb 10, 2010 at 6:04 PM, Ken Krugler <[email protected]> wrote:

> Given the code currently in Mahout (+ Lucene), is there a generally
> accepted best approach for clustering of documents?

-- 
Ted Dunning, CTO
DeepDyve
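P.S. For anybody who wants to see what option (2) amounts to in code, here is a very rough, self-contained sketch: tf-idf weighting of bag-of-words counts followed by plain k-means with random seeding. This is ordinary Java rather than the Mahout API, the class and method names (TfIdfKMeansSketch, tfidf, kmeans) are invented for illustration, and k-means++ would replace the random initialization used here.

// Rough sketch of option (2): tf-idf weighting followed by plain
// Lloyd-style k-means.  Not the Mahout API; names are made up for
// illustration, and k-means++ would replace the random seeding.
import java.util.*;

public class TfIdfKMeansSketch {

  // docs: one map per document, term -> raw count; vocab: fixed term list
  static double[][] tfidf(List<Map<String, Integer>> docs, List<String> vocab) {
    int n = docs.size(), m = vocab.size();
    double[] df = new double[m];                        // document frequencies
    for (Map<String, Integer> d : docs)
      for (int j = 0; j < m; j++)
        if (d.containsKey(vocab.get(j))) df[j]++;
    double[][] x = new double[n][m];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++) {
        Integer tf = docs.get(i).get(vocab.get(j));
        if (tf != null) x[i][j] = tf * Math.log(n / (1.0 + df[j]));
      }
    return x;
  }

  // plain k-means: random seeds, then alternate assignment and centroid update
  static int[] kmeans(double[][] x, int k, int iterations, Random rnd) {
    int n = x.length, m = x[0].length;
    double[][] centroids = new double[k][m];
    for (int j = 0; j < k; j++) centroids[j] = x[rnd.nextInt(n)].clone();
    int[] assign = new int[n];
    for (int it = 0; it < iterations; it++) {
      for (int i = 0; i < n; i++) {                     // assignment step
        double best = Double.MAX_VALUE;
        for (int j = 0; j < k; j++) {
          double d = 0;
          for (int t = 0; t < m; t++) {
            double diff = x[i][t] - centroids[j][t];
            d += diff * diff;
          }
          if (d < best) { best = d; assign[i] = j; }
        }
      }
      double[][] sum = new double[k][m];                // update step
      int[] count = new int[k];
      for (int i = 0; i < n; i++) {
        count[assign[i]]++;
        for (int t = 0; t < m; t++) sum[assign[i]][t] += x[i][t];
      }
      for (int j = 0; j < k; j++)
        if (count[j] > 0)
          for (int t = 0; t < m; t++) centroids[j][t] = sum[j][t] / count[j];
    }
    return assign;
  }
}

In practice you would use sparse vectors, normalize each document vector to unit length, and compare with cosine rather than raw Euclidean distance on dense arrays, but the control flow is the same, and swapping in k-means++ only changes the seeding loop.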
