Hi Ted,
Thanks much for the useful response. A few comments/questions inline
below...
On Feb 10, 2010, at 6:27pm, Ted Dunning wrote:
I don't think that there is a single best approach. Here, in rough order, are the approaches that I would try:

1) carrot2 may be able to handle a few thousand documents. If so, you should be pretty well off, because their clustering is typically pretty good, and it shows even better than it is because it makes inspectability a priority. carrot2 is not, however, designed to scale to very large collections, according to Dawid.
We originally tried Carrot2, and the results weren't bad. But one additional fact I should have mentioned is that I need to be able to use the resulting clusters to do subsequent classification of future documents. From what I could tell from working with Carrot2 and perusing the mailing list, this isn't really feasible, because Carrot2 clusters don't give you a centroid.
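To make the requirement concrete: what I'm after is being able to do a nearest-centroid lookup for each new document. A minimal sketch of just that step, in plain Java with hypothetical names, assuming dense tf-idf vectors in the same term space as the centroids:

// Illustrative only: nearest-centroid classification of a new document,
// assuming per-cluster centroid vectors (e.g. from k-means) and a tf-idf
// vector for the new document in the same term space.
import java.util.List;

public class NearestCentroid {

    // Cosine similarity between two dense vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Returns the index of the centroid most similar to the document vector.
    static int classify(double[] docVector, List<double[]> centroids) {
        int best = -1;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < centroids.size(); c++) {
            double sim = cosine(docVector, centroids.get(c));
            if (sim > bestSim) {
                bestSim = sim;
                best = c;
            }
        }
        return best;
    }
}

So whatever clustering approach I pick needs to leave me with centroid vectors (or something equivalent) to compare new documents against.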
2) k-means on tf-idf weighted words and bigrams, especially with the k-means++ starting point that isn't available yet. This should also be pretty good, but may show a bit worse than it really is because it doesn't use the tricks carrot uses.
Is there any support currently in Mahout for generating tf-idf
weighted vectors without creating a Lucene index? Just curious.
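To be clear about what I mean: the weighting itself doesn't strictly need an index, so I'm picturing something with the shape of this plain-Java sketch (hypothetical names, not any particular Mahout API):

// Illustrative only: tf-idf weighting computed directly from tokenized
// documents, with no Lucene index involved. All names are hypothetical.
import java.util.*;

public class TfIdf {

    // docs: each document is a list of (already tokenized, lowercased) terms.
    // Returns one term -> weight map per document.
    static List<Map<String, Double>> vectorize(List<List<String>> docs) {
        // Document frequency for each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        int n = docs.size();

        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Term frequency within this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                weights.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(weights);
        }
        return vectors;
    }
}

Mostly I'm wondering whether Mahout already has a driver for this, or whether I'd be rolling my own.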
3) k-means on SVD representations of documents. Same comments as (2), except that this option depends on *two* pieces of unreleased code (k-means++ and SVD) instead of one. At this size, you should be able to avoid using the mega-SVD that Jake just posted, but that will mean you need to write your own interlude between the document vectorizer and a normal in-memory SVD. You may have some cooler results here in that documents that share no words might be seen as similar, but I expect that overall results for this should be very similar to those for (2).
I assume you'd use something like the Lucene ShingleAnalyzer to generate one- and two-word terms.
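In other words, the effect I'm expecting is roughly this (sketched in plain Java with hypothetical names, rather than against the actual Lucene analyzer API):

// Illustrative only: emitting unigrams plus adjacent-word bigrams from a
// token stream, which is the effect I'd expect from a shingle-style analyzer.
import java.util.ArrayList;
import java.util.List;

public class Shingles {

    static List<String> unigramsAndBigrams(List<String> tokens) {
        List<String> terms = new ArrayList<>(tokens);           // unigrams
        for (int i = 0; i + 1 < tokens.size(); i++) {
            terms.add(tokens.get(i) + " " + tokens.get(i + 1));  // bigrams
        }
        return terms;
    }
}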
4) if you aren't happy by now, invoke plan B and come up with new ideas. These might include raw LDA, k-means on LDA document vectors, and k-means with norms other than L_2 and L_1.
All of this changes a bit if you have some labeled documents. K-means should be pretty easy to extend to deal with that, and it can dramatically improve results.
I've experimented with using just the results of the Zemanta API, which generates keywords by (I believe) calculating similarity with sets of Wikipedia pages that share a similar category. But clustering these keyword-only vectors gave marginal results, mostly for cases where the Zemanta results were questionable.

What approach would you take to mix in keywords such as these with raw document data?
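One option I've been toying with (sketched below in plain Java, with hypothetical names, just to show the shape of it) is appending the keywords as separately weighted dimensions alongside the regular tf-idf terms, so their contribution can be dialed up or down:

// Illustrative only: one way to mix API-supplied keywords into a raw
// tf-idf document vector -- append them as extra (prefixed) dimensions
// with their own weight. All names are hypothetical.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MixedVector {

    static Map<String, Double> mix(Map<String, Double> tfIdfTerms,
                                   List<String> keywords,
                                   double keywordWeight) {
        Map<String, Double> combined = new HashMap<>(tfIdfTerms);
        for (String keyword : keywords) {
            // Prefix keeps keyword dimensions separate from regular terms,
            // so their weight can be tuned independently.
            combined.merge("kw:" + keyword, keywordWeight, Double::sum);
        }
        return combined;
    }
}

But I'm not sure how to pick the keyword weight, or whether there's a better-established way to blend the two.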
Thanks again!
-- Ken
On Wed, Feb 10, 2010 at 6:04 PM, Ken Krugler <[email protected]> wrote:
Given the code currently in Mahout (+ Lucene), is there a generally accepted best approach for clustering of documents?
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g