Hi Ted,
Thanks much for the useful response. A few comments/questions inline
below...
On Feb 10, 2010, at 6:27pm, Ted Dunning wrote:
I don't think that there is a single best approach. Here, in rough order, are the approaches that I would try:

1) carrot2 may be able to handle a few thousand documents. If so, you should be pretty well off, because their clustering is typically pretty good, and it shows even better than it is because it makes inspectability a priority. carrot2 is not, however, designed to scale to very large collections, according to Dawid.
We originally tried Carrot2, and the results weren't bad. But one additional fact I should have mentioned is that I need to be able to use the resulting clusters to do subsequent classification of future documents. From what I could tell from working with Carrot2 and perusing the mailing list, this isn't really feasible, because Carrot2 clusters don't give you a centroid.
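To make the requirement concrete: what I'm after is being able to do a nearest-centroid lookup for each new document. A minimal sketch of just that step, in plain Java with hypothetical names, assuming dense tf-idf vectors in the same term space as the centroids:

// Illustrative only: nearest-centroid classification of a new document,
// assuming per-cluster centroid vectors (e.g. from k-means) and a tf-idf
// vector for the new document in the same term space.
import java.util.List;

public class NearestCentroid {

    // Cosine similarity between two dense vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Returns the index of the centroid most similar to the document vector.
    static int classify(double[] docVector, List<double[]> centroids) {
        int best = -1;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < centroids.size(); c++) {
            double sim = cosine(docVector, centroids.get(c));
            if (sim > bestSim) {
                bestSim = sim;
                best = c;
            }
        }
        return best;
    }
}

So whatever clustering approach I pick needs to leave me with centroid vectors (or something equivalent) to compare new documents against.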
2) k-means on tf-idf weighted words and bigrams, especially with the k-means++ starting point that isn't available yet. This should also be pretty good, but may show a bit worse than it really is because it doesn't use the tricks carrot uses.
Is there any support currently in Mahout for generating tf-idf
weighted vectors without creating a Lucene index? Just curious.
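To be clear about what I mean: the weighting itself doesn't strictly need an index, so I'm picturing something with the shape of this plain-Java sketch (hypothetical names, not any particular Mahout API):

// Illustrative only: tf-idf weighting computed directly from tokenized
// documents, with no Lucene index involved. All names are hypothetical.
import java.util.*;

public class TfIdf {

    // docs: each document is a list of (already tokenized, lowercased) terms.
    // Returns one term -> weight map per document.
    static List<Map<String, Double>> vectorize(List<List<String>> docs) {
        // Document frequency for each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        int n = docs.size();

        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Term frequency within this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                weights.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(weights);
        }
        return vectors;
    }
}

Mostly I'm wondering whether Mahout already has a driver for this, or whether I'd be rolling my own.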
3) k-means on SVD representations of documents. Same comments as (2), except that this option depends on *two* pieces of unreleased code (k-means++ and SVD) instead of one. At this size, you should be able to avoid using the mega-SVD that Jake just posted, but that will mean you need to write your own interlude between the document vectorizer and a normal in-memory SVD. You may have some cooler results here in that documents that share no words might be seen as similar, but I expect that overall results for this should be very similar to those for (2).
I assume you'd use something like the Lucene ShingleAnalyzer to generate one- and two-word terms.
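In other words, the effect I'm expecting is roughly this (sketched in plain Java with hypothetical names, rather than against the actual Lucene analyzer API):

// Illustrative only: emitting unigrams plus adjacent-word bigrams from a
// token stream, which is the effect I'd expect from a shingle-style analyzer.
import java.util.ArrayList;
import java.util.List;

public class Shingles {

    static List<String> unigramsAndBigrams(List<String> tokens) {
        List<String> terms = new ArrayList<>(tokens);           // unigrams
        for (int i = 0; i + 1 < tokens.size(); i++) {
            terms.add(tokens.get(i) + " " + tokens.get(i + 1));  // bigrams
        }
        return terms;
    }
}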
4) if you aren't happy by now, invoke plan B and come up with new ideas. These might include raw LDA, k-means on LDA document vectors, and k-means with norms other than L_2 and L_1.
All of this changes a bit if you have some labeled documents. K-means should be pretty easy to extend to deal with that, and it can dramatically improve results.
I've experimented with using just the results of the Zemanta API, which generates keywords by (I believe) calculating similarity with sets of Wikipedia pages that share a similar category. But clustering these keyword-only vectors gave marginal results, mostly for cases where the Zemanta results were questionable.

What approach would you take to mix in keywords such as these with raw document data?
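One option I've been toying with (sketched below in plain Java, with hypothetical names, just to show the shape of it) is appending the keywords as separately weighted dimensions alongside the regular tf-idf terms, so their contribution can be dialed up or down:

// Illustrative only: one way to mix API-supplied keywords into a raw
// tf-idf document vector -- append them as extra (prefixed) dimensions
// with their own weight. All names are hypothetical.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MixedVector {

    static Map<String, Double> mix(Map<String, Double> tfIdfTerms,
                                   List<String> keywords,
                                   double keywordWeight) {
        Map<String, Double> combined = new HashMap<>(tfIdfTerms);
        for (String keyword : keywords) {
            // Prefix keeps keyword dimensions separate from regular terms,
            // so their weight can be tuned independently.
            combined.merge("kw:" + keyword, keywordWeight, Double::sum);
        }
        return combined;
    }
}

But I'm not sure how to pick the keyword weight, or whether there's a better-established way to blend the two.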
Thanks again!
-- Ken
On Wed, Feb 10, 2010 at 6:04 PM, Ken Krugler <[email protected]> wrote:
Given the code currently in Mahout (+ Lucene), is there a generally accepted best approach for clustering of documents?
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g