Re: Suggestions for best approach to classic document clustering

Dawid Weiss Thu, 11 Feb 2010 00:32:05 -0800

> We originally tried Carrot2, and the results weren't bad. But one additional
> fact I should have mentioned is that I need to be able to use the resulting
> clusters to do subsequent classification of future documents. From what I
> could tell w/Carrot2 and perusing the mailing list, this isn't really
> feasible due to the lack of a centroid from the Carrot2 clusters.


That's because Carrot2 clusters have no "centroids" in the sense of,
let's say, k-means. If you define a cluster's centroid as the average
of its documents' term vectors, you could compute the centroids after
clustering... but that's not as clean as having a centroid that comes
out of the algorithm's internal workings, of course.
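
To make the "compute them later" idea concrete, here is a rough sketch
in Python. It is not Carrot2 code and uses no Carrot2 API: it assumes
you have already exported each cluster as a list of plain-text
documents, and the names (term_vector, centroid, classify) and the toy
data are made up for illustration. Whitespace tokenization, raw term
counts and cosine similarity are my assumptions too; real use would
want proper tokenization and some term weighting.

```python
from collections import Counter
import math


def term_vector(text):
    """Bag-of-words term counts for one document (naive whitespace split)."""
    return Counter(text.lower().split())


def centroid(docs):
    """Average the term vectors of all documents in one cluster."""
    total = Counter()
    for doc in docs:
        total.update(term_vector(doc))
    return {term: count / len(docs) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, centroids):
    """Assign a new document to the cluster with the most similar centroid."""
    vec = term_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical clusters, standing in for documents grouped by a clusterer.
clusters = {
    "java": ["java array memory heap", "java heap size limit"],
    "clustering": ["k-means cluster centroid", "document cluster labels"],
}
centroids = {label: centroid(docs) for label, docs in clusters.items()}

print(classify("cluster centroid vector", centroids))
```

Future documents are then classified by nearest centroid, which is
only as good as the assumption that an averaged term vector describes
the cluster well.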

Like Ted mentioned, Carrot2 algorithms are designed to run in-memory
and with no harsh constraints on memory use (when there is a
speed-vs-memory tradeoff, we usually choose speed), so the problem
size will be bounded by the maximum size of Java arrays, if nothing
else.

Dawid
