> We originally tried Carrot2, and the results weren't bad. But one additional
> fact I should have mentioned is that I need to be able to use the resulting
> clusters to do subsequent classification of future documents. From what I
> could tell w/Carrot2 and perusing the mailing list, this isn't really
> feasible due to the lack of a centroid from the Carrot2 clusters.

That's because Carrot2 clusters have no "centroids" in the sense of, say, k-means. If you take a cluster's centroid to be the average vector of its documents' words, you could compute the centroids afterwards, but of course that's not as clean as having a centroid that comes out of the algorithm's internal workings.

As Ted mentioned, Carrot2 algorithms are designed to run in memory, with no harsh constraints on memory use (when there is a speed vs. memory tradeoff, we usually choose speed), so the problem size will be bounded by the maximum size of Java arrays, if nothing else.

Dawid
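
P.S. For what it's worth, the "compute centroids afterwards" idea could be sketched roughly as below. This is only an illustration in plain Python, not Carrot2 API: it assumes whitespace tokenization, raw term counts as the document vectors, and cosine similarity for assigning a new document to the nearest centroid. All function names and the sample clusters are made up for the example.

```python
from collections import Counter
import math


def term_vector(text):
    """Bag-of-words term counts (assumption: lowercase + whitespace split)."""
    return Counter(text.lower().split())


def centroid(docs):
    """Average the term vectors of the documents in one cluster."""
    total = Counter()
    for doc in docs:
        total.update(term_vector(doc))
    return {term: count / len(docs) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, centroids):
    """Return the label of the centroid most similar to the new document."""
    vec = term_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical clusters, standing in for whatever the clustering step produced.
clusters = {
    "databases": ["sql query index", "index tuning sql server"],
    "cooking": ["bake bread oven", "oven roast chicken"],
}
centroids = {label: centroid(docs) for label, docs in clusters.items()}
print(classify("slow sql query", centroids))
```

In practice you would probably want TF-IDF weights and a proper tokenizer instead of raw counts, but the averaging step stays the same.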
