Gang,

What's the state of the world on clustering a raft of textual
documents? Are all the pieces in place to start from a directory of
flat text files, push through Lucene to get the vectors, keep labels
on the vectors to point back to the files, and run, say, k-means?

I've got enough data here that skimming off the few most frequent
unigrams might also be advisable.

I tried running this through Weka, and it ran out of virtual memory.

--benson
