Owen Densmore wrote:

I would like to be able to analyze my document collection (~1200 documents) and discover good "buckets" of categories for them. I'm pretty sure this is termed Document Clustering: finding the emergent clumps the documents naturally fall into, judging from their term vectors.

Looking at the discussion that flared roughly a year ago (last message 2003-11-12) with the subject Document Clustering, it seems Lucene should be able to help with this. Has anyone had success with this recently?

Last year it was suggested Carrot2 could help, and it would even produce good labels for the clusters. Has this proven to be true? Our goal is to use clustering to build a nifty graphic interface, probably using Flash.

Carrot2 seems to work nicely. Demo here...

Search for something like "artificial intelligence" in my Wikipedia Search engine:

http://www.searchmorph.com/kat/wikipedia.jsp?s=artificial+intelligence

Then click on the "see clustered results.." link to go here:

http://www.searchmorph.com/kat/wikipedia-cluster.jsp?s=artificial%20intelligence

And voilà, what seem like decent clusters.

I'm not sure what the complexity of the algorithm is, but for me ~100 docs works ok, maybe 200, but beyond 200 you need lots more CPU and RAM.

I suggest: try it w/ ~100 docs, and if you like what you see, keep increasing the # of docs you give it. You might have to wait a while w/ all 1,200 docs...
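
If it helps to see the "start small and grow" approach in code, here is a rough sketch in plain Python. To be clear, this is NOT Carrot2's algorithm and it does not use the Lucene API: term_vector, cosine, greedy_clusters and time_growing_samples are made-up names for illustration, the 0.3 similarity threshold is an arbitrary assumption, and the clustering is a naive greedy pass over bag-of-words term vectors. It only shows how you might time runs on growing slices of a collection before committing to all 1,200 docs.

```python
import math
import time
from collections import Counter


def term_vector(text):
    """Lower-cased bag-of-words term frequencies (no stemming, no stop words)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_clusters(docs, threshold=0.3):
    """Put each doc into the first cluster whose seed doc is similar enough,
    otherwise start a new cluster. The threshold is an assumed value."""
    clusters = []  # list of (seed_vector, [doc indices])
    for i, text in enumerate(docs):
        vec = term_vector(text)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]


def time_growing_samples(docs, sizes=(100, 200, 400, 800, 1200)):
    """Cluster progressively larger slices and report how long each run takes."""
    for n in sizes:
        sample = docs[:n]
        start = time.perf_counter()
        clusters = greedy_clusters(sample)
        elapsed = time.perf_counter() - start
        print(f"{len(sample)} docs -> {len(clusters)} clusters in {elapsed:.3f}s")
```

Calling time_growing_samples(list_of_document_strings) prints one line per sample size, so you can see how the running time grows before you hand the full collection to a real clustering engine.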

- Dave







Thanks for any pointers.

Owen


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]


