I'm looking into adding document clustering capabilities to Solr, using Mahout [1][2]. I already have search-results clustering, thanks to Carrot2. What I'm looking for is practical advice on deploying a system that will cluster potentially large (but not huge) corpora. Let's assume one machine for now, though it shouldn't matter.

Here are some thoughts I have:

In Solr, I expect to send a request to go off and build the clusters for some non-trivial set of documents in the index. The actual building needs to happen in a background thread, so as not to hold up the caller. My thinking is that the request will come in and spawn a job that calculates a similarity matrix for all the documents in the set (which means storing the term vectors in Lucene). The job then runs the clustering algorithm (user configurable, based on the implementations we have: k-means, mean-shift, fuzzy, whatever) and stores the results in Solr's data directory somehow, so that they can be replicated, though that is not a big concern of mine at the moment.
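To make the "spawn a job and return immediately" idea concrete, here is a minimal sketch using only java.util.concurrent. The class name ClusterJobRunner and its methods are hypothetical, and the Callable stands in for the real similarity-matrix and clustering work; nothing here is Solr or Mahout API.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: runs clustering jobs off the request thread. */
class ClusterJobRunner {
    // One worker thread, so jobs queue up instead of competing for the machine.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final Map<String, Future<String>> jobs = new ConcurrentHashMap<>();

    /** Queues the job and returns a job id right away; the caller is not blocked. */
    String submit(Callable<String> clusteringJob) {
        String jobId = UUID.randomUUID().toString();
        jobs.put(jobId, executor.submit(clusteringJob));
        return jobId;
    }

    /** True once the job has finished, whether it succeeded or failed. */
    boolean isDone(String jobId) {
        Future<String> job = jobs.get(jobId);
        return job != null && job.isDone();
    }

    /** Blocks until the job finishes and returns its result. */
    String await(String jobId) {
        Future<String> job = jobs.get(jobId);
        if (job == null) {
            throw new IllegalArgumentException("unknown job id: " + jobId);
        }
        try {
            return job.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("clustering job failed", e.getCause());
        }
    }

    /** Stops accepting new jobs; already queued jobs still run. */
    void shutdown() {
        executor.shutdown();
    }
}
```

A request handler would call submit(...) and hand the job id back to the client, which can then ask about that id later.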

Then, at any time, the application can ask Solr for the clusters (whatever that means) and it will return them (docids, fields, whatever the app asks for). If the background task isn't done yet, the result set will be empty, or it will return a percentage completion or something useful.
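One way to picture the "empty result set or a percentage" behavior is a small progress object shared between the background job and the request handler. This is only a sketch: ClusterJobProgress is a hypothetical name, and it assumes the job can estimate its total number of steps up front (for example, the number of pairwise comparisons).

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical progress holder shared by the job thread and request threads. */
class ClusterJobProgress {
    private final long totalSteps;
    private final AtomicLong completedSteps = new AtomicLong();
    // Empty until the job publishes; volatile so request threads see the update.
    private volatile List<List<String>> clusters = Collections.emptyList();

    ClusterJobProgress(long totalSteps) {
        if (totalSteps <= 0) {
            throw new IllegalArgumentException("totalSteps must be positive");
        }
        this.totalSteps = totalSteps;
    }

    /** Called by the background job after each unit of work. */
    void stepCompleted() {
        completedSteps.incrementAndGet();
    }

    /** Called by the background job to expose clusters (each a list of doc ids). */
    void publish(List<List<String>> latestClusters) {
        clusters = latestClusters;
    }

    /** Whole-number percentage, capped at 100. */
    int percentComplete() {
        return (int) Math.min(100, 100 * completedSteps.get() / totalSteps);
    }

    /** Clusters published so far; empty if nothing has been published yet. */
    List<List<String>> clusters() {
        return clusters;
    }
}
```

Because publish(...) can be called more than once, the same object could also carry "best clusters so far" if the algorithm is able to produce them.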

Obviously, my first step is to get it working, but...

Is it practical to return a partially done set of results? That is, the best clusters so far, with perhaps a percentage-to-completion value, or perhaps a list of the comparisons that haven't been done yet?

What if something happens? How can I make Mahout fault-tolerant, such that I could conceivably pick up the job again from where it went down, or at least get the clusters so far? How do people approach this to date (w/ or w/o Mahout)? What needs to be done in Mahout to make this possible? I suspect Hadoop has some support for it.
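For an iterative algorithm such as k-means, one generic way to be able to resume is to checkpoint the centroids after every iteration. The sketch below is not Mahout's or Hadoop's mechanism; CentroidCheckpoint is a hypothetical class, it assumes one-dimensional centroids to stay short, and it relies on ATOMIC_MOVE replacing an existing file, which the JDK documents as implementation specific.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Optional;

/** Hypothetical checkpoint file: one line, "iteration,c0,c1,...". */
class CentroidCheckpoint {
    static final class State {
        final int iteration;
        final double[] centroids;

        State(int iteration, double[] centroids) {
            this.iteration = iteration;
            this.centroids = centroids;
        }
    }

    private final Path file;

    CentroidCheckpoint(Path file) {
        this.file = file;
    }

    /** Writes to a temp file, then renames it over the previous checkpoint. */
    void save(int iteration, double[] centroids) {
        StringBuilder line = new StringBuilder().append(iteration);
        for (double c : centroids) {
            line.append(',').append(c);
        }
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        try {
            Files.write(tmp, line.toString().getBytes(StandardCharsets.UTF_8));
            // A crash before this move leaves the older checkpoint untouched.
            Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the last saved state, or empty if no checkpoint exists yet. */
    Optional<State> load() {
        if (!Files.exists(file)) {
            return Optional.empty();
        }
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            String[] parts = text.trim().split(",");
            double[] centroids = new double[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                centroids[i - 1] = Double.parseDouble(parts[i]);
            }
            return Optional.of(new State(Integer.parseInt(parts[0]), centroids));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On restart, the job would call load() and, if a state is present, continue from iteration + 1 with the saved centroids instead of starting over. The saved centroids also double as "the clusters so far".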

Anything else I don't know? Does what I'm thinking about make sense?

Thanks for any insight,
Grant



[1] http://wiki.apache.org/solr/ClusteringComponent
[2] https://issues.apache.org/jira/browse/SOLR-769
