On Oct 13, 2008, at 12:17 AM, Vaijanath N. Rao wrote:
Hi Grant,
My replies are inline.
Grant Ingersoll wrote:
I'm looking into adding document clustering capabilities to Solr,
using Mahout [1][2]. I already have search-results clustering,
thanks to Carrot2. What I'm looking for is practical advice on
deploying a system that is going to cluster a potentially large
corpus (but not huge; let's assume one machine for now, though it
shouldn't matter).
Here are some thoughts I have:
In Solr, I expect to send a request to go off and build the
clusters for some non-trivial set of documents in the index. The
actual building needs to happen in a background thread, so as to
not hold up the caller.
Bingo. It's better to spawn a new process for clustering than to
hold up the caller. A status page indicating the state of the
clustering run would also help, as the caller can then check
against that status page to find out the current status.
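A minimal sketch of that background-job-plus-status-page idea: the request handler hands the work to an executor and returns immediately, while a status request just reads the latest state. The `ClusteringJob`/`State` names here are hypothetical, not Solr or Mahout APIs:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: spawn the clustering work off-thread so the
// caller isn't blocked; expose a cheap status() read for a status page.
public class ClusteringJob {
    public enum State { IDLE, RUNNING, DONE }

    private final AtomicReference<State> state = new AtomicReference<>(State.IDLE);
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    /** Called by the request handler: submit the job and return immediately. */
    public void start(Runnable clusteringWork) {
        state.set(State.RUNNING);
        executor.submit(() -> {
            try {
                clusteringWork.run();
            } finally {
                state.set(State.DONE);
            }
        });
    }

    /** Called by the status page: non-blocking read of the current state. */
    public State status() {
        return state.get();
    }

    /** Drain the executor; handy for shutdown (and for testing the sketch). */
    public void awaitCompletion() throws InterruptedException {
        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

In a real Solr request handler the status would presumably also carry an error state and a job identifier, but the shape is the same: one thread doing the work, one atomic reference the status page polls.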
My thinking is the request will come in and spawn off a job that
goes and calculates a similarity matrix for all the documents in
the set (need to store the term vectors in Lucene) and then goes
and runs the clustering job (user configurable, based on the
implementations we have: k-means, mean-shift, fuzzy, whatever) and
stores the results into Solr's data directory somehow (so that it
can be replicated, but not a big concern of mine at the moment)
If we are going to work on a similarity matrix, I would like to add
FIHC (Frequent Itemset Hierarchical Clustering). If you need, I can
definitely pitch in with this. Ideally we should target replication,
and I think the idea is good.
I'm open for anything. I figured start with the simplest, but if you
have references, that would be cool.
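The "similarity matrix" step above can be sketched as pairwise cosine similarity over sparse term vectors (term-to-weight maps, as might be pulled from Lucene's stored term vectors). The names are illustrative only, not Lucene or Mahout APIs, and a real implementation would stream rather than materialize the full n-by-n matrix:

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of the similarity-matrix step: cosine similarity
// over sparse term-weight vectors, computed for every document pair.
public class SimilarityMatrix {

    /** Cosine similarity between two sparse term-weight vectors. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        double normA = 0.0, normB = 0.0;
        for (double v : a.values()) normA += v * v;
        for (double v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Full pairwise matrix: O(n^2), which is why this belongs in a background job. */
    public static double[][] build(List<Map<String, Double>> docs) {
        int n = docs.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(docs.get(i), docs.get(j));
                sim[i][j] = s;
                sim[j][i] = s; // the matrix is symmetric
            }
        }
        return sim;
    }
}
```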
Then, at any time, the application can ask Solr for the clusters
(whatever that means) and it will return them (docids, fields,
whatever the app asks for). If the background task isn't done yet,
the results set will be empty, or it will return a percentage
completion or something useful.
In my opinion it is better to return the percentage of completion
rather than the top clusters at time X if the clustering is not yet
finished. In most clustering cases the input data determines the
centroids of the clusters, so a change in the input might change the
centroids, and you might get different results for different input
samples derived from the same data set.
Yeah, I think percent complete is good; it will also keep the amount
of traffic down. But, in true Solr fashion, maybe sending partial
clusters can be optional, too.
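The percent-complete idea is cheap to implement: the job bumps a counter as it works through the pairwise comparisons, and the status request turns that into a percentage. The optional partial-clusters response would read a snapshot alongside it. Again a hypothetical sketch, not a Solr API:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of percent-complete tracking for the pairwise-comparison phase.
// The job calls recordComparison() as it works; the status page polls
// percentComplete() instead of waiting for the full result.
public class ClusterProgress {
    private final long totalComparisons;
    private final AtomicLong done = new AtomicLong();

    public ClusterProgress(int numDocs) {
        // A symmetric similarity matrix needs n*(n-1)/2 comparisons.
        this.totalComparisons = (long) numDocs * (numDocs - 1) / 2;
    }

    public void recordComparison() {
        done.incrementAndGet();
    }

    /** Percentage in [0, 100] that the caller can poll. */
    public int percentComplete() {
        if (totalComparisons == 0) return 100;
        return (int) (100 * done.get() / totalComparisons);
    }
}
```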
Obviously, my first step is to get it working, but...
Is it practical to return a partially done set of results? I.e.,
the best clusters so far, with perhaps a percent-complete value or
perhaps a list of the comparisons that haven't been done yet?
What if something happens? How can I make Mahout fault-tolerant,
such that, conceivably, I could pick up the job again from where it
went down, or at least be able to get the clusters so far? How do
people approach this to date (w/ or w/o Mahout)? What needs to be
done in Mahout to make this possible? I suspect Hadoop has some
support for it.
Not sure whether Mahout is fault tolerant in that respect, but I
guess other members can comment on this.
No, I don't think it is at the moment in this respect. I mean, Hadoop
has it, so it probably isn't that hard to add...
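One way the fault tolerance discussed above could work, as an assumption rather than existing Mahout behavior: checkpoint the cluster centroids to disk after each k-means iteration, so a restarted job resumes from the last checkpoint instead of from scratch (Mahout's Hadoop jobs write per-iteration output, which is essentially the same trick). A minimal sketch:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical checkpointing sketch: persist centroids after each
// iteration so a crashed clustering job can resume from the last
// good state. One comma-separated vector per line.
public class CentroidCheckpoint {

    /** Save the current centroids, overwriting any previous checkpoint. */
    public static void save(Path file, List<double[]> centroids) throws IOException {
        List<String> lines = new ArrayList<>();
        for (double[] c : centroids) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < c.length; i++) {
                if (i > 0) sb.append(',');
                sb.append(c[i]);
            }
            lines.add(sb.toString());
        }
        Files.write(file, lines);
    }

    /** Load the last checkpoint, or null if none exists (start fresh). */
    public static List<double[]> load(Path file) throws IOException {
        if (!Files.exists(file)) return null;
        List<double[]> centroids = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            String[] parts = line.split(",");
            double[] c = new double[parts.length];
            for (int i = 0; i < parts.length; i++) {
                c[i] = Double.parseDouble(parts[i]);
            }
            centroids.add(c);
        }
        return centroids;
    }
}
```

On restart the driver would call `load`, and only if it returns null would it fall back to fresh random centroids; the same file also gives you "the clusters so far" when a job dies mid-run.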