On Oct 13, 2008, at 12:17 AM, Vaijanath N. Rao wrote:
Hi Grant,
My replies are inline.
Grant Ingersoll wrote:
I'm looking into adding document clustering capabilities to Solr,
using Mahout [1][2]. I already have search-results clustering,
thanks to Carrot2. What I'm looking for is practical advice on
deploying a system that is going to cluster a potentially large
corpus (but not huge; let's assume one machine for now, though it
shouldn't matter).
Here are some thoughts I have:
In Solr, I expect to send a request to go off and build the
clusters for some non-trivial set of documents in the index. The
actual building needs to happen in a background thread, so as to
not hold up the caller.
Bingo. It's better to spawn a new process for clustering than to
hold up the caller. A status page indicating the state of the
clustering run would also help, as the caller can then check
against that status page to find out the current status.
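A minimal sketch of that background-job-plus-status-page idea: the request handler hands the work to an executor and returns immediately, while a status request just reads the latest state. The `ClusteringJob`/`State` names here are hypothetical, not Solr or Mahout APIs:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: spawn the clustering work off-thread so the
// caller isn't blocked; expose a cheap status() read for a status page.
public class ClusteringJob {
    public enum State { IDLE, RUNNING, DONE }

    private final AtomicReference<State> state = new AtomicReference<>(State.IDLE);
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    /** Called by the request handler: submit the job and return immediately. */
    public void start(Runnable clusteringWork) {
        state.set(State.RUNNING);
        executor.submit(() -> {
            try {
                clusteringWork.run();
            } finally {
                state.set(State.DONE);
            }
        });
    }

    /** Called by the status page: non-blocking read of the current state. */
    public State status() {
        return state.get();
    }

    /** Drain the executor; handy for shutdown (and for testing the sketch). */
    public void awaitCompletion() throws InterruptedException {
        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

In a real Solr request handler the status would presumably also carry an error state and a job identifier, but the shape is the same: one thread doing the work, one atomic reference the status page polls.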
My thinking is the request will come in and spawn off a job that
goes and calculates a similarity matrix for all the documents in
the set (need to store the term vectors in Lucene) and then goes
and runs the clustering job (user configurable, based on the
implementations we have: k-means, mean-shift, fuzzy, whatever) and
stores the results into Solr's data directory somehow (so that it
can be replicated, but not a big concern of mine at the moment)
If we are going to work on a similarity matrix, I would like to add
FIHC (Frequent Itemset Hierarchical Clustering). If you need, I can
definitely pitch in with this. Ideally we should target replication,
and I think the idea is good.
I'm open for anything. I figured start with the simplest, but if you
have references, that would be cool.
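The "similarity matrix" step above can be sketched as pairwise cosine similarity over sparse term vectors (term-to-weight maps, as might be pulled from Lucene's stored term vectors). The names are illustrative only, not Lucene or Mahout APIs, and a real implementation would stream rather than materialize the full n-by-n matrix:

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of the similarity-matrix step: cosine similarity
// over sparse term-weight vectors, computed for every document pair.
public class SimilarityMatrix {

    /** Cosine similarity between two sparse term-weight vectors. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        double normA = 0.0, normB = 0.0;
        for (double v : a.values()) normA += v * v;
        for (double v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Full pairwise matrix: O(n^2), which is why this belongs in a background job. */
    public static double[][] build(List<Map<String, Double>> docs) {
        int n = docs.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(docs.get(i), docs.get(j));
                sim[i][j] = s;
                sim[j][i] = s; // the matrix is symmetric
            }
        }
        return sim;
    }
}
```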
Then, at any time, the application can ask Solr for the clusters
(whatever that means) and it will return them (docids, fields,
whatever the app asks for). If the background task isn't done yet,
the results set will be empty, or it will return a percentage
completion or something useful.
In my opinion it is better to return the percentage of completion
rather than the top clusters at time X if the clustering is not yet
finished. In most clustering cases the input data determines the
centroids of the clusters, so a change in the input might change the
centroids, and you might get different results for different input
samples derived from the same data set.
Yeah, I think percent complete is good; it will also keep the amount
of traffic down. But, in true Solr fashion, maybe sending partial
clusters can be optional, too.
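The percent-complete idea is cheap to implement: the job bumps a counter as it works through the pairwise comparisons, and the status request turns that into a percentage. The optional partial-clusters response would read a snapshot alongside it. Again a hypothetical sketch, not a Solr API:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of percent-complete tracking for the pairwise-comparison phase.
// The job calls recordComparison() as it works; the status page polls
// percentComplete() instead of waiting for the full result.
public class ClusterProgress {
    private final long totalComparisons;
    private final AtomicLong done = new AtomicLong();

    public ClusterProgress(int numDocs) {
        // A symmetric similarity matrix needs n*(n-1)/2 comparisons.
        this.totalComparisons = (long) numDocs * (numDocs - 1) / 2;
    }

    public void recordComparison() {
        done.incrementAndGet();
    }

    /** Percentage in [0, 100] that the caller can poll. */
    public int percentComplete() {
        if (totalComparisons == 0) return 100;
        return (int) (100 * done.get() / totalComparisons);
    }
}
```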
Obviously, my first step is to get it working, but...
Is it practical to return a partially done set of results? I.e.,
the best clusters so far, with perhaps a percent-complete value or
perhaps a list of the comparisons that haven't been done yet?
What if something happens? How can I make Mahout fault-tolerant,
such that, conceivably, I could pick up the job again from where it
went down, or at least be able to get the clusters so far? How do
people approach this to date (w/ or w/o Mahout)? What needs to be
done in Mahout to make this possible? I suspect Hadoop has some
support for it.
Not sure whether Mahout is fault tolerant in that respect, but I
guess other members can comment on this.
No, I don't think it is at the moment in this respect. I mean, Hadoop
has it, so it probably isn't that hard to add...
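One way the fault tolerance discussed above could work, as an assumption rather than existing Mahout behavior: checkpoint the cluster centroids to disk after each k-means iteration, so a restarted job resumes from the last checkpoint instead of from scratch (Mahout's Hadoop jobs write per-iteration output, which is essentially the same trick). A minimal sketch:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical checkpointing sketch: persist centroids after each
// iteration so a crashed clustering job can resume from the last
// good state. One comma-separated vector per line.
public class CentroidCheckpoint {

    /** Save the current centroids, overwriting any previous checkpoint. */
    public static void save(Path file, List<double[]> centroids) throws IOException {
        List<String> lines = new ArrayList<>();
        for (double[] c : centroids) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < c.length; i++) {
                if (i > 0) sb.append(',');
                sb.append(c[i]);
            }
            lines.add(sb.toString());
        }
        Files.write(file, lines);
    }

    /** Load the last checkpoint, or null if none exists (start fresh). */
    public static List<double[]> load(Path file) throws IOException {
        if (!Files.exists(file)) return null;
        List<double[]> centroids = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            String[] parts = line.split(",");
            double[] c = new double[parts.length];
            for (int i = 0; i < parts.length; i++) {
                c[i] = Double.parseDouble(parts[i]);
            }
            centroids.add(c);
        }
        return centroids;
    }
}
```

On restart the driver would call `load`, and only if it returns null would it fall back to fresh random centroids; the same file also gives you "the clusters so far" when a job dies mid-run.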