Re: assist with the Clustering component in Solr/Lucene

ramdev.wudali Mon, 16 May 2011 09:43:00 -0700

Thanks much Stan,


Ramdev

On May 16, 2011, at 11:38 AM, Stanislaw Osinski wrote:


                        Both of the clustering algorithms that ship with Solr 
(Lingo and STC) are designed to allow one document to appear in more than one 
cluster, which actually does make sense in many scenarios. There's no easy way 
to force them to produce hard clusterings because this would require a complete 
change in the way the algorithms work. If you need each document to belong to 
exactly one cluster, you'd have to post-process the clusters to remove the 
redundant document assignments.
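
A minimal sketch of that post-processing step, not part of Solr or Carrot2: it assumes you have already pulled the clusters out of the response into a map from cluster label to document IDs, and that the map iterates in order of preference (for example, highest-scored cluster first). The class name `HardClusters` and the method `dedupe` are made up for illustration. Each document is kept only in the first cluster that lists it, and clusters left empty are dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HardClusters {

    /**
     * Turns overlapping clusters into a hard clustering by keeping each
     * document only in the first cluster (in map iteration order) that
     * contains it. Clusters that end up empty are omitted from the result.
     */
    public static Map<String, List<String>> dedupe(Map<String, List<String>> clusters) {
        Set<String> seen = new HashSet<>();
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : clusters.entrySet()) {
            List<String> kept = new ArrayList<>();
            for (String docId : entry.getValue()) {
                // Set.add returns false if the document was already assigned.
                if (seen.add(docId)) {
                    kept.add(docId);
                }
            }
            if (!kept.isEmpty()) {
                result.put(entry.getKey(), kept);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input: labels and IDs are illustrative only.
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        clusters.put("Data Mining", Arrays.asList("d1", "d2", "d3"));
        clusters.put("Text Clustering", Arrays.asList("d2", "d4"));
        clusters.put("Search", Arrays.asList("d3"));
        System.out.println(dedupe(clusters));
    }
}
```

"First cluster wins" is only one possible policy. If the order of the clusters carries no meaning in your setup, you may prefer to assign each document to the cluster where it fits best by some measure of your own.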
                        


                On second thought, I have a simple implementation of 
k-means clustering that could do hard clustering for you. It's not available 
yet, but it will most probably be part of the next major release of Carrot2 
(the package that does the clustering). Please watch this issue for updates: 
http://issues.carrot2.org/browse/CARROT-791
                


        Just to let you know: Carrot2 3.5.0 has landed in Solr trunk and 
branch_3x, so you can use the bisecting k-means clustering algorithm 
(org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm) which will 
produce non-overlapping clusters for you. The downside of this simple 
implementation of k-means is that, for the time being, it produces one-word 
cluster labels rather than the phrases that Lingo and STC produce.

        Cheers,

        S.
