Re: assit with the Clustering component in Solr/Lucene
Thanks much Stan,

Ramdev

On May 16, 2011, at 11:38 AM, Stanislaw Osinski wrote:

>>> Both of the clustering algorithms that ship with Solr (Lingo and STC)
>>> are designed to allow one document to appear in more than one cluster,
>>> which actually does make sense in many scenarios. There's no easy way
>>> to force them to produce hard clusterings because this would require a
>>> complete change in the way the algorithms work. If you need each
>>> document to belong to exactly one cluster, you'd have to post-process
>>> the clusters to remove the redundant document assignments.
>>
>> On second thought, I have a simple implementation of k-means clustering
>> that could do hard clustering for you. It's not available yet; it will
>> most probably be part of the next major release of Carrot2 (the package
>> that does the clustering). Please watch this issue
>> http://issues.carrot2.org/browse/CARROT-791 to get updates on this.
>
> Just to let you know: Carrot2 3.5.0 has landed in Solr trunk and
> branch_3x, so you can use the bisecting k-means clustering algorithm
> (org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm), which
> will produce non-overlapping clusters for you. The downside of this
> simple implementation of k-means is that, for the time being, it produces
> one-word cluster labels rather than phrases, as Lingo and STC do.
>
> Cheers,
> S.
Re: assit with the Clustering component in Solr/Lucene
>> Both of the clustering algorithms that ship with Solr (Lingo and STC)
>> are designed to allow one document to appear in more than one cluster,
>> which actually does make sense in many scenarios. There's no easy way
>> to force them to produce hard clusterings because this would require a
>> complete change in the way the algorithms work. If you need each
>> document to belong to exactly one cluster, you'd have to post-process
>> the clusters to remove the redundant document assignments.
>
> On second thought, I have a simple implementation of k-means clustering
> that could do hard clustering for you. It's not available yet; it will
> most probably be part of the next major release of Carrot2 (the package
> that does the clustering). Please watch this issue
> http://issues.carrot2.org/browse/CARROT-791 to get updates on this.

Just to let you know: Carrot2 3.5.0 has landed in Solr trunk and
branch_3x, so you can use the bisecting k-means clustering algorithm
(org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm), which
will produce non-overlapping clusters for you. The downside of this simple
implementation of k-means is that, for the time being, it produces
one-word cluster labels rather than phrases, as Lingo and STC do.

Cheers,
S.
Re: assit with the Clustering component in Solr/Lucene
Thanks for the confirmation, I'll take a look at the issue.

S.

On Thu, Mar 31, 2011 at 17:24, wrote:

> That did make a difference: I now see the exact number of clusters I see
> in the workbench.
> I am of course interested in why the config changes did not have much
> effect. However, I am happy that adding the threshold to my request URL
> produces the desired results.
>
> Let me know if I can do any more tests and I will do so. Thanks much
>
> Ramdev
>
> On Mar 31, 2011, at 10:18 AM, Stanislaw Osinski wrote:
>
>>> I added the parameter as you suggested
>>> (LingoClusteringAlgorithm.clusterMergingThreshold) into the
>>> searchComponent section that describes the Clustering module.
>>> Changing the value of the parameter did not have any effect on my
>>> search results.
>>>
>>> However, when I used the Carrot2 workbench, I could see the effect of
>>> changing the value (from 6 clusters it went down to 2 clusters).
>>
>> Interesting... Can you, for the sake of debugging, append
>> &LingoClusteringAlgorithm.clusterMergingThreshold=0.0 to your request
>> URL?
>>
>> S.
Re: assit with the Clustering component in Solr/Lucene
That did make a difference: I now see the exact number of clusters I see
in the workbench.
I am of course interested in why the config changes did not have much
effect. However, I am happy that adding the threshold to my request URL
produces the desired results.

Let me know if I can do any more tests and I will do so. Thanks much

Ramdev

On Mar 31, 2011, at 10:18 AM, Stanislaw Osinski wrote:

>> I added the parameter as you suggested
>> (LingoClusteringAlgorithm.clusterMergingThreshold) into the
>> searchComponent section that describes the Clustering module.
>> Changing the value of the parameter did not have any effect on my
>> search results.
>>
>> However, when I used the Carrot2 workbench, I could see the effect of
>> changing the value (from 6 clusters it went down to 2 clusters).
>
> Interesting... Can you, for the sake of debugging, append
> &LingoClusteringAlgorithm.clusterMergingThreshold=0.0 to your request
> URL?
>
> S.
Re: assit with the Clustering component in Solr/Lucene
> I added the parameter as you suggested
> (LingoClusteringAlgorithm.clusterMergingThreshold) into the
> searchComponent section that describes the Clustering module.
> Changing the value of the parameter did not have any effect on my search
> results.
>
> However, when I used the Carrot2 workbench, I could see the effect of
> changing the value (from 6 clusters it went down to 2 clusters).

Interesting... Can you, for the sake of debugging, append
&LingoClusteringAlgorithm.clusterMergingThreshold=0.0 to your request URL?

S.
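
As a rough illustration of the debugging step suggested in this message, the sketch below builds a request URL with the threshold appended as a request parameter. The base URL (http://localhost:8983/solr/select) and the query text are assumptions made up for the example; only the parameter name LingoClusteringAlgorithm.clusterMergingThreshold and the value 0.0 come from this thread.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class, not part of Solr or Carrot2.
final class ClusteringRequestUrl {

    // Appends the Lingo merging threshold to a Solr select URL as a
    // plain request parameter, next to the user query.
    static String withMergingThreshold(String baseUrl, String query, double threshold) {
        try {
            return baseUrl
                    + "?q=" + URLEncoder.encode(query, "UTF-8")
                    + "&LingoClusteringAlgorithm.clusterMergingThreshold=" + threshold;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be present on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed host, port and handler path; adjust to your own setup.
        String url = withMergingThreshold(
                "http://localhost:8983/solr/select", "solr clustering", 0.0);
        System.out.println(url);
        // prints http://localhost:8983/solr/select?q=solr+clustering&LingoClusteringAlgorithm.clusterMergingThreshold=0.0
    }
}
```

Comparing the cluster count returned for this URL with the count for the same query without the extra parameter shows whether the request-level value is being picked up.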
Re: assit with the Clustering component in Solr/Lucene
Hi Staszek:

I added the parameter as you suggested
(LingoClusteringAlgorithm.clusterMergingThreshold) into the searchComponent
section that describes the Clustering module. Changing the value of the
parameter did not have any effect on my search results.

However, when I used the Carrot2 workbench, I could see the effect of
changing the value (from 6 clusters it went down to 2 clusters).

Here is the XML snippet for the searchComponent:

default org.carrot2.clustering.lingo.LingoClusteringAlgorithm 20 0.0

I would appreciate any insights into this behavior.

Thanks
Ramdev

On Mar 30, 2011, at 11:51 AM, Stanislaw Osinski wrote:

> Hi Ramdev,
>
> Both of the clustering algorithms that ship with Solr (Lingo and STC) are
> designed to allow one document to appear in more than one cluster, which
> actually does make sense in many scenarios. There's no easy way to force
> them to produce hard clusterings because this would require a complete
> change in the way the algorithms work. If you need each document to
> belong to exactly one cluster, you'd have to post-process the clusters to
> remove the redundant document assignments.
>
> Alternatively, in the case of the Lingo algorithm, you can try lowering
> the "LingoClusteringAlgorithm.clusterMergingThreshold" to some value in
> the range of 0.2 to 0.5. If you do that, clusters containing overlapping
> documents will get merged. For more information about this attribute, see
> here:
> http://download.carrot2.org/stable/manual/#section.attribute.LingoClusteringAlgorithm.clusterMergingThreshold
>
> Cheers,
> Staszek
>
> On Wed, Mar 30, 2011 at 18:21, Markus Jelsma wrote:
>
>> Yes, you can set engine-specific parameters. Check the comments in your
>> snippet.
>>
>>> Hi:
>>> I recently included the Clustering component into Solr and updated the
>>> requestHandler accordingly (in solrconfig.xml). Snippet of the config
>>> for the Clustering:
>>>
>>> <searchComponent name="clusteringComponent"
>>>                  enable="${solr.clustering.enabled:false}"
>>>                  class="org.apache.solr.handler.clustering.ClusteringComponent">
>>>   <lst name="engine">
>>>     <str name="name">default</str>
>>>     <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
>>>     <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
>>>   </lst>
>>>   <lst name="engine">
>>>     <str name="name">stc</str>
>>>     <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
>>>   </lst>
>>> </searchComponent>
>>>
>>> Snippet of the config for requestHandler:
>>>
>>> default="true"> explicit true default true headline pi headline true
>>> false clusteringComponent
>>>
>>> When I perform a search, I see that the Cluster section within the Solr
>>> results shows me results that are not quite consistent. There are two
>>> documents that are reported in two different clusters.
>>>
>>> Are there parameters that can be set that will prevent this from
>>> happening?
>>>
>>> Thanks much
>>>
>>> Ramdev
Re: assit with the Clustering component in Solr/Lucene
> Both of the clustering algorithms that ship with Solr (Lingo and STC) are
> designed to allow one document to appear in more than one cluster, which
> actually does make sense in many scenarios. There's no easy way to force
> them to produce hard clusterings because this would require a complete
> change in the way the algorithms work. If you need each document to
> belong to exactly one cluster, you'd have to post-process the clusters to
> remove the redundant document assignments.

On second thought, I have a simple implementation of k-means clustering
that could do hard clustering for you. It's not available yet; it will
most probably be part of the next major release of Carrot2 (the package
that does the clustering). Please watch this issue
http://issues.carrot2.org/browse/CARROT-791 to get updates on this.

Cheers,
S.
Re: assit with the Clustering component in Solr/Lucene
Hi Ramdev,

Both of the clustering algorithms that ship with Solr (Lingo and STC) are
designed to allow one document to appear in more than one cluster, which
actually does make sense in many scenarios. There's no easy way to force
them to produce hard clusterings because this would require a complete
change in the way the algorithms work. If you need each document to belong
to exactly one cluster, you'd have to post-process the clusters to remove
the redundant document assignments.

Alternatively, in the case of the Lingo algorithm, you can try lowering the
"LingoClusteringAlgorithm.clusterMergingThreshold" to some value in the
range of 0.2 to 0.5. If you do that, clusters containing overlapping
documents will get merged. For more information about this attribute, see
here:
http://download.carrot2.org/stable/manual/#section.attribute.LingoClusteringAlgorithm.clusterMergingThreshold

Cheers,
Staszek

On Wed, Mar 30, 2011 at 18:21, Markus Jelsma wrote:

> Yes, you can set engine-specific parameters. Check the comments in your
> snippet.
>
>> Hi:
>> I recently included the Clustering component into Solr and updated the
>> requestHandler accordingly (in solrconfig.xml). Snippet of the config
>> for the Clustering:
>>
>> <searchComponent name="clusteringComponent"
>>                  enable="${solr.clustering.enabled:false}"
>>                  class="org.apache.solr.handler.clustering.ClusteringComponent">
>>   <lst name="engine">
>>     <str name="name">default</str>
>>     <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
>>     <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
>>   </lst>
>>   <lst name="engine">
>>     <str name="name">stc</str>
>>     <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
>>   </lst>
>> </searchComponent>
>>
>> Snippet of the config for requestHandler:
>>
>> default="true"> explicit true default true headline pi headline true
>> false clusteringComponent
>>
>> When I perform a search, I see that the Cluster section within the Solr
>> results shows me results that are not quite consistent. There are two
>> documents that are reported in two different clusters.
>>
>> Are there parameters that can be set that will prevent this from
>> happening?
>>
>> Thanks much
>>
>> Ramdev
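
The post-processing mentioned in this reply (removing redundant document assignments so that each document ends up in exactly one cluster) could look roughly like the sketch below. It is only an illustration under stated assumptions: clusters are modelled as a plain map from cluster label to document ids rather than as Carrot2 or Solr response objects, the class and method names are hypothetical, and "first cluster wins" is just one possible tie-breaking rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, not part of Solr or Carrot2.
final class HardClusters {

    // Keeps each document only in the first cluster (in iteration order)
    // that lists it, and drops clusters that end up empty.
    static Map<String, List<String>> removeOverlaps(Map<String, List<String>> clusters) {
        Set<String> assigned = new HashSet<String>();
        Map<String, List<String>> result = new LinkedHashMap<String, List<String>>();
        for (Map.Entry<String, List<String>> cluster : clusters.entrySet()) {
            List<String> kept = new ArrayList<String>();
            for (String docId : cluster.getValue()) {
                // Set.add returns false if an earlier cluster already claimed the id.
                if (assigned.add(docId)) {
                    kept.add(docId);
                }
            }
            if (!kept.isEmpty()) {
                result.put(cluster.getKey(), kept);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up labels and ids; LinkedHashMap preserves the cluster order.
        Map<String, List<String>> clusters = new LinkedHashMap<String, List<String>>();
        clusters.put("Solr", Arrays.asList("doc1", "doc2", "doc3"));
        clusters.put("Lucene", Arrays.asList("doc2", "doc4"));
        clusters.put("Search", Arrays.asList("doc1", "doc3"));
        System.out.println(removeOverlaps(clusters));
        // prints {Solr=[doc1, doc2, doc3], Lucene=[doc4]}
    }
}
```

Because earlier clusters take priority, the order in which clusters are fed in matters; sorting them by score or size beforehand is one way to control which cluster keeps a shared document.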
Re: assit with the Clustering component in Solr/Lucene
Yes, you can set engine-specific parameters. Check the comments in your
snippet.

> Hi:
> I recently included the Clustering component into Solr and updated the
> requestHandler accordingly (in solrconfig.xml). Snippet of the config for
> the Clustering:
>
> <searchComponent name="clusteringComponent"
>                  enable="${solr.clustering.enabled:false}"
>                  class="org.apache.solr.handler.clustering.ClusteringComponent">
>   <lst name="engine">
>     <str name="name">default</str>
>     <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
>     <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
>   </lst>
>   <lst name="engine">
>     <str name="name">stc</str>
>     <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
>   </lst>
> </searchComponent>
>
> Snippet of the config for requestHandler:
>
> default="true"> explicit true default true headline pi headline true
> false clusteringComponent
>
> When I perform a search, I see that the Cluster section within the Solr
> results shows me results that are not quite consistent. There are two
> documents that are reported in two different clusters.
>
> Are there parameters that can be set that will prevent this from
> happening?
>
> Thanks much
>
> Ramdev