Hi Staszek,

Thank you very much for your advice. My problem has been solved. The cause was in stoplabels.en: I didn't realize that a regular expression is required there in order to filter out the words. I have added the regexp to my stoplabels.en and it works like a charm.
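For anyone who finds this thread later, here is a minimal sketch of the idea. It is written in Python rather than Carrot2's Java, and it assumes that each stoplabels line is a regular expression that has to match the whole label for that label to be dropped. The pattern and the filter_labels helper are hypothetical illustrations, not the exact entry from my file and not Carrot2's own code; please check the Carrot2 manual for the real matching rules.

```python
import re

# Hypothetical stoplabels.en-style entries, one regular expression per line.
# Assumption: an entry must match the whole label for the label to be dropped.
STOP_LABEL_PATTERNS = [
    r"(?i).*\bin\b.*",  # any label containing the standalone word "in"
]


def filter_labels(labels, patterns=STOP_LABEL_PATTERNS):
    """Return the labels that match none of the stop-label patterns."""
    compiled = [re.compile(p) for p in patterns]
    return [
        label
        for label in labels
        if not any(rx.fullmatch(label) for rx in compiled)
    ]


candidates = ["Carbon Atoms in the Group", "Carbon Atoms", "Indium Compounds"]
kept = filter_labels(candidates)
print(kept)
```

The word boundaries (\b) are there so that only the standalone word "in" triggers the filter; a label such as "Indium Compounds" is left alone.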
-GC

On Wed, Sep 9, 2009 at 3:34 AM, Stanislaw Osinski <stac...@gmail.com> wrote:

> Hi,
>
> It seems like the problem can be on two layers: 1) getting the right
> contents of stop* files for Carrot2, 2) making sure Solr picks up the
> changes.
>
> > I tried your quick and dirty hack too. It didn't work either. Phrases like
> > "Carbon Atoms in the Group" with "in" still appear in my clustering
> > labels.
>
> Here most probably layer 1) applies: if you add "in" to stopwords, the Lingo
> algorithm (Carrot2's default) will still create labels with "in" inside, but
> will not create labels starting / ending in "in". If you'd like to eliminate
> "in" completely, you'd need to put an appropriate regexp in stoplabels.*.
>
> For more details, please see the Carrot2 manual:
>
> http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words
> http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-regexps
>
> The easiest way to tune the stopwords and see their impact on clusters is to
> use the Carrot2 Document Clustering Workbench (see
> http://wiki.apache.org/solr/ClusteringComponent).
>
> > What I did is:
> >
> > 1. use the "jar uf carrot2-mini.jar stoplabels.en" command to replace the
> >    stoplabels.en file.
> > 2. apply the clustering patch and recompile Solr with the new
> >    carrot2-mini.jar.
> > 3. deploy the new apache-solr-1.4-dev.war to Tomcat.
>
> Once you make sure the changes to stopwords.* and stoplabels.* have the
> desired effect on clusters, the above procedure should do the trick. You can
> also put the modified files in WEB-INF/classes of the WAR, if that's any
> easier.
>
> For your reference, I've updated
> http://wiki.apache.org/solr/ClusteringComponent to contain a procedure
> working with the Jetty starter distributed in Solr's examples folder.
>
> > <searchComponent
> >     class="org.apache.solr.handler.clustering.ClusteringComponent"
> >     name="clustering">
> >   <lst name="engine">
> >     <str name="name">default</str>
> >     <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
> >     <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
> >     <float name="carrot.lingo.threshold.clusterAssignment">0.150</float>
> >     <float name="carrot.lingo.threshold.candidateClusterThreshold">0.775</float>
>
> Not really related to your issue, but the above file looks a little outdated:
> the two parameters "carrot.lingo.threshold.clusterAssignment" and
> "carrot.lingo.threshold.candidateClusterThreshold" are not there anymore
> (but there are many others:
> http://download.carrot2.org/stable/manual/#section.component.lingo). For the
> most up-to-date examples, please see
> http://wiki.apache.org/solr/ClusteringComponent and solrconfig.xml in
> contrib\clustering\example\conf.
>
> Cheers,
>
> Staszek