Hi Staszek,

Thank you very much for your advice. My problem has been solved. It was
caused by the lack of a regexp in stoplabels.en. I didn't realize that a
regular expression is required in order to filter out the words. I have added
the regexp to my stoplabels.en and it works like a charm.
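
For anyone who hits the same symptom, here is a minimal sketch of the kind of
check a stoplabels entry performs. It assumes that each line of stoplabels.en
is a Java regular expression and that a candidate label is dropped when the
whole label matches it. The pattern, the class name and the method name are
illustrative only and are not taken from Carrot2.

```java
import java.util.regex.Pattern;

public class StopLabelSketch {
    // Hypothetical stoplabels.en entry, written there as: (?i).*\bin\b.*
    // It matches any label that contains "in" as a separate word, ignoring case.
    static final Pattern CONTAINS_IN = Pattern.compile("(?i).*\\bin\\b.*");

    // Returns true when the whole label matches the pattern,
    // i.e. when the label would be filtered out under the assumption above.
    static boolean isStopLabel(String label) {
        return CONTAINS_IN.matcher(label).matches();
    }

    public static void main(String[] args) {
        String[] labels = {
            "Carbon Atoms in the Group",
            "Carbon Atoms",
            "Insulin Resistance"
        };
        for (String label : labels) {
            System.out.println(label + " -> " + (isStopLabel(label) ? "dropped" : "kept"));
        }
    }
}
```

The word boundaries (\b) keep the pattern from removing labels that merely
contain the letters "in" inside a longer word, such as "Insulin".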

-GC

On Wed, Sep 9, 2009 at 3:34 AM, Stanislaw Osinski <stac...@gmail.com> wrote:

> Hi,
>
> It seems like the problem can be on two layers: 1) getting the right
> contents of stop* files for Carrot2, 2) making sure Solr picks up the
> changes.
>
> > I tried your quick and dirty hack too. It didn't work either. Phrases like
> > "Carbon Atoms in the Group" with "in" still appear in my clustering
> > labels.
> >
>
> Here most probably layer 1) applies: if you add "in" to stopwords, the Lingo
> algorithm (Carrot2's default) will still create labels with "in" inside, but
> will not create labels starting / ending in "in". If you'd like to eliminate
> "in" completely, you'd need to put an appropriate regexp in stoplabels.*.
>
> For more details, please see Carrot2 manual:
>
>
> http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words
>
> http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-regexps
>
> The easiest way to tune the stopwords and see their impact on clusters is to
> use Carrot2 Document Clustering Workbench (see
> http://wiki.apache.org/solr/ClusteringComponent).
>
>
> > What I did is:
> >
> > 1. use the "jar uf carrot2-mini.jar stoplabels.en" command to replace the
> > stoplabels.en file.
> > 2. apply the clustering patch and recompile Solr with the new
> > carrot2-mini.jar.
> > 3. deploy the new apache-solr-1.4-dev.war to Tomcat.
> >
>
> Once you make sure the changes to stopwords.* and stoplabels.* have the
> desired effect on clusters, the above procedure should do the trick. You can
> also put the modified files in WEB-INF/classes of the WAR, if that's any
> easier.
>
> For your reference, I've updated
> http://wiki.apache.org/solr/ClusteringComponent to contain a procedure
> working with the Jetty starter distributed in Solr's examples folder.
>
>
> > <searchComponent
> > class="org.apache.solr.handler.clustering.ClusteringComponent"
> > name="clustering">
> >  <lst name="engine">
> >    <str name="name">default</str>
> >    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
> >    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
> >    <float name="carrot.lingo.threshold.clusterAssignment">0.150</float>
> >    <float name="carrot.lingo.threshold.candidateClusterThreshold">0.775</float>
> >
>
> Not really related to your issue, but the above file looks a little outdated
> -- the two parameters, "carrot.lingo.threshold.clusterAssignment" and
> "carrot.lingo.threshold.candidateClusterThreshold", are not there anymore
> (but there are many others:
> http://download.carrot2.org/stable/manual/#section.component.lingo). For
> the most up-to-date examples, please see
> http://wiki.apache.org/solr/ClusteringComponent and solrconfig.xml in
> contrib\clustering\example\conf.
>
> Cheers,
>
> Staszek
>
