Re: assit with the Clustering component in Solr/Lucene

2011-05-16 Thread ramdev.wudali
Thanks much Stan,


Ramdev

On May 16, 2011, at 11:38 AM, Stanislaw Osinski wrote:


Both of the clustering algorithms that ship with Solr 
(Lingo and STC) are designed to allow one document to appear in more than one 
cluster, which actually does make sense in many scenarios. There's no easy way 
to force them to produce hard clusterings because this would require a complete 
change in the way the algorithms work. If you need each document to belong to 
exactly one cluster, you'd have to post-process the clusters to remove the 
redundant document assignments.



On second thought, I have a simple implementation of 
k-means clustering that could do hard clustering for you. It's not available 
yet, but it will most probably be part of the next major release of Carrot2 (the 
package that does the clustering). Please watch this issue 
http://issues.carrot2.org/browse/CARROT-791 to get updates on this.



Just to let you know: Carrot2 3.5.0 has landed in Solr trunk and 
branch_3x, so you can use the bisecting k-means clustering algorithm 
(org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm) which will 
produce non-overlapping clusters for you. The downside of this simple 
implementation of k-means is that, for the time being, it produces one-word 
cluster labels rather than phrases, as Lingo and STC do.

Cheers,

S.






Re: assit with the Clustering component in Solr/Lucene

2011-05-16 Thread Stanislaw Osinski
>> Both of the clustering algorithms that ship with Solr (Lingo and STC) are
>> designed to allow one document to appear in more than one cluster, which
>> actually does make sense in many scenarios. There's no easy way to force
>> them to produce hard clusterings because this would require a complete
>> change in the way the algorithms work. If you need each document to belong
>> to exactly one cluster, you'd have to post-process the clusters to remove
>> the redundant document assignments.
>>
>
> On second thought, I have a simple implementation of k-means clustering
> that could do hard clustering for you. It's not available yet, but it will most
> probably be part of the next major release of Carrot2 (the package that does
> the clustering). Please watch this issue
> http://issues.carrot2.org/browse/CARROT-791 to get updates on this.
>

Just to let you know: Carrot2 3.5.0 has landed in Solr trunk and branch_3x,
so you can use the bisecting k-means clustering algorithm
(org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm) which
will produce non-overlapping clusters for you. The downside of this simple
implementation of k-means is that, for the time being, it produces one-word
cluster labels rather than phrases, as Lingo and STC do.

Cheers,

S.


Re: assit with the Clustering component in Solr/Lucene

2011-03-31 Thread Stanislaw Osinski
Thanks for the confirmation, I'll take a look at the issue.

S.

On Thu, Mar 31, 2011 at 17:24,  wrote:

> That did make a difference; I now see the exact number of clusters I see
> in the workbench.
> I am of course interested in why the config changes did not have much
> effect. However, I am happy that adding the threshold to my request URL
> produces the desired results.
>
> Let me know if I can do any more tests and I will do so. Thanks much.
>
> Ramdev
>
>
>
> On Mar 31, 2011, at 10:18 AM, Stanislaw Osinski wrote:
>
>
>> I added the parameter you suggested
>> (LingoClusteringAlgorithm.clusterMergingThreshold) to the searchComponent
>> section that describes the Clustering module.
>> Changing the value of the parameter did not have any effect on my search
>> results.
>>
>> However, when I used the Carrot2 workbench, I could see the effect of
>> changing the value. (from 6 clusters it went down to 2 clusters)
>>
>
> Interesting... Can you, for the sake of debugging, append
> &LingoClusteringAlgorithm.clusterMergingThreshold=0.0 to your request URL?
>
> S.
>
>
>


Re: assit with the Clustering component in Solr/Lucene

2011-03-31 Thread ramdev.wudali
That did make a difference; I now see the exact number of clusters I see in 
the workbench.
I am of course interested in why the config changes did not have much effect. 
However, I am happy that adding the threshold to my request URL produces the 
desired results.

Let me know if I can do any more tests and I will do so. Thanks much.

Ramdev



On Mar 31, 2011, at 10:18 AM, Stanislaw Osinski wrote:



 I added the parameter you suggested 
(LingoClusteringAlgorithm.clusterMergingThreshold) to the searchComponent 
section that describes the Clustering module.
Changing the value of the parameter did not have any effect on 
my search results.

However, when I used the Carrot2 workbench, I could see the 
effect of changing the value. (from 6 clusters it went down to 2 clusters)


Interesting... Can you, for the sake of debugging, append 
&LingoClusteringAlgorithm.clusterMergingThreshold=0.0 to your request URL?

S.






Re: assit with the Clustering component in Solr/Lucene

2011-03-31 Thread Stanislaw Osinski
> I added the parameter you suggested
> (LingoClusteringAlgorithm.clusterMergingThreshold) to the searchComponent
> section that describes the Clustering module.
> Changing the value of the parameter did not have any effect on my search
> results.
>
> However, when I used the Carrot2 workbench, I could see the effect of
> changing the value. (from 6 clusters it went down to 2 clusters)
>

Interesting... Can you, for the sake of debugging, append
&LingoClusteringAlgorithm.clusterMergingThreshold=0.0 to your request URL?

S.


Re: assit with the Clustering component in Solr/Lucene

2011-03-31 Thread ramdev.wudali
Hi Staszek:
 I added the parameter you suggested 
(LingoClusteringAlgorithm.clusterMergingThreshold) to the searchComponent 
section that describes the Clustering module.
Changing the value of the parameter did not have any effect on my search 
results.

However, when I used the Carrot2 workbench, I could see the effect of changing 
the value. (from 6 clusters it went down to 2 clusters)

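Here is the XML snippet for the searchComponent:

<searchComponent name="clusteringComponent"
                 enable="${solr.clustering.enabled:false}"
                 class="org.apache.solr.handler.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
    <str name="LingoClusteringAlgorithm.clusterMergingThreshold">0.0</str>
  </lst>
</searchComponent>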


I would appreciate any insights into this behavior. 

Thanks

Ramdev


On Mar 30, 2011, at 11:51 AM, Stanislaw Osinski wrote:


Hi Ramdev,

Both of the clustering algorithms that ship with Solr (Lingo and STC) 
are designed to allow one document to appear in more than one cluster, which 
actually does make sense in many scenarios. There's no easy way to force them 
to produce hard clusterings because this would require a complete change in the 
way the algorithms work. If you need each document to belong to exactly one 
cluster, you'd have to post-process the clusters to remove the redundant 
document assignments. Alternatively, in case of the Lingo algorithm, you can 
try lowering the "LingoClusteringAlgorithm.clusterMergingThreshold" to some 
value in the range of 0.2--0.5. If you do that, clusters containing overlapping 
documents will get merged. For more information about this attribute, see here: 
http://download.carrot2.org/stable/manual/#section.attribute.LingoClusteringAlgorithm.clusterMergingThreshold

Cheers,

Staszek


On Wed, Mar 30, 2011 at 18:21, Markus Jelsma wrote:


Yes, you can set engine-specific parameters. Check the comments 
in your snippet.


> Hi:
>   I recently included the Clustering component into Solr and updated the
> requestHandler accordingly (in solrconfig.xml). Snippet of the config for
> the clustering:
>
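> <searchComponent name="clusteringComponent"
>                  enable="${solr.clustering.enabled:false}"
>                  class="org.apache.solr.handler.clustering.ClusteringComponent">
>   <lst name="engine">
>     <str name="name">default</str>
>     <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
>     <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
>   </lst>
>   <lst name="engine">
>     <str name="name">stc</str>
>     <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
>   </lst>
> </searchComponent>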
>
> snippet of the Config for requestHandler
>default="true"> 
>  
>explicit
>
>true
>default
>true
>
>headline
>pi
>
>headline
>
>true
>
>
>
>false
>  
> 
>   clusteringComponent
> 
>   
>
>
> When I perform a search, I see that the Cluster section within the Solr
> results shows me results that are not quite consistent. There are two
> documents that are reported in two different clusters.
>
> Are there parameters that can be set that will prevent this from happening?
>
>
> Thanks much
>
> Ramdev






Re: assit with the Clustering component in Solr/Lucene

2011-03-30 Thread Stanislaw Osinski
> Both of the clustering algorithms that ship with Solr (Lingo and STC) are
> designed to allow one document to appear in more than one cluster, which
> actually does make sense in many scenarios. There's no easy way to force
> them to produce hard clusterings because this would require a complete
> change in the way the algorithms work. If you need each document to belong
> to exactly one cluster, you'd have to post-process the clusters to remove
> the redundant document assignments.
>

On second thought, I have a simple implementation of k-means clustering
that could do hard clustering for you. It's not available yet, but it will most
probably be part of the next major release of Carrot2 (the package that does
the clustering). Please watch this issue
http://issues.carrot2.org/browse/CARROT-791 to get updates on this.

Cheers,

S.


Re: assit with the Clustering component in Solr/Lucene

2011-03-30 Thread Stanislaw Osinski
Hi Ramdev,

Both of the clustering algorithms that ship with Solr (Lingo and STC) are
designed to allow one document to appear in more than one cluster, which
actually does make sense in many scenarios. There's no easy way to force
them to produce hard clusterings because this would require a complete
change in the way the algorithms work. If you need each document to belong
to exactly one cluster, you'd have to post-process the clusters to remove
the redundant document assignments. Alternatively, in case of the Lingo
algorithm, you can try lowering the
"LingoClusteringAlgorithm.clusterMergingThreshold" to some value in the
range of 0.2--0.5. If you do that, clusters containing overlapping documents
will get merged. For more information about this attribute, see here:
http://download.carrot2.org/stable/manual/#section.attribute.LingoClusteringAlgorithm.clusterMergingThreshold
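
As a rough illustration of the post-processing mentioned above, here is a
minimal Java sketch. It is not part of Solr or Carrot2: the class name
HardClusters and the method dedupe are hypothetical, and it assumes the
clusters have already been read from the Solr response into a map from cluster
label to document IDs, in the order the clusters were returned. The policy of
keeping a document only in the first cluster that lists it is also just one
possible choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not a Solr or Carrot2 class.
final class HardClusters {

    /**
     * Keeps each document only in the first cluster that lists it.
     * Clusters are visited in the iteration order of the input map,
     * so pass them in the order that should take priority.
     */
    static Map<String, List<String>> dedupe(Map<String, List<String>> clusters) {
        Set<String> seen = new HashSet<>();
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> cluster : clusters.entrySet()) {
            List<String> kept = new ArrayList<>();
            for (String docId : cluster.getValue()) {
                // Set.add returns false if the document is already assigned.
                if (seen.add(docId)) {
                    kept.add(docId);
                }
            }
            // Drop clusters that lost all of their documents.
            if (!kept.isEmpty()) {
                result.put(cluster.getKey(), kept);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up labels and document IDs, for illustration only.
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        clusters.put("Data Mining", Arrays.asList("doc1", "doc2", "doc3"));
        clusters.put("Text Clustering", Arrays.asList("doc2", "doc4"));
        clusters.put("Search", Arrays.asList("doc3"));
        // Prints {Data Mining=[doc1, doc2, doc3], Text Clustering=[doc4]}
        System.out.println(dedupe(clusters));
    }
}
```

Other policies are just as reasonable, for example keeping a document in the
largest or the smallest cluster that contains it; only the choice made inside
the inner loop would change.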

Cheers,

Staszek

On Wed, Mar 30, 2011 at 18:21, Markus Jelsma wrote:

> Yes, you can set engine-specific parameters. Check the comments in your
> snippet.
>
> > Hi:
> >   I recently included the Clustering component into Solr and updated the
> > requestHandler accordingly (in solrconfig.xml). Snippet of the config for
> > the clustering:
> >
> > <searchComponent name="clusteringComponent"
> >                  enable="${solr.clustering.enabled:false}"
> >                  class="org.apache.solr.handler.clustering.ClusteringComponent">
> >   <lst name="engine">
> >     <str name="name">default</str>
> >     <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
> >     <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
> >   </lst>
> >   <lst name="engine">
> >     <str name="name">stc</str>
> >     <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
> >   </lst>
> > </searchComponent>
> >
> > snippet of the Config for requestHandler
> >> default="true"> 
> >  
> >explicit
> >
> >true
> >default
> >true
> >
> >headline
> >pi
> >
> >headline
> >
> >true
> >
> >
> >
> >false
> >  
> > 
> >   clusteringComponent
> > 
> >   
> >
> >
> > When I perform a search, I see that the Cluster section within the Solr
> > results shows me results that are not quite consistent. There are two
> > documents that are reported in two different clusters.
> >
> > Are there parameters that can be set that will prevent this from
> > happening?
> >
> >
> > Thanks much
> >
> > Ramdev
>


Re: assit with the Clustering component in Solr/Lucene

2011-03-30 Thread Markus Jelsma
Yes, you can set engine-specific parameters. Check the comments in your 
snippet.

> Hi:
>   I recently included the Clustering component into Solr and updated the
> requestHandler accordingly (in solrconfig.xml). Snippet of the config for
> the clustering:
> 
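> <searchComponent name="clusteringComponent"
>                  enable="${solr.clustering.enabled:false}"
>                  class="org.apache.solr.handler.clustering.ClusteringComponent">
>   <lst name="engine">
>     <str name="name">default</str>
>     <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
>     <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
>   </lst>
>   <lst name="engine">
>     <str name="name">stc</str>
>     <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
>   </lst>
> </searchComponent>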
> 
> snippet of the Config for requestHandler
>default="true"> 
>  
>explicit
>
>true
>default
>true
>
>headline
>pi
>
>headline
>
>true
>
>
>
>false
>  
> 
>   clusteringComponent
> 
>   
> 
> 
> When I perform a search, I see that the Cluster section within the Solr
> results shows me results that are not quite consistent. There are two
> documents that are reported in two different clusters.
> 
> Are there parameters that can be set that will prevent this from happening?
> 
> 
> Thanks much
> 
> Ramdev