[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-769:
---------------------------------

    Attachment: SOLR-769.patch

First draft of a patch.

Notes:

1. Carrot2 uses the snowball stemmers, but it shouldn't clash, b/c it actually 
slightly changes the names of them to be like englishStemmer (as opposed to 
EnglishStemmer).  I'm debating whether or not to just re-implement this so that 
it can use the same snowball stemmers we use in Solr.  Probably not a big deal.

2. I haven't implemented document clustering yet.  To do this, I need to set up 
a background thread that will be spawned to do the clustering, since it is 
presumably going through some large set of documents and clustering them.  
This will probably require term vectors.  It will also introduce a dependency 
on Mahout, so I'll need a version of that library too.

3. It would be really cool for the Carrot2 implementation to support using 
other clustering algs besides Lingo.  Basically, this just needs to be factored 
into the configuration and the jars included in the distribution.  This is not 
a high priority for me at the moment.
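
To illustrate note 3, here is a minimal sketch of how the algorithm choice 
could be factored into configuration.  Everything in it is an assumption made 
for illustration: ClusteringAlgorithm, AlgorithmRegistry and the two stand-in 
algorithms are hypothetical names, not Carrot2 or Solr classes, and the 
stand-ins do trivial grouping rather than real clustering.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical interface; NOT part of Carrot2 or Solr.
interface ClusteringAlgorithm {
    // Groups documents (plain strings here) into labeled clusters.
    Map<String, List<String>> cluster(List<String> docs);
}

// Stand-in algorithm: groups documents by their first character.
class FirstLetterAlgorithm implements ClusteringAlgorithm {
    public Map<String, List<String>> cluster(List<String> docs) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (String doc : docs) {
            String label = doc.isEmpty()
                    ? ""
                    : doc.substring(0, 1).toLowerCase(Locale.ROOT);
            clusters.computeIfAbsent(label, k -> new ArrayList<>()).add(doc);
        }
        return clusters;
    }
}

// Stand-in algorithm: puts every document into a single cluster.
class SingleClusterAlgorithm implements ClusteringAlgorithm {
    public Map<String, List<String>> cluster(List<String> docs) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        clusters.put("all", new ArrayList<>(docs));
        return clusters;
    }
}

// Maps a configured name to a factory, so the component itself never
// hard-codes which algorithm it runs.
class AlgorithmRegistry {
    private final Map<String, Supplier<ClusteringAlgorithm>> factories =
            new LinkedHashMap<>();

    void register(String name, Supplier<ClusteringAlgorithm> factory) {
        factories.put(name, factory);
    }

    ClusteringAlgorithm create(String name) {
        Supplier<ClusteringAlgorithm> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException(
                    "Unknown clustering algorithm: " + name);
        }
        return factory.get();
    }
}
```

The name passed to create() would come from the component's configuration, so 
adding an algorithm means registering it and shipping its jar.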

TODO:
More tests.
Decide on output format
Implement doc. clustering framework part (i.e. spawning of threads, commands)
????
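
For the thread-spawning item (see note 2), a rough sketch of what the 
background job could look like using only java.util.concurrent.  The class 
name BackgroundClusterer is hypothetical, and the loop body is a placeholder 
for the real clustering work; none of this is in the patch.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper; not a Solr class.
class BackgroundClusterer {
    // One worker thread, so at most one clustering run is active at a time.
    // The thread is a daemon so it never blocks JVM shutdown.
    private final ExecutorService executor =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "doc-clustering");
                t.setDaemon(true);
                return t;
            });

    // Queues a clustering run and returns immediately.  The Future yields
    // the number of documents processed once the run has finished.
    Future<Integer> submit(final List<String> docIds) {
        return executor.submit(new Callable<Integer>() {
            public Integer call() {
                int processed = 0;
                for (String id : docIds) {
                    // Placeholder: load term vectors for `id` and feed them
                    // to the clustering library here.
                    processed++;
                }
                return processed;
            }
        });
    }

    // Stops accepting new runs; a run already in progress still completes.
    void shutdown() {
        executor.shutdown();
    }
}
```

A command handler would call submit() and return to the caller right away, 
keeping the Future around so a later status command can check isDone().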

> Support Document and Search Result clustering
> ---------------------------------------------
>
>                 Key: SOLR-769
>                 URL: https://issues.apache.org/jira/browse/SOLR-769
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: clustering-libs.tar, SOLR-769.patch
>
>
> Clustering is a useful tool for working with documents and search results, 
> similar to the notion of dynamic faceting.  Carrot2 
> (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
> search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
> suited for whole-corpus clustering.  
> The patch lays out a contrib module that starts off w/ an integration of a 
> SearchComponent for doing clustering and an implementation using Carrot2.  In 
> search results mode, it will use the DocList as the input for the cluster.   
> While Carrot2 comes w/ a Solr input component, it is not the same as the 
> SearchComponent that I have in that the Carrot2 example actually submits a 
> query to Solr, whereas my SearchComponent is just chained into the Component 
> list and uses the ResponseBuilder to add in the cluster results.
> While not fully fleshed out yet, the collection-based mode will take in a 
> list of ids or just use the whole collection and will produce clusters.  
> Since this is a longer, typically offline task, there will need to be some 
> type of storage mechanism (and replication?) for the clusters.  I _may_ 
> push this off to a separate JIRA issue, but I at least want to present the 
> use case as part of the design of this component/contrib.  It may even make 
> sense that we split this out, such that the building piece is something like 
> an UpdateProcessor and then the SearchComponent just acts as a lookup 
> mechanism.
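
The build/lookup split described at the end of the issue could be sketched 
roughly as follows.  This is only an illustration under assumptions: 
ClusterStore, ClusterBuilder and ClusterLookup are hypothetical names, the 
store is an in-memory map rather than a real storage mechanism, and the 
cluster assignment is a trivial stand-in.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical shared store: document id -> cluster label.
class ClusterStore {
    private final Map<String, String> docToCluster = new ConcurrentHashMap<>();

    void assign(String docId, String label) {
        docToCluster.put(docId, label);
    }

    // Returns null when the document has not been clustered yet.
    String clusterOf(String docId) {
        return docToCluster.get(docId);
    }
}

// Build side: would sit where an UpdateProcessor sits, seeing each document
// as it is indexed.
class ClusterBuilder {
    private final ClusterStore store;

    ClusterBuilder(ClusterStore store) {
        this.store = store;
    }

    void onDocument(String docId, String text) {
        // Stand-in assignment: label by the first character of the text.
        String label = text.isEmpty() ? "empty" : text.substring(0, 1);
        store.assign(docId, label);
    }
}

// Lookup side: would sit where the SearchComponent sits, only reading
// precomputed assignments for the ids in the current result list.
class ClusterLookup {
    private final ClusterStore store;

    ClusterLookup(ClusterStore store) {
        this.store = store;
    }

    Map<String, List<String>> group(List<String> resultIds) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String id : resultIds) {
            String label = store.clusterOf(id);
            if (label == null) {
                label = "unclustered";
            }
            grouped.computeIfAbsent(label, k -> new ArrayList<>()).add(id);
        }
        return grouped;
    }
}
```

With this shape the query path does no clustering work at all, which is the 
point of the split; the open questions about storage and replication would 
live behind ClusterStore.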

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
