[jira] [Commented] (SOLR-769) Support Document and Search Result clustering

2014-07-28 Thread Patrick Morton (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076124#comment-14076124
 ] 

Patrick Morton commented on SOLR-769:
-

In 2003, individuals played in often 130 senses. 
http://www.surveyanalytics.com//userimages/sub-2/2007589/3153260/29851520/7787447-29851520-stopadd30.html
 
Years on novel adderall 5mg in fxs are limited.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, clustering-componet-shard.patch, 
 clustering-libs.tar, clustering-libs.tar, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-769) Support Document and Search Result clustering

2014-07-28 Thread Patrick Morton (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076122#comment-14076122
 ] 

Patrick Morton commented on SOLR-769:
-

The atypical athletes of similar romantic face and health sleep competition 
some treatments which can make differentiating them unlikely. 
adderall price street 
http://www.surveyanalytics.com//userimages/sub-2/2007589/3153260/29851518/7787447-29851518-stopadd28.html
 
Despite the humor walton felt for hemingway, he was solely sinister of his 
pretreatment.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, clustering-componet-shard.patch, 
 clustering-libs.tar, clustering-libs.tar, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2010-12-13 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12970758#action_12970758
 ] 

Koji Sekiguchi commented on SOLR-769:
-

Apologies Grant for quote your comment on 27/Jul/09:

bq. Also applied Stanislaw's patch.

I'm confused by this line:

{code}
ListDocument docs = outputSubClusters ? outCluster.getDocuments() : 
outCluster.getAllDocuments();
{code}

According to Carrot2 Javadoc:

http://download.carrot2.org/stable/javadoc/org/carrot2/core/Cluster.html#getAllDocuments%28%29

Should it be:

{code}
ListDocument docs = outputSubClusters ? outCluster.getAllDocuments() : 
outCluster.getDocuments();
{code}

?


 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2010-12-13 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12970760#action_12970760
 ] 

Stanislaw Osinski commented on SOLR-769:


Hi Koji,

Actually, the current code seems right: if we don't output subclusters, we need 
to include all documents of the cluster, including those from its subclusters, 
otherwise the subclusters' documents may not appear in the response at all. But 
if we do output subclusters, we add only the documents assigned specifically to 
the cluster because the subclusters with their documents will be included in 
the response too.

S.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2010-12-13 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12970762#action_12970762
 ] 

Koji Sekiguchi commented on SOLR-769:
-

Uh, I needed to read the part of the recursive call. Thanks for explanation!

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-28 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736030#action_12736030
 ] 

Stanislaw Osinski commented on SOLR-769:


Hi Grant,

There's one more thing: we're planning to release version 3.1.0 of Carrot2 with 
certain bug fixes in clustering algorithm and better support for Chinese (using 
the new analyzer from Lucene). Our plan is to release after Lucene 2.9 is out, 
but before Solr 1.4, so that the latter would have a newer version of Carrot2 
on board (should be just a matter of replacing Carrot2 JAR / upgrading version 
of the downloaded dependency). Would that make sense? Should I create a 
separate issue for it, or rather reopen this one?

Thanks,

S.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-28 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736037#action_12736037
 ] 

Grant Ingersoll commented on SOLR-769:
--

bq. Would that make sense? Should I create a separate issue for it, or rather 
reopen this one?

Yes, I think that makes sense.  Separate issue would be good, this one is long 
enough.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-28 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736039#action_12736039
 ] 

Stanislaw Osinski commented on SOLR-769:


Created: SOLR-1314. I'll attach a patch there as soon as Lucene 2.9 is released.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-27 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12735635#action_12735635
 ] 

Grant Ingersoll commented on SOLR-769:
--

bq. classloading issues after the hander was removed from solr.war

I think the issue is that changes you made don't include the actual include the 
clustering code in Solr when running the example.   I think we just need to 
copy over the clustering JAR from the build directory into the lib, but that is 
a bit weird, IMO.  

To fix, I'm going to make the example target create a proper Solr home under 
contrib/clustering/example.  Which, of course, isn't much different from how it 
used to be.  I am also going to restore the downloads directory for 
packaging/release functionality.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-27 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12735638#action_12735638
 ] 

Grant Ingersoll commented on SOLR-769:
--

Note, I believe there is also a classloading issue when trying to load the 
carrot algorithm, b/c it does not use the SolrResourceLoader

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-27 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12735643#action_12735643
 ] 

Grant Ingersoll commented on SOLR-769:
--

OK, I have committed my changes and believe functionality is restored and is 
properly working with the SolrResourceLoader.  Also applied Stanislaw's patch.

Still likely need to review how to distribute all of this.  My guess is that we 
should only include the source, including the build and instructions for 
installing, and not even package jars at all since we can't include the LGPL 
ones necessary for Carrot2.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-07 Thread Brad Giaccio
If you could could my patch to handle shards be applied before you reformat
so I don't have to piece it together again and resubmit?

Brad

On Thu, Jul 2, 2009 at 9:53 AM, Yonik Seeley (JIRA) j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726482#action_12726482]

 Yonik Seeley commented on SOLR-769:
 ---

 Anyone mind if I reformat the source files that currently use tabs?

  Support Document and Search Result clustering
  -
 
  Key: SOLR-769
  URL: https://issues.apache.org/jira/browse/SOLR-769
  Project: Solr
   Issue Type: New Feature
 Reporter: Grant Ingersoll
 Assignee: Yonik Seeley
 Priority: Minor
  Fix For: 1.4
 
  Attachments: clustering-componet-shard.patch,
 clustering-libs.tar, clustering-libs.tar, SOLR-769-analyzerClass.patch,
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
 SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip
 
 
  Clustering is a useful tool for working with documents and search
 results, similar to the notion of dynamic faceting.  Carrot2 (
 http://project.carrot2.org/) is a nice, BSD-licensed, library for doing
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is
 well suited for whole-corpus clustering.
  The patch I lays out a contrib module that starts off w/ an integration
 of a SearchComponent for doing clustering and an implementation using
 Carrot.  In search results mode, it will use the DocList as the input for
 the cluster.   While Carrot2 comes w/ a Solr input component, it is not the
 same as the SearchComponent that I have in that the Carrot example actually
 submits a query to Solr, whereas my SearchComponent is just chained into the
 Component list and uses the ResponseBuilder to add in the cluster results.
  While not fully fleshed out yet, the collection based mode will take in a
 list of ids or just use the whole collection and will produce clusters.
  Since this is a longer, typically offline task, there will need to be some
 type of storage mechanism (and replication??) for the clusters.  I _may_
 push this off to a separate JIRA issue, but I at least want to present the
 use case as part of the design of this component/contrib.  It may even make
 sense that we split this out, such that the building piece is something like
 an UpdateProcessor and then the SearchComponent just acts as a lookup
 mechanism.

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.




[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-07 Thread Brad Giaccio (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12728270#action_12728270
 ] 

Brad Giaccio commented on SOLR-769:
---

If you could , could my patch to handle shards be applied before you reformat 
so I don't have to piece it together again and resubmit?

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-07 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12728271#action_12728271
 ] 

Yonik Seeley commented on SOLR-769:
---

Apologies Brad - I didn't realize there were pending patches or I would have 
not done the reformat.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-07 Thread Brad Giaccio
No problem, but now that you are the assignee perhaps you can apply it for
me, its attached to the ticket as 'clustering-component-shard.patch'  and it
includes update junit tests.

If it needs some work now that you have made some output changes I'll be
glad to update.

Thanks,
Brad

On Tue, Jul 7, 2009 at 2:16 PM, Yonik Seeley (JIRA) j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12728271#action_12728271]

 Yonik Seeley commented on SOLR-769:
 ---

 Apologies Brad - I didn't realize there were pending patches or I would
 have not done the reformat.

  Support Document and Search Result clustering
  -
 
  Key: SOLR-769
  URL: https://issues.apache.org/jira/browse/SOLR-769
  Project: Solr
   Issue Type: New Feature
 Reporter: Grant Ingersoll
 Assignee: Yonik Seeley
 Priority: Minor
  Fix For: 1.4
 
  Attachments: clustering-componet-shard.patch,
 clustering-libs.tar, clustering-libs.tar, SOLR-769-analyzerClass.patch,
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
 SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip
 
 
  Clustering is a useful tool for working with documents and search
 results, similar to the notion of dynamic faceting.  Carrot2 (
 http://project.carrot2.org/) is a nice, BSD-licensed, library for doing
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is
 well suited for whole-corpus clustering.
  The patch I lays out a contrib module that starts off w/ an integration
 of a SearchComponent for doing clustering and an implementation using
 Carrot.  In search results mode, it will use the DocList as the input for
 the cluster.   While Carrot2 comes w/ a Solr input component, it is not the
 same as the SearchComponent that I have in that the Carrot example actually
 submits a query to Solr, whereas my SearchComponent is just chained into the
 Component list and uses the ResponseBuilder to add in the cluster results.
  While not fully fleshed out yet, the collection based mode will take in a
 list of ids or just use the whole collection and will produce clusters.
  Since this is a longer, typically offline task, there will need to be some
 type of storage mechanism (and replication??) for the clusters.  I _may_
 push this off to a separate JIRA issue, but I at least want to present the
 use case as part of the design of this component/contrib.  It may even make
 sense that we split this out, such that the building piece is something like
 an UpdateProcessor and then the SearchComponent just acts as a lookup
 mechanism.

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.




[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-04 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12727247#action_12727247
 ] 

Yonik Seeley commented on SOLR-769:
---

Of course, now that I've removed the clustering libs from the solr.war, the 
example no longer works for some reason... looks like all the jars are in 
example/clustering/solr/lib, so it's classloading issues I imagine.

On a related note, I'm not sure how useful it is to have a clustering component 
with multiple plugins itself... the extra level of plugins seems to just add 
more complexity.  Different plugins could always share utility classes, perhaps 
even base classes, and could strive for a common output format - all without 
going to an additional plugin model.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-02 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726482#action_12726482
 ] 

Yonik Seeley commented on SOLR-769:
---

Anyone mind if I reformat the source files that currently use tabs?

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-02 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726485#action_12726485
 ] 

Mark Miller commented on SOLR-769:
--

bq. Anyone mind if I reformat the source files that currently use tabs? 

+1

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-30 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725694#action_12725694
 ] 

Yonik Seeley commented on SOLR-769:
---

Now that I'm looking at some of the code, is there a reason why clustering 
doesn't use a SolrQueryRequest, but instead grabs a searcher directly from the 
core?

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-30 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725702#action_12725702
 ] 

Grant Ingersoll commented on SOLR-769:
--

bq. Now that I'm looking at some of the code, is there a reason why clustering 
doesn't use a SolrQueryRequest, but instead grabs a searcher directly from the 
core?

Because the clustering engine gets initialized during core initialization and 
thus doesn't have a SolrQueryRequest at that time.  Is there harm in the way 
it's being done?  I suppose it adds an extra reference, right, meaning it could 
keep a core open longer?  

In the case of document clustering, I think it could be a long running job.  
It's not clear yet how that should work, but it is something to keep in mind.  
I expect to implement that sometime this summer, likely after 1.4.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-30 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725703#action_12725703
 ] 

Grant Ingersoll commented on SOLR-769:
--

Also, some  implementations may need lower level interfaces than Searcher, it 
just seems easier to have core access.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-30 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725730#action_12725730
 ] 

Yonik Seeley commented on SOLR-769:
---

I'm talking about the search results clustering, which is per-request.  
RequestHandlers should pretty much always use the core/searcher associated with 
the SolrQueryRequest.  newSearcher/firstSearcher hooks set this themselves, 
hence it's a different searcher than one would get from getSearcher() (and 
could possibly even cause a deadlock).   Architecturally, there could be any 
number of reasons to use a different searcher in the future... the 
SolrQueryRequest says which searcher to use.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-30 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725731#action_12725731
 ] 

Grant Ingersoll commented on SOLR-769:
--

Makes sense, might need to refactor some of the initialization code and the 
abstract clustering engine, but no big deal.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-30 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725739#action_12725739
 ] 

Stanislaw Osinski commented on SOLR-769:


bq. Is labels is needed because there could be multiple labels per cluster in 
the future? ( I assume yes)

Correct. Currently neither of Carrot2's algorithms creates clusters with 
multiple labels, but it's quite likely that there are other algorithms that can 
do that.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-27 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724847#action_12724847
 ] 

Yonik Seeley commented on SOLR-769:
---

The response structure is a bit funny (it's like normal XML, which we don't 
really use in Solr-land), and certainly not optimal for JSON responses:

{code}
 clusters:[
  cluster,[
labels,[
 label,DDR],
docs,[
 doc,TWINX2048-3200PRO,
 doc,VS1GB400C3,
 doc,VDBDB1A16]],
  cluster,[
labels,[
 label,Car Power Adapter],
docs,[
 doc,F8V7067-APL-KIT,
 doc,IW-02]],
[...]
{code}

Is labels  is needed because there could be multiple labels per cluster in 
the future?  ( I assume yes)
Do we need more per-doc information than just the id?  (I assume no)
Could we want other per-cluster information in the future (I assume yes)
What other possible information could be added in the future?

Given the assumptions above, clusters, docs, and labels should all be 
arrays instead of NamedLists (the names are just repeated redundant info).
All of the remaining NamedLists(just each cluster) should be a 
SimpleOrderedMap since access by key is more important than order... that will 
give us something along the lines of:

{code}
clusters : [
{ labels : [DDR],
docs:[TWINX2048-3200PRO,VS1GB400C3,VDBDB1A16]
}
,
{ labels : [Car Power Adapter],
docs:[F8V7067-APL-KIT,IW-02]
}
]
{code}

Make sense?

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-27 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724854#action_12724854
 ] 

Yonik Seeley commented on SOLR-769:
---

I hit an error trying to cluster some documents I added with solr cell - 400 
unknown field Author.
Seems like it would be nice if we could handle unknown field types gracefully?

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-24 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712534#action_12712534
 ] 

Stanislaw Osinski commented on SOLR-769:


In fact, you can set Carrot2 attributes (both init- and request-time) in the 
solr config file, this should work also without the patch. Just add:

{{str name=Tokenizer.analyzerfully.qualified.class.Name/str}}

to the search component element. See 
http://wiki.apache.org/solr/ClusteringComponent for some example. You'll find 
list of Carrot2 attributes, their ids and description at: 
http://download.carrot2.org/stable/manual/#chapter.components.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-24 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712544#action_12712544
 ] 

Koji Sekiguchi commented on SOLR-769:
-

{quote}
In fact, you can set Carrot2 attributes (both init- and request-time) in the 
solr config file, this should work also without the patch. Just add:

str name=Tokenizer.analyzerfully.qualified.class.Name/str
{quote}

Hmm, I thought I need to assign Class? type (other than String) for the 
second argument of the attribute. I'll try it.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-24 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712545#action_12712545
 ] 

Stanislaw Osinski commented on SOLR-769:


Ah, I should have mentioned that up front -- Carrot2 will try to convert the 
string into the type accepted by the attribute. In case of the class-types 
attributes, it will try to load the class using the current thread's context 
classloader. Conversions are also available for numeric, boolean and enum 
attributes (see: 
http://download.carrot2.org/head/javadoc/org/carrot2/util/attribute/AttributeBinder.AttributeTransformerFromString.html).
 Please let me know if that way works for you.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-24 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712557#action_12712557
 ] 

Koji Sekiguchi commented on SOLR-769:
-

{code}
str name=Tokenizer.analyzerfully.qualified.class.Name/str
{code}

This works as expected w/o my patch. Thank you, Stanislaw!


 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-23 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712421#action_12712421
 ] 

Stanislaw Osinski commented on SOLR-769:


Pasting the comment I made on the list:

The catch with analyzer is that this specific attribute is an 
initialization-time attribute, so you need to add it to the {{initAttributes}} 
map in the {{init()}} method of {{CarrotClusteringEngine}}.

Please let me know if this solves the problem. If not, I'll investigate further.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-23 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712442#action_12712442
 ] 

Koji Sekiguchi commented on SOLR-769:
-

bq. The catch with analyzer is that this specific attribute is an 
initialization-time attribute, so you need to add it to the initAttributes map 
in the init() method of CarrotClusteringEngine.

This solves the problem. Thank you!

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-20 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12711137#action_12711137
 ] 

Grant Ingersoll commented on SOLR-769:
--

Committed revision 776692.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, 
 SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-16 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710087#action_12710087
 ] 

Stanislaw Osinski commented on SOLR-769:


Thanks Grant! Looking forward to seeing the code in the repo!

S.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, 
 SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-16 Thread Allahbaksh Mohammedali (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710093#action_12710093
 ] 

Allahbaksh Mohammedali commented on SOLR-769:
-

Hi Grant,
I am looking forward keenly to see this feature. I want to see it in action as 
soon as possible. When the Code will be comitted to repo?

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, 
 SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-04-19 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12700628#action_12700628
 ] 

Grant Ingersoll commented on SOLR-769:
--

Where can we download nni.jar from?  

Seems like if you only need two classes it would be easy enough to replace them 
with your own code.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-04-16 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699908#action_12699908
 ] 

Grant Ingersoll commented on SOLR-769:
--

Looks like we need to make the NNI JAR be a download, too, right?  It appears 
to be LGPL.  Where does that library come from, anyway?  I don't see it on 
Carrot trunk, but it is in the zip.  And a search for it doesn't reveal much.

-Grant

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-04-03 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695401#action_12695401
 ] 

Grant Ingersoll commented on SOLR-769:
--

Hi Stanislaw,

I'm going to commit soon and I was wondering if Carrot2 has a handy place where 
they keep all the licenses and notices so that I can fill out Solr's NOTICE.txt 
and LICENSE.txt.  If not, I will go collate them.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-04-03 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695463#action_12695463
 ] 

Stanislaw Osinski commented on SOLR-769:


Hi Grant,

If you download http://download.carrot2.org/stable/carrot2-java-api-3.0.1.zip, 
you'll find licenses in the lib/ folder of the distribution. That distribution 
contains slightly more JARs than needed for Solr (which uses carrot2-mini.jar), 
so you'd need to pick only those that are relevant.

S.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-03-22 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688132#action_12688132
 ] 

Grant Ingersoll commented on SOLR-769:
--

bq. Highlighting:

Hmm, that's an interesting thought.  We could check to see if highlighting is 
done first.

Also, you say C2 can handle full docs, is it feasible, then to implement it for 
the offline mode I have in mind, whereby you cluster the whole collection 
offline and then store the clusters for retrieval?  I haven't implemented this 
yet, but was thinking some people will be interested in full corpus clustering. 
 The nice thing, then, is that as new documents come in, they can be added to 
existing clusters (and maybe periodically, we re-cluster).  Just thinking 
outloud.

Rest of the stuff in that comment sounds good.  I will try out the patch.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-03-22 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688141#action_12688141
 ] 

Grant Ingersoll commented on SOLR-769:
--

Should the MockClusteringAlgorithm be under the test source tree and not the 
main one?  I moved it in the patch to follow

I don't think we need to output the number of clusters, since that will be 
obvious from the list size.  I dropped it in the patch to follow

Also, on the response structure, we certainly could make it optional, although 
it means having to go do a lookup in the real doc list, which could be less 
than fun.

Patch to follow

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-03-22 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688171#action_12688171
 ] 

Stanislaw Osinski commented on SOLR-769:


bq. Also, you say C2 can handle full docs, is it feasible, then to implement it 
for the offline mode I have in mind, whereby you cluster the whole collection 
offline and then store the clusters for retrieval? I haven't implemented this 
yet, but was thinking some people will be interested in full corpus clustering. 
The nice thing, then, is that as new documents come in, they can be added to 
existing clusters (and maybe periodically, we re-cluster). Just thinking 
outloud.

We have two variables here: the length of docs and the number of docs. Carrot2 
is suitable for small numbers of docs (up to say 1000). If the docs are short 
(a paragraph or so), the clustering should be pretty fast, suitable for on-line 
processing (see: http://project.carrot2.org/algorithms.html). If the documents 
get longer, Carrot2 will still handle them, but will require some more time for 
processing, I'll try to do some measurements. But C2 is not useful for the 
whole collection case -- it performs all processing in-memory and here we'd 
need a totally different class of algorithm, something along the lines of 
Mahout developments.

bq. Hmm, that's an interesting thought. We could check to see if highlighting 
is done first.

To quickly summarise the pros and cons of relying on highlighting being done 
outside of the clustering component:

Pros:

* we avoid duplication of processing (highlighting being done twice)
* simpler code of the clustering component, less configuration

Cons:

* if someone doesn't want highlighting in the search results, the clustering is 
likely to take more time (because it operates on full documents, and it's 
controlled globally)
* depending on the highlighter, we may get some markup in the summaries, which 
may affect clustering (I'd need to check how Carrot2 handles that)

bq. Should the MockClusteringAlgorithm be under the test source tree and not 
the main one? I moved it in the patch to follow 

Absolutely, it should be in the test source.

bq. I don't think we need to output the number of clusters, since that will be 
obvious from the list size. I dropped it in the patch to follow

Makes sense, I kept it because the original version had it.

bq. Also, on the response structure, we certainly could make it optional, 
although it means having to go do a lookup in the real doc list, which could be 
less than fun.

By lookup you mean the lookup in the XML response? Here again we have a trade 
off between the length of the response and ease of processing: if we repeat 
document titles / snippets in the clusters structure, we at least double the 
response size (at least because the same document may belong to many clusters), 
but can potentially save some lookups. But if we want to get some other fields 
of a document (other than we repeat in the clusters list), we'd still need a 
lookup. 

To sum up, my intuition would be to avoid duplication and stick with document 
ids in cluster list (this is what we do in Carrot2 XMLs as well). Optionally, 
the clustering component could have a list of configurable fields to be 
repeated in the cluster list if that's really helpful in real-word use cases.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.zip


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or 

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-03-11 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680942#action_12680942
 ] 

Stanislaw Osinski commented on SOLR-769:


Hi All,

I've just uploaded a patch that passes unit tests and has working example, but 
this is by no means a final version. A few outstanding questions / issues:

# h4. Response structure.

I was wondering -- to we need to repeat the document contents in the 'clusters' 
response section? Assuming that each document in the index has a unique ID, we 
could reduce the size of the response by just referencing documents by IDs like 
this:
\\
{code}
lst name=clusters
 int name=numClusters3/int
 lst name=cluster
  lst name=labels
str name=labelGPU VPU Clocked/str
  /lst
  lst name=docs
str name=docEN7800GTX/2DHTV/256M/str
str name=doc100-435805/str
  /lst
 /lst
 lst name=cluster
  lst name=labels
str name=labelHard Drive/str
  /lst
  lst name=docs
str name=doc6H500F0/str
str name=docSP2514N/str
  /lst
 /lst
 lst name=cluster
  lst name=labels
str name=labelOther Topics/str
  /lst
  lst name=docs
str name=doc9885A004/str
  /lst
 /lst
{code}
Actually, this is what I've implemented in the patch.

Also, in case of hierarchical clusters I've introduced a grouping entity called 
clusters so that the top- and sub-levels or the response are consistent (see 
unit tests). Please let me know if this makes sense.

# h4 Build: compile warnings about missing SimpleXML

SimpleXML is one of the problematic dependencies as it's GPL. Luckily, it's not 
needed at runtime, but generates warnings about missing dependencies during 
compile time. So the option is either to live with the warnings or to add 
SimpleXML (version 1.7.2) to get rid of the warnings.

# h4 Build: copying of protowords.txt etc

The patch includes lexical files both in the 
contrib/clustering/src/java/test/resources/ and in the examples dir. I'm 
not sure how this is handled though -- do you keep copies in the repository or 
copy those somehow in the build?

# h4 Highlighting

This is the bit I've not yet fully analyzed. In general, Carrot2 should fairly 
well handle full documents (up to say a few hundred kB each), it's just the 
number of documents that must be in the order of hundreds. Therefore, 
highlighting is not mandatory, but it may sometimes improve the quality of 
clusters.

I was wondering, if highlighting is performed earlier in the Solr pipeline, 
could this be reused during clustering? One possible approach could be that 
clustering uses whatever is fed from the pipeline: if highlighting is enabled, 
clustering will be performed on the highlighted content, if there was no 
highlighting, we'd cluster full documents. Not sure if that's reasonable / 
possible to implement though.

# h4 Documentation (wiki) updates

Once we stabilise the ideas, I'm happy to update the wiki with regard to the 
algorithms used (Lingo/STC) and passing additional parameters.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, 

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-02-10 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12672281#action_12672281
 ] 

Stanislaw Osinski commented on SOLR-769:


Hi Grant,

I've added a Carrot2 issue referring to point 3 on your TODO list: 
http://issues.carrot2.org/browse/CARROT-457. I'll be looking into this over the 
weekend.

Staszek

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-23 Thread Bruce Ritchie (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12642179#action_12642179
 ] 

Bruce Ritchie commented on SOLR-769:


Grant,

This patch looks very promising, I can't wait to give it a try and find a way 
to incorporate it into a project I'm working on (when it's ready of course ... 
likely not till after Carrot2 3  is released though)

Can you give a quick estimate as to the performance impact of enabling 
clustering in search results mode? In the example @ 
http://wiki.apache.org/solr/ClusteringFullResultsExample the query time seems 
pretty high and I was wondering if that was a result of this patch or something 
else?

Thanks,

Bruce Ritchie

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-23 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12642182#action_12642182
 ] 

Stanislaw Osinski commented on SOLR-769:


Bruce,

For performance of the clustering algorithm alone, please take a look at: 
http://project.carrot2.org/algorithms.html
Obviously, you'd need to add the overhead of fetching the snippets / documents 
from the index. Not sure how many are fetched and whether they come from Solr's 
cache or not, so not sure if clustering or fetching time is prevailing.

Cheers,

Staszek

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-23 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12642187#action_12642187
 ] 

Grant Ingersoll commented on SOLR-769:
--

Hi Bruce,

I haven't done any perf. testing, as I've been focused on functionality first.  
However, I'm not sure whether that query was the first one run, or not, so I 
don't know the status of the searcher, etc.  I'm pretty sure I don't have any 
warming queries, etc.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-22 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641823#action_12641823
 ] 

Grant Ingersoll commented on SOLR-769:
--

{quote}So what would be the procedure to add some clustering code beyond carrot 
or other available libraries.
{quote}

Essentially, you need to implement either a SearchClusteringEngine or a 
DocumentClusteringEngine and then hook declare it in the SearchComponent 
configuration, as is done with the Carrot2 example here:
{code}
lst name=engine
  !-- The name, only one can be named default --
  str name=namedefault/str
  !-- Carrot2 specific parameters.  See the Carrot2 site for details on 
setting. --
  !-- carrot.algorithm:   Optional.  Currently only
  lingo is supported pending the release of Carrot2 3.0.  
   --
  str name=carrot.algorithmlingo/str
  !-- Lingo specific --
  float name=carrot.lingo.threshold.clusterAssignment0.150/float
  float 
name=carrot.lingo.threshold.candidateClusterThreshold0.775/float
/lst
{code}
or, in the mock setup:
{code}
lst name=engine
  !-- The name, only one can be named default --
  str name=namedocEngine/str
  str 
name=classnameorg.apache.solr.handler.clustering.MockDocumentClusteringEngine/str
/lst
{code}

If you don't declare the classname value, then it assumes the Carrot 
implementation.

Naturally, you need to take care of all the libraries being available to Solr, 
etc. just as you would for any plugin.

Since you are interested in clustering, Vaijanath, it would be good to get your 
feedback on the APIs.  Are you doing full document clustering or just search 
snippet clustering?   Also, if you are using an open source clustering library 
that has acceptable licensing terms (i.e. not GPL or similar), perhaps consider 
contributing an implementation of the engine and then we can make it available 
to everyone.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-22 Thread Vaijanath N. Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641832#action_12641832
 ] 

Vaijanath N. Rao commented on SOLR-769:
---

Hi Grant,

Till now I have worked mostly with full document clustering. Had never thought 
of search snippet clustering.  I will definitely pitch in for clustering 
library. There are many libraries which have favourable/acceptable licensing 
terms which can be added to Solr.

--Thanks and Regards
Vaijanath

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-15 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12639835#action_12639835
 ] 

Grant Ingersoll commented on SOLR-769:
--

Note, also, that even though I put in support for some of the other C2 
(Carrot2) algorithms, I don't think they quite work yet.  I think they require 
passing in more parameters to set some algorithm properties (for instance, for 
Fuzzy Ants, I think you need to set a depth) and I haven't figured those out 
yet.  If you have C2 experience, insight would be appreciated.

For now, stick to Lingo.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-12 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638875#action_12638875
 ] 

Andrzej Bialecki  commented on SOLR-769:


FYI, Carrot2 does support a handful of different clustering algorithms (the 
ones I know of are Fuzzy Ants, KMeans and Suffix Tree, in addition to Lingo).

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, SOLR-769.patch, SOLR-769.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638924#action_12638924
 ] 

Grant Ingersoll commented on SOLR-769:
--

Yeah, I probably will include the other jars and make it easy to include them.  
For now, I wanted to get something basic working for a talk I'm giving on 
Wednesday night ;-)

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, SOLR-769.patch, SOLR-769.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638791#action_12638791
 ] 

Grant Ingersoll commented on SOLR-769:
--

Patch soon, as a start.  I'm going to check in the basic directory structure 
and libs, and then provide a patch with the source that we can iterate on.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor

 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638814#action_12638814
 ] 

Grant Ingersoll commented on SOLR-769:
--

Still to do, more testing, get feedback, implement basics of doc. clustering.  
This last piece will take some more design work.  Also need to validate some 
more that the results make sense for search results clustering, but my first 
look suggests they do.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, SOLR-769.patch, SOLR-769.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-10 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638637#action_12638637
 ] 

Grant Ingersoll commented on SOLR-769:
--

Starting docs at http://wiki.apache.org/solr/ClusteringComponent

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor

 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.