[jira] [Commented] (SOLR-769) Support Document and Search Result clustering
Support Document and Search Result clustering

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4
Attachments: SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, clustering-componet-shard.patch, clustering-libs.tar, clustering-libs.tar, subcluster-flattening.patch

Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting. Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed library for doing search results clustering. Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering.

The patch lays out a contrib module that starts off with an integration of a SearchComponent for doing clustering and an implementation using Carrot2. In search results mode, it will use the DocList as the input for clustering. While Carrot2 comes with a Solr input component, it is not the same as the SearchComponent that I have: the Carrot2 example actually submits a query to Solr, whereas my SearchComponent is just chained into the component list and uses the ResponseBuilder to add in the cluster results.
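As a rough illustration of the "chained into the component list" idea in the description, here is a minimal Java sketch. Everything in it is a hypothetical stand-in: `Response`, `Component`, `QueryStage` and `ClusteringStage` are made-up names, not Solr's SearchComponent or ResponseBuilder, and the grouping by last word is a toy placeholder for a real engine such as Carrot2. It only shows a later stage reusing the document list produced by an earlier stage and adding its own section to the same response, without issuing a second query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; these are NOT Solr's SearchComponent or ResponseBuilder.
class ComponentChainSketch {

    // Shared per-request state that every stage in the chain can read and extend.
    static final class Response {
        final List<String> docList = new ArrayList<>();
        final Map<String, Object> sections = new LinkedHashMap<>();
    }

    interface Component {
        void process(Response rsp);
    }

    // Plays the role of the query stage: fills the document list.
    static final class QueryStage implements Component {
        public void process(Response rsp) {
            rsp.docList.add("solr clustering");
            rsp.docList.add("solr faceting");
            rsp.docList.add("mahout clustering");
        }
    }

    // Plays the role of the clustering stage: reads the existing document list
    // (no second query) and adds a "clusters" section to the same response.
    static final class ClusteringStage implements Component {
        public void process(Response rsp) {
            Map<String, List<String>> clusters = new LinkedHashMap<>();
            for (String doc : rsp.docList) {
                // Toy grouping by last word; a real clustering engine would go here.
                String key = doc.substring(doc.lastIndexOf(' ') + 1);
                clusters.computeIfAbsent(key, k -> new ArrayList<>()).add(doc);
            }
            rsp.sections.put("clusters", clusters);
        }
    }

    // Runs each stage in order against one shared response object.
    static Response run(List<Component> chain) {
        Response rsp = new Response();
        for (Component c : chain) {
            c.process(rsp);
        }
        return rsp;
    }

    public static void main(String[] args) {
        List<Component> chain = new ArrayList<>();
        chain.add(new QueryStage());
        chain.add(new ClusteringStage());
        System.out.println(run(chain).sections);
    }
}
```

The point of the design is that the clustering stage depends only on what earlier stages left in the shared response, which is why it can be appended to an existing chain.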
While not fully fleshed out yet, the collection-based mode will take in a list of ids, or just use the whole collection, and will produce clusters. Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication??) for the clusters. I _may_ push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib. It may even make sense to split this out, such that the building piece is something like an UpdateProcessor and the SearchComponent just acts as a lookup mechanism.

--
This message was sent by Atlassian JIRA (v6.2#6252)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970758#action_12970758 ]

Koji Sekiguchi commented on SOLR-769:

Apologies, Grant, for quoting your comment of 27/Jul/09:

bq. Also applied Stanislaw's patch.

I'm confused by this line:

{code}
List<Document> docs = outputSubClusters ? outCluster.getDocuments() : outCluster.getAllDocuments();
{code}

According to the Carrot2 Javadoc:
http://download.carrot2.org/stable/javadoc/org/carrot2/core/Cluster.html#getAllDocuments%28%29

Should it be:

{code}
List<Document> docs = outputSubClusters ? outCluster.getAllDocuments() : outCluster.getDocuments();
{code}

?

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970760#action_12970760 ]

Stanislaw Osinski commented on SOLR-769:

Hi Koji,

Actually, the current code seems right. If we don't output subclusters, we need to include all documents of the cluster, including those from its subclusters; otherwise the subclusters' documents may not appear in the response at all. But if we do output subclusters, we add only the documents assigned specifically to the cluster, because the subclusters with their documents will be included in the response too.

S.
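The distinction Stanislaw describes can be sketched in a few lines of Java. This is a hedged illustration, not the committed Solr code: the `Cluster` class below is a hypothetical stand-in that only mimics the two accessors discussed in the thread (`getDocuments()` for a cluster's own documents, `getAllDocuments()` for its own plus its subclusters'), and `toResponse` is a made-up helper name. Documents are plain strings to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a cluster tree; NOT the Carrot2 Cluster class.
class SubclusterFlatteningSketch {

    static final class Cluster {
        final String label;
        final List<String> ownDocs = new ArrayList<>();
        final List<Cluster> subclusters = new ArrayList<>();

        Cluster(String label, String... docs) {
            this.label = label;
            for (String d : docs) {
                ownDocs.add(d);
            }
        }

        Cluster add(Cluster sub) {
            subclusters.add(sub);
            return this;
        }

        // Documents assigned directly to this cluster only.
        List<String> getDocuments() {
            return ownDocs;
        }

        // Documents of this cluster plus those of all nested subclusters.
        List<String> getAllDocuments() {
            List<String> all = new ArrayList<>(ownDocs);
            for (Cluster sub : subclusters) {
                for (String d : sub.getAllDocuments()) {
                    if (!all.contains(d)) {
                        all.add(d);
                    }
                }
            }
            return all;
        }
    }

    // Nested output: own documents only, because subclusters are emitted recursively.
    // Flat output: all documents, because subclusters are not emitted at all.
    static Map<String, Object> toResponse(Cluster c, boolean outputSubClusters) {
        Map<String, Object> out = new LinkedHashMap<>();
        out.put("label", c.label);
        List<String> docs = outputSubClusters ? c.getDocuments() : c.getAllDocuments();
        out.put("docs", new ArrayList<>(docs));
        if (outputSubClusters) {
            List<Map<String, Object>> subs = new ArrayList<>();
            for (Cluster sub : c.subclusters) {
                subs.add(toResponse(sub, true));
            }
            out.put("clusters", subs);
        }
        return out;
    }

    public static void main(String[] args) {
        Cluster root = new Cluster("search", "d1").add(new Cluster("solr", "d2", "d3"));
        System.out.println(toResponse(root, true));
        System.out.println(toResponse(root, false));
    }
}
```

With the ternary reversed, the flat output would silently drop the documents that belong only to subclusters, and the nested output would list them twice.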
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970762#action_12970762 ]

Koji Sekiguchi commented on SOLR-769:

Uh, I needed to read the part with the recursive call. Thanks for the explanation!
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736030#action_12736030 ]

Stanislaw Osinski commented on SOLR-769:

Hi Grant,

There's one more thing: we're planning to release version 3.1.0 of Carrot2 with certain bug fixes in the clustering algorithm and better support for Chinese (using the new analyzer from Lucene). Our plan is to release after Lucene 2.9 is out, but before Solr 1.4, so that the latter would have a newer version of Carrot2 on board (it should be just a matter of replacing the Carrot2 JAR / upgrading the version of the downloaded dependency).

Would that make sense? Should I create a separate issue for it, or rather reopen this one?

Thanks,
S.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736037#action_12736037 ]

Grant Ingersoll commented on SOLR-769:

bq. Would that make sense? Should I create a separate issue for it, or rather reopen this one?

Yes, I think that makes sense. A separate issue would be good; this one is long enough.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736039#action_12736039 ]

Stanislaw Osinski commented on SOLR-769:

Created: SOLR-1314. I'll attach a patch there as soon as Lucene 2.9 is released.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735635#action_12735635 ]

Grant Ingersoll commented on SOLR-769:

bq. classloading issues after the handler was removed from solr.war

I think the issue is that the changes you made don't actually include the clustering code in Solr when running the example. I think we just need to copy the clustering JAR from the build directory into the lib, but that is a bit weird, IMO. To fix this, I'm going to make the example target create a proper Solr home under contrib/clustering/example, which, of course, isn't much different from how it used to be. I am also going to restore the downloads directory for packaging/release functionality.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735638#action_12735638 ]

Grant Ingersoll commented on SOLR-769:

Note, I believe there is also a classloading issue when trying to load the Carrot2 algorithm, because it does not use the SolrResourceLoader.
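For readers unfamiliar with this class of problem, the following plain-JDK sketch shows the general pattern behind Grant's classloading note: resolving a class by name through an explicitly supplied ClassLoader rather than whichever loader happens to be the caller's default. It deliberately does not use SolrResourceLoader; `LoaderAwareInstantiation` and `newInstance` are hypothetical names chosen for the example, and `java.util.ArrayList` merely stands in for a pluggable algorithm class.

```java
import java.util.List;

// Plain-JDK illustration only; SolrResourceLoader itself is not used here.
class LoaderAwareInstantiation {

    // Resolve className through the supplied loader, check that it implements
    // the expected type, and create an instance via its no-arg constructor.
    static <T> T newInstance(ClassLoader loader, String className, Class<T> expected) {
        try {
            Class<? extends T> clazz =
                    Class.forName(className, true, loader).asSubclass(expected);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        // In a plugin setup this would be the loader that can see the plugin JARs.
        ClassLoader loader = LoaderAwareInstantiation.class.getClassLoader();
        List<?> list = newInstance(loader, "java.util.ArrayList", List.class);
        System.out.println(list.getClass().getName());
    }
}
```

A class that lives only in a plugin directory is invisible to a loader that does not include that directory, which is why the choice of loader matters more than the class name itself.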
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735643#action_12735643 ]

Grant Ingersoll commented on SOLR-769:

OK, I have committed my changes and believe functionality is restored and is properly working with the SolrResourceLoader. Also applied Stanislaw's patch.

We still likely need to review how to distribute all of this. My guess is that we should only include the source, including the build and instructions for installing, and not package jars at all, since we can't include the LGPL ones necessary for Carrot2.
Re: [jira] Commented: (SOLR-769) Support Document and Search Result clustering
If you could, could my patch to handle shards be applied before you reformat, so I don't have to piece it together again and resubmit?

Brad

On Thu, Jul 2, 2009 at 9:53 AM, Yonik Seeley (JIRA) j...@apache.org wrote:

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726482#action_12726482 ]

Yonik Seeley commented on SOLR-769:

Anyone mind if I reformat the source files that currently use tabs?
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12728270#action_12728270 ] Brad Giaccio commented on SOLR-769: --- If you could, could my patch to handle shards be applied before you reformat so I don't have to piece it together again and resubmit?
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12728271#action_12728271 ] Yonik Seeley commented on SOLR-769: --- Apologies Brad - I didn't realize there were pending patches or I would have not done the reformat.
Re: [jira] Commented: (SOLR-769) Support Document and Search Result clustering
No problem, but now that you are the assignee perhaps you can apply it for me; it's attached to the ticket as 'clustering-component-shard.patch' and it includes updated JUnit tests. If it needs some work now that you have made some output changes I'll be glad to update. Thanks, Brad On Tue, Jul 7, 2009 at 2:16 PM, Yonik Seeley (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12728271#action_12728271] Yonik Seeley commented on SOLR-769: --- Apologies Brad - I didn't realize there were pending patches or I would have not done the reformat.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12727247#action_12727247 ] Yonik Seeley commented on SOLR-769: --- Of course, now that I've removed the clustering libs from the solr.war, the example no longer works for some reason... looks like all the jars are in example/clustering/solr/lib, so it's classloading issues I imagine. On a related note, I'm not sure how useful it is to have a clustering component with multiple plugins itself... the extra level of plugins seems to just add more complexity. Different plugins could always share utility classes, perhaps even base classes, and could strive for a common output format - all without going to an additional plugin model.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726482#action_12726482 ] Yonik Seeley commented on SOLR-769: --- Anyone mind if I reformat the source files that currently use tabs?
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726485#action_12726485 ] Mark Miller commented on SOLR-769: -- bq. Anyone mind if I reformat the source files that currently use tabs? +1
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725694#action_12725694 ] Yonik Seeley commented on SOLR-769: --- Now that I'm looking at some of the code, is there a reason why clustering doesn't use a SolrQueryRequest, but instead grabs a searcher directly from the core?
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725702#action_12725702 ] Grant Ingersoll commented on SOLR-769: -- bq. Now that I'm looking at some of the code, is there a reason why clustering doesn't use a SolrQueryRequest, but instead grabs a searcher directly from the core? Because the clustering engine gets initialized during core initialization and thus doesn't have a SolrQueryRequest at that time. Is there harm in the way it's being done? I suppose it adds an extra reference, right, meaning it could keep a core open longer? In the case of document clustering, I think it could be a long running job. It's not clear yet how that should work, but it is something to keep in mind. I expect to implement that sometime this summer, likely after 1.4.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725703#action_12725703 ] Grant Ingersoll commented on SOLR-769: -- Also, some implementations may need lower-level interfaces than Searcher; it just seems easier to have core access.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725730#action_12725730 ] Yonik Seeley commented on SOLR-769: --- I'm talking about the search results clustering, which is per-request. RequestHandlers should pretty much always use the core/searcher associated with the SolrQueryRequest. newSearcher/firstSearcher hooks set this themselves, hence it's a different searcher than one would get from getSearcher() (and could possibly even cause a deadlock). Architecturally, there could be any number of reasons to use a different searcher in the future... the SolrQueryRequest says which searcher to use.
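The point that the request, not the core, says which searcher to use can be illustrated with a small stand-alone sketch. None of the types below are Solr classes: `Searcher`, `Core` and `Request` are hypothetical stand-ins, and the mid-request `commit()` is only one assumed way for the core's current searcher to differ from the one bound to the request.

```java
import java.util.concurrent.atomic.AtomicReference;

// Stand-in types only: these are NOT Solr classes. They model the idea that a
// request is bound to one searcher while the core may have moved on to another.
final class RequestScopedSearcherSketch {

    /** Hypothetical searcher, identified by the index generation it was opened on. */
    static final class Searcher {
        final int generation;
        Searcher(int generation) { this.generation = generation; }
    }

    /** Hypothetical core that registers a new searcher after each commit. */
    static final class Core {
        private final AtomicReference<Searcher> current =
                new AtomicReference<>(new Searcher(1));
        Searcher getSearcher() { return current.get(); }
        void commit() { current.updateAndGet(s -> new Searcher(s.generation + 1)); }
    }

    /** Hypothetical request: captures the searcher it should use when it is created. */
    static final class Request {
        private final Searcher searcher;
        Request(Core core) { this.searcher = core.getSearcher(); }
        Searcher getSearcher() { return searcher; }
    }

    /** Per-request clustering reads from the searcher bound to the request. */
    static int generationSeenViaRequest(Request req) {
        return req.getSearcher().generation;
    }

    /** Going back to the core can observe a different view of the index. */
    static int generationSeenViaCore(Core core) {
        return core.getSearcher().generation;
    }

    public static void main(String[] args) {
        Core core = new Core();
        Request req = new Request(core);
        core.commit(); // a new searcher is registered while the request is in flight
        System.out.println("via request: " + generationSeenViaRequest(req));
        System.out.println("via core:    " + generationSeenViaCore(core));
    }
}
```

In this toy model the two lookups disagree after the commit, which is the kind of mismatch a per-request component avoids by always asking the request for its searcher.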
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725731#action_12725731 ] Grant Ingersoll commented on SOLR-769: -- Makes sense, might need to refactor some of the initialization code and the abstract clustering engine, but no big deal.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725739#action_12725739 ] Stanislaw Osinski commented on SOLR-769: bq. Is labels needed because there could be multiple labels per cluster in the future? (I assume yes) Correct. Currently neither of Carrot2's algorithms creates clusters with multiple labels, but it's quite likely that there are other algorithms that can do that.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724847#action_12724847 ]

Yonik Seeley commented on SOLR-769:

The response structure is a bit funny (it's like normal XML, which we don't really use in Solr-land), and certainly not optimal for JSON responses:

{code}
clusters:[
  cluster,[
    labels,[
      label,DDR],
    docs,[
      doc,TWINX2048-3200PRO,
      doc,VS1GB400C3,
      doc,VDBDB1A16]],
  cluster,[
    labels,[
      label,Car Power Adapter],
    docs,[
      doc,F8V7067-APL-KIT,
      doc,IW-02]],
  [...]
{code}

Is labels needed because there could be multiple labels per cluster in the future? (I assume yes)
Do we need more per-doc information than just the id? (I assume no)
Could we want other per-cluster information in the future? (I assume yes) What other possible information could be added in the future?

Given the assumptions above, clusters, docs, and labels should all be arrays instead of NamedLists (the names are just repeated, redundant info). All of the remaining NamedLists (just each cluster) should be a SimpleOrderedMap, since access by key is more important than order. That will give us something along the lines of:

{code}
clusters : [
  { labels : [DDR],
    docs : [TWINX2048-3200PRO,VS1GB400C3,VDBDB1A16] },
  { labels : [Car Power Adapter],
    docs : [F8V7067-APL-KIT,IW-02] }
]
{code}

Make sense?
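The flattened structure proposed in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: plain java.util collections stand in for Solr's response classes (a LinkedHashMap plays the role of a SimpleOrderedMap, a List plays the role of an array), and the class and method names are hypothetical, not part of the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in sketch: java.util collections instead of Solr's response classes.
// LinkedHashMap approximates a SimpleOrderedMap (access by key);
// List approximates an array of values with no repeated names.
final class ClusterResponseSketch {

    // Builds one cluster entry of the form { labels : [...], docs : [...] }.
    static Map<String, Object> cluster(List<String> labels, List<String> docIds) {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("labels", new ArrayList<>(labels));
        entry.put("docs", new ArrayList<>(docIds));
        return entry;
    }

    // Builds the top-level "clusters" array using the ids from the comment.
    static List<Map<String, Object>> exampleClusters() {
        List<Map<String, Object>> clusters = new ArrayList<>();
        clusters.add(cluster(
                Arrays.asList("DDR"),
                Arrays.asList("TWINX2048-3200PRO", "VS1GB400C3", "VDBDB1A16")));
        clusters.add(cluster(
                Arrays.asList("Car Power Adapter"),
                Arrays.asList("F8V7067-APL-KIT", "IW-02")));
        return clusters;
    }

    public static void main(String[] args) {
        // Print each cluster's labels and document ids.
        for (Map<String, Object> c : exampleClusters()) {
            System.out.println(c.get("labels") + " -> " + c.get("docs"));
        }
    }
}
```

Keeping each cluster as a keyed map leaves room for further per-cluster entries later, which matches the assumption in the comment that more per-cluster information may be added.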
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724854#action_12724854 ]

Yonik Seeley commented on SOLR-769:

I hit an error trying to cluster some documents I added with Solr Cell: 400 unknown field Author. It would be nice if we could handle unknown field types gracefully.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712534#action_12712534 ]

Stanislaw Osinski commented on SOLR-769:

In fact, you can set Carrot2 attributes (both init- and request-time) in the Solr config file; this should also work without the patch. Just add:

{{<str name="Tokenizer.analyzer">fully.qualified.class.Name</str>}}

to the search component element. See http://wiki.apache.org/solr/ClusteringComponent for some examples. You'll find a list of Carrot2 attributes, their ids and descriptions at: http://download.carrot2.org/stable/manual/#chapter.components.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712544#action_12712544 ]

Koji Sekiguchi commented on SOLR-769:

{quote}
In fact, you can set Carrot2 attributes (both init- and request-time) in the Solr config file; this should also work without the patch. Just add:
<str name="Tokenizer.analyzer">fully.qualified.class.Name</str>
{quote}

Hmm, I thought I needed to assign a Class<?> type (other than String) for the second argument of the attribute. I'll try it.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712545#action_12712545 ]

Stanislaw Osinski commented on SOLR-769:

Ah, I should have mentioned that up front: Carrot2 will try to convert the string into the type accepted by the attribute. In the case of class-typed attributes, it will try to load the class using the current thread's context classloader. Conversions are also available for numeric, boolean and enum attributes (see: http://download.carrot2.org/head/javadoc/org/carrot2/util/attribute/AttributeBinder.AttributeTransformerFromString.html). Please let me know if that works for you.
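The kind of string-to-type conversion described in the comment above might look roughly like the sketch below. It is not Carrot2's AttributeBinder code; the class, the method and the Mode enum are hypothetical names introduced only for illustration, and only the types mentioned in the comment (class, numeric, boolean, enum) are handled.

```java
// Illustrative only: this is NOT Carrot2's implementation. It sketches how a
// string read from a config file could be converted to the type that an
// attribute declares.
final class StringAttributeConverterSketch {

    // Hypothetical enum, used only to demonstrate enum conversion.
    enum Mode { FAST, ACCURATE }

    static Object convert(String value, Class<?> target) {
        if (target == String.class) {
            return value;
        }
        if (target == Integer.class || target == int.class) {
            return Integer.valueOf(value);
        }
        if (target == Double.class || target == double.class) {
            return Double.valueOf(value);
        }
        if (target == Boolean.class || target == boolean.class) {
            return Boolean.valueOf(value);
        }
        if (target.isEnum()) {
            // Match the string against the enum's constant names.
            for (Object constant : target.getEnumConstants()) {
                if (((Enum<?>) constant).name().equals(value)) {
                    return constant;
                }
            }
            throw new IllegalArgumentException("No constant " + value + " in " + target.getName());
        }
        if (target == Class.class) {
            // Class-typed attribute: load via the thread's context classloader.
            ClassLoader loader = Thread.currentThread().getContextClassLoader();
            try {
                return Class.forName(value, true, loader);
            } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException("Cannot load class " + value, e);
            }
        }
        throw new IllegalArgumentException("No conversion to " + target.getName());
    }

    public static void main(String[] args) {
        System.out.println(convert("java.lang.StringBuilder", Class.class));
        System.out.println(convert("42", Integer.class));
        System.out.println(convert("FAST", Mode.class));
    }
}
```

With a conversion step of this kind, a plain string value in the config file is enough even for a class-typed attribute such as the analyzer, as long as the named class is visible to the classloader in use.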
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712557#action_12712557 ]

Koji Sekiguchi commented on SOLR-769:

{code}
<str name="Tokenizer.analyzer">fully.qualified.class.Name</str>
{code}

This works as expected w/o my patch. Thank you, Stanislaw!
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712421#action_12712421 ]

Stanislaw Osinski commented on SOLR-769:

Pasting the comment I made on the list: The catch with analyzer is that this specific attribute is an initialization-time attribute, so you need to add it to the {{initAttributes}} map in the {{init()}} method of {{CarrotClusteringEngine}}. Please let me know if this solves the problem. If not, I'll investigate further.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712442#action_12712442 ]

Koji Sekiguchi commented on SOLR-769:

bq. The catch with analyzer is that this specific attribute is an initialization-time attribute, so you need to add it to the initAttributes map in the init() method of CarrotClusteringEngine.

This solves the problem. Thank you!
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12711137#action_12711137 ]

Grant Ingersoll commented on SOLR-769:

Committed revision 776692.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710087#action_12710087 ]

Stanislaw Osinski commented on SOLR-769:

Thanks Grant! Looking forward to seeing the code in the repo! S.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710093#action_12710093 ]

Allahbaksh Mohammedali commented on SOLR-769:

Hi Grant, I am keenly looking forward to seeing this feature. I want to see it in action as soon as possible. When will the code be committed to the repo?
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12700628#action_12700628 ]

Grant Ingersoll commented on SOLR-769:

Where can we download nni.jar from? Seems like if you only need two classes it would be easy enough to replace them with your own code.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699908#action_12699908 ] Grant Ingersoll commented on SOLR-769:

Looks like we need to make the NNI JAR be a download, too, right? It appears to be LGPL. Where does that library come from, anyway? I don't see it on Carrot trunk, but it is in the zip. And a search for it doesn't reveal much.

-Grant
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695401#action_12695401 ] Grant Ingersoll commented on SOLR-769:

Hi Stanislaw, I'm going to commit soon and I was wondering if Carrot2 has a handy place where they keep all the licenses and notices so that I can fill out Solr's NOTICE.txt and LICENSE.txt. If not, I will go collate them.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695463#action_12695463 ] Stanislaw Osinski commented on SOLR-769:

Hi Grant, If you download http://download.carrot2.org/stable/carrot2-java-api-3.0.1.zip, you'll find licenses in the lib/ folder of the distribution. That distribution contains slightly more JARs than needed for Solr (which uses carrot2-mini.jar), so you'd need to pick only those that are relevant.

S.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688132#action_12688132 ] Grant Ingersoll commented on SOLR-769:

bq. Highlighting:

Hmm, that's an interesting thought. We could check to see if highlighting is done first.

Also, you say C2 can handle full docs. Is it feasible, then, to implement it for the offline mode I have in mind, whereby you cluster the whole collection offline and then store the clusters for retrieval? I haven't implemented this yet, but was thinking some people will be interested in full corpus clustering. The nice thing, then, is that as new documents come in, they can be added to existing clusters (and maybe periodically, we re-cluster). Just thinking out loud.

Rest of the stuff in that comment sounds good. I will try out the patch.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688141#action_12688141 ] Grant Ingersoll commented on SOLR-769:

Should the MockClusteringAlgorithm be under the test source tree and not the main one? I moved it in the patch to follow.

I don't think we need to output the number of clusters, since that will be obvious from the list size. I dropped it in the patch to follow.

Also, on the response structure, we certainly could make it optional, although it means having to go do a lookup in the real doc list, which could be less than fun.

Patch to follow.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688171#action_12688171 ] Stanislaw Osinski commented on SOLR-769:

bq. Also, you say C2 can handle full docs, is it feasible, then to implement it for the offline mode I have in mind, whereby you cluster the whole collection offline and then store the clusters for retrieval? I haven't implemented this yet, but was thinking some people will be interested in full corpus clustering. The nice thing, then, is that as new documents come in, they can be added to existing clusters (and maybe periodically, we re-cluster). Just thinking out loud.

We have two variables here: the length of docs and the number of docs. Carrot2 is suitable for small numbers of docs (up to say 1000). If the docs are short (a paragraph or so), the clustering should be pretty fast, suitable for on-line processing (see: http://project.carrot2.org/algorithms.html). If the documents get longer, Carrot2 will still handle them, but will require some more time for processing; I'll try to do some measurements. But C2 is not useful for the whole collection case -- it performs all processing in-memory, and here we'd need a totally different class of algorithm, something along the lines of Mahout developments.

bq. Hmm, that's an interesting thought. We could check to see if highlighting is done first.

To quickly summarise the pros and cons of relying on highlighting being done outside of the clustering component:

Pros:
* we avoid duplication of processing (highlighting being done twice)
* simpler code of the clustering component, less configuration

Cons:
* if someone doesn't want highlighting in the search results, the clustering is likely to take more time (because it operates on full documents, and it's controlled globally)
* depending on the highlighter, we may get some markup in the summaries, which may affect clustering (I'd need to check how Carrot2 handles that)

bq. Should the MockClusteringAlgorithm be under the test source tree and not the main one? I moved it in the patch to follow

Absolutely, it should be in the test source.

bq. I don't think we need to output the number of clusters, since that will be obvious from the list size. I dropped it in the patch to follow

Makes sense, I kept it because the original version had it.

bq. Also, on the response structure, we certainly could make it optional, although it means having to go do a lookup in the real doc list, which could be less than fun.

By lookup you mean the lookup in the XML response? Here again we have a trade-off between the length of the response and ease of processing: if we repeat document titles / snippets in the clusters structure, we at least double the response size ("at least" because the same document may belong to many clusters), but can potentially save some lookups. But if we want to get some other fields of a document (other than those we repeat in the clusters list), we'd still need a lookup. To sum up, my intuition would be to avoid duplication and stick with document ids in the cluster list (this is what we do in Carrot2 XMLs as well). Optionally, the clustering component could have a list of configurable fields to be repeated in the cluster list if that's really helpful in real-world use cases.
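To make the trade-off in the comment above concrete, here is a minimal Python sketch of a cluster list that references documents by ID only and, optionally, repeats a configurable set of fields. It is an illustration under stated assumptions, not the component's code: `build_cluster_entries`, the `repeat_fields` parameter, and the `name` / `price` values are all hypothetical; only the two document IDs come from the example discussed in this thread.

```python
from typing import Dict, List, Optional, Sequence


def build_cluster_entries(
    clusters: Sequence[dict],
    docs_by_id: Dict[str, dict],
    repeat_fields: Optional[Sequence[str]] = None,
) -> List[dict]:
    """Return one entry per cluster, referencing documents by ID.

    When repeat_fields is given, the named fields are copied next to each
    ID, which grows the response but can save a lookup in the doc list.
    """
    entries = []
    for cluster in clusters:
        docs = []
        for doc_id in cluster["doc_ids"]:
            entry = {"id": doc_id}
            for field in repeat_fields or ():
                value = docs_by_id.get(doc_id, {}).get(field)
                if value is not None:
                    entry[field] = value
            docs.append(entry)
        entries.append({"labels": list(cluster["labels"]), "docs": docs})
    return entries


# Hypothetical stored fields, made up for the illustration.
docs_by_id = {
    "6H500F0": {"name": "Hard drive A", "price": 350},
    "SP2514N": {"name": "Hard drive B", "price": 92},
}
clusters = [{"labels": ["Hard Drive"], "doc_ids": ["6H500F0", "SP2514N"]}]

ids_only = build_cluster_entries(clusters, docs_by_id)
with_names = build_cluster_entries(clusters, docs_by_id, repeat_fields=["name"])
```

With `repeat_fields` left unset, every cluster entry carries IDs only, so a client still needs the main document list for anything else. Any field not named in `repeat_fields` (here `price`) requires that lookup either way.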
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680942#action_12680942 ] Stanislaw Osinski commented on SOLR-769:

Hi All, I've just uploaded a patch that passes unit tests and has a working example, but this is by no means a final version. A few outstanding questions / issues:

# h4. Response structure
I was wondering -- do we need to repeat the document contents in the 'clusters' response section? Assuming that each document in the index has a unique ID, we could reduce the size of the response by just referencing documents by IDs like this:
{code}
<lst name="clusters">
  <int name="numClusters">3</int>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">GPU VPU Clocked</str>
    </lst>
    <lst name="docs">
      <str name="doc">EN7800GTX/2DHTV/256M</str>
      <str name="doc">100-435805</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Hard Drive</str>
    </lst>
    <lst name="docs">
      <str name="doc">6H500F0</str>
      <str name="doc">SP2514N</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Other Topics</str>
    </lst>
    <lst name="docs">
      <str name="doc">9885A004</str>
    </lst>
  </lst>
</lst>
{code}
Actually, this is what I've implemented in the patch. Also, in case of hierarchical clusters I've introduced a grouping entity called clusters so that the top- and sub-levels of the response are consistent (see unit tests). Please let me know if this makes sense.

# h4. Build: compile warnings about missing SimpleXML
SimpleXML is one of the problematic dependencies as it's GPL. Luckily, it's not needed at runtime, but its absence generates warnings about missing dependencies at compile time. So the option is either to live with the warnings or to add SimpleXML (version 1.7.2) to get rid of them.

# h4. Build: copying of protowords.txt etc.
The patch includes lexical files both in contrib/clustering/src/java/test/resources/ and in the examples dir. I'm not sure how this is handled though -- do you keep copies in the repository or copy those somehow in the build?

# h4. Highlighting
This is the bit I've not yet fully analyzed. In general, Carrot2 should handle full documents fairly well (up to say a few hundred kB each); it's just the number of documents that must be in the order of hundreds. Therefore, highlighting is not mandatory, but it may sometimes improve the quality of clusters. I was wondering, if highlighting is performed earlier in the Solr pipeline, could this be reused during clustering? One possible approach could be that clustering uses whatever is fed from the pipeline: if highlighting is enabled, clustering will be performed on the highlighted content; if there was no highlighting, we'd cluster full documents. Not sure if that's reasonable / possible to implement though.

# h4. Documentation (wiki) updates
Once we stabilise the ideas, I'm happy to update the wiki with regard to the algorithms used (Lingo/STC) and passing additional parameters.
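For anyone consuming the ID-based response format proposed in the comment above, the following Python sketch shows one way a client might read the `clusters` section. It assumes the section arrives as well-formed XML shaped like the proposal (a `lst name="clusters"` wrapping `lst name="cluster"` entries); the sample string is a shortened copy of that proposal and `parse_clusters` is a hypothetical helper, not part of Solr or Carrot2.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the proposed response section (two of the three clusters).
RESPONSE = """
<lst name="clusters">
  <lst name="cluster">
    <lst name="labels"><str name="label">Hard Drive</str></lst>
    <lst name="docs">
      <str name="doc">6H500F0</str>
      <str name="doc">SP2514N</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels"><str name="label">Other Topics</str></lst>
    <lst name="docs"><str name="doc">9885A004</str></lst>
  </lst>
</lst>
"""


def parse_clusters(xml_text):
    """Return a list of (labels, doc_ids) tuples from a 'clusters' section."""
    root = ET.fromstring(xml_text)
    result = []
    for cluster in root.findall("lst[@name='cluster']"):
        labels = [
            e.text
            for e in cluster.findall("lst[@name='labels']/str[@name='label']")
        ]
        doc_ids = [
            e.text
            for e in cluster.findall("lst[@name='docs']/str[@name='doc']")
        ]
        result.append((labels, doc_ids))
    return result


parsed = parse_clusters(RESPONSE)
```

The returned IDs can then be matched against the regular document list in the same response, which is the lookup step the ID-only format trades for a smaller payload.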
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672281#action_12672281 ] Stanislaw Osinski commented on SOLR-769:

Hi Grant, I've added a Carrot2 issue referring to point 3 on your TODO list: http://issues.carrot2.org/browse/CARROT-457. I'll be looking into this over the weekend.

Staszek
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642179#action_12642179 ] Bruce Ritchie commented on SOLR-769:

Grant,

This patch looks very promising. I can't wait to give it a try and find a way to incorporate it into a project I'm working on (when it's ready of course ... likely not till after Carrot2 3 is released though).

Can you give a quick estimate as to the performance impact of enabling clustering in search results mode? In the example @ http://wiki.apache.org/solr/ClusteringFullResultsExample the query time seems pretty high, and I was wondering if that was a result of this patch or something else.

Thanks,
Bruce Ritchie
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642182#action_12642182 ] Stanislaw Osinski commented on SOLR-769:

Bruce,

For performance of the clustering algorithm alone, please take a look at: http://project.carrot2.org/algorithms.html

Obviously, you'd need to add the overhead of fetching the snippets / documents from the index. I'm not sure how many are fetched or whether they come from Solr's cache, so I can't say whether clustering or fetching time dominates.

Cheers,
Staszek
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642187#action_12642187 ]

Grant Ingersoll commented on SOLR-769:

Hi Bruce, I haven't done any performance testing, as I've been focused on functionality first. However, I'm not sure whether that query was the first one run, so I don't know the state of the searcher. I'm pretty sure I don't have any warming queries configured.

Support Document and Search Result clustering

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Attachments: clustering-libs.tar, clustering-libs.tar, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch

Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting. Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed library for doing search results clustering. Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering.

The patch lays out a contrib module that starts off with an integration of a SearchComponent for doing clustering and an implementation using Carrot. In search results mode, it will use the DocList as the input for the cluster. While Carrot2 comes with a Solr input component, it is not the same as the SearchComponent that I have, in that the Carrot example actually submits a query to Solr, whereas my SearchComponent is just chained into the Component list and uses the ResponseBuilder to add in the cluster results.

While not fully fleshed out yet, the collection-based mode will take in a list of ids or just use the whole collection and will produce clusters. Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication??) for the clusters. I _may_ push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib. It may even make sense that we split this out, such that the building piece is something like an UpdateProcessor and then the SearchComponent just acts as a lookup mechanism.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
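The issue description contrasts a component that submits its own query with one that is chained into the component list and adds its output to a shared response. A minimal, self-contained sketch of the chained approach follows. Every type here (Stage, RequestContext, QueryStage, ClusterStage) is a hypothetical stand-in invented for illustration; none of it is Solr's SearchComponent or ResponseBuilder API, and the grouping rule is a toy, not a Carrot2 algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Demo entry point. All types in this file are hypothetical, not Solr classes.
class ComponentChainDemo {
    // Runs every stage in order against one shared context and returns the response.
    static Map<String, Object> run(List<Stage> chain, String query) {
        RequestContext ctx = new RequestContext(query);
        for (Stage stage : chain) {
            stage.process(ctx);
        }
        return ctx.response;
    }

    public static void main(String[] args) {
        List<Stage> chain = Arrays.asList(new QueryStage(), new ClusterStage());
        System.out.println(run(chain, "clustering"));
    }
}

// One link in the chain (assumed contract, loosely analogous to a search component).
interface Stage {
    void process(RequestContext ctx);
}

// Shared per-request state (assumed, loosely analogous to a response builder).
class RequestContext {
    final String query;
    final List<String> docList = new ArrayList<>();
    final Map<String, Object> response = new LinkedHashMap<>();

    RequestContext(String query) {
        this.query = query;
    }
}

// Finds matching documents in a tiny in-memory corpus and records them.
class QueryStage implements Stage {
    static final List<String> CORPUS = Arrays.asList(
        "solr clustering", "solr faceting", "carrot2 clustering", "mahout kmeans");

    @Override
    public void process(RequestContext ctx) {
        for (String doc : CORPUS) {
            if (doc.contains(ctx.query)) {
                ctx.docList.add(doc);
            }
        }
        ctx.response.put("docs", new ArrayList<>(ctx.docList));
    }
}

// Reuses the docList produced earlier in the chain; it never issues its own query.
class ClusterStage implements Stage {
    @Override
    public void process(RequestContext ctx) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (String doc : ctx.docList) {
            // Toy grouping rule: the first token of the document is the cluster label.
            clusters.computeIfAbsent(doc.split(" ")[0], k -> new ArrayList<>()).add(doc);
        }
        ctx.response.put("clusters", clusters);
    }
}
```

The point of the design is that the clustering stage only reads what an earlier stage already put in the shared context, so clustering adds no second round trip to the index.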
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641823#action_12641823 ]

Grant Ingersoll commented on SOLR-769:

{quote}
So what would be the procedure to add some clustering code beyond carrot or other available libraries.
{quote}

Essentially, you need to implement either a SearchClusteringEngine or a DocumentClusteringEngine and then declare it in the SearchComponent configuration, as is done with the Carrot2 example here:

{code}
<lst name="engine">
  <!-- The name, only one can be named "default" -->
  <str name="name">default</str>
  <!-- Carrot2 specific parameters. See the Carrot2 site for details on setting. -->
  <!-- carrot.algorithm: Optional. Currently only lingo is supported pending the release of Carrot2 3.0. -->
  <str name="carrot.algorithm">lingo</str>
  <!-- Lingo specific -->
  <float name="carrot.lingo.threshold.clusterAssignment">0.150</float>
  <float name="carrot.lingo.threshold.candidateClusterThreshold">0.775</float>
</lst>
{code}

or, in the mock setup:

{code}
<lst name="engine">
  <!-- The name, only one can be named "default" -->
  <str name="name">docEngine</str>
  <str name="classname">org.apache.solr.handler.clustering.MockDocumentClusteringEngine</str>
</lst>
{code}

If you don't declare the classname value, then it assumes the Carrot implementation. Naturally, you need to take care of all the libraries being available to Solr, etc., just as you would for any plugin.

Since you are interested in clustering, Vaijanath, it would be good to get your feedback on the APIs. Are you doing full document clustering or just search snippet clustering? Also, if you are using an open source clustering library that has acceptable licensing terms (i.e. not GPL or similar), perhaps consider contributing an implementation of the engine and then we can make it available to everyone.
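The plug-in rule described above (named engines, with a fallback implementation when no classname is declared) can be sketched roughly as follows. This is only an illustration under stated assumptions: SnippetClusteringEngine, FirstWordEngine, and EngineRegistry are hypothetical names made up for this sketch and are not the SearchClusteringEngine or DocumentClusteringEngine classes from the patch; a Supplier stands in for the classname lookup; and rejecting duplicate names is an assumption, not something the comment specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Demo entry point. Every type in this file is hypothetical, not Solr API.
class EngineRegistryDemo {
    public static void main(String[] args) {
        EngineRegistry registry = new EngineRegistry();
        registry.register("default", null);                // no factory given: use the fallback
        registry.register("custom", FirstWordEngine::new); // explicit implementation
        List<String> snippets = Arrays.asList("solr clustering", "solr faceting", "carrot2 lingo");
        System.out.println(registry.get("default").cluster(snippets));
    }
}

// Hypothetical engine contract: cluster label -> member snippets.
interface SnippetClusteringEngine {
    Map<String, List<String>> cluster(List<String> snippets);
}

// Toy engine: groups snippets by their first whitespace-separated token.
class FirstWordEngine implements SnippetClusteringEngine {
    @Override
    public Map<String, List<String>> cluster(List<String> snippets) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (String snippet : snippets) {
            String label = snippet.trim().split("\\s+")[0];
            clusters.computeIfAbsent(label, k -> new ArrayList<>()).add(snippet);
        }
        return clusters;
    }
}

// Maps engine names to instances; a null factory plays the role of an omitted classname.
class EngineRegistry {
    private final Map<String, SnippetClusteringEngine> engines = new LinkedHashMap<>();

    void register(String name, Supplier<SnippetClusteringEngine> factory) {
        if (engines.containsKey(name)) {
            // Assumed policy for this sketch: engine names must be unique.
            throw new IllegalArgumentException("duplicate engine name: " + name);
        }
        Supplier<SnippetClusteringEngine> effective =
            (factory != null) ? factory : FirstWordEngine::new;
        engines.put(name, effective.get());
    }

    SnippetClusteringEngine get(String name) {
        SnippetClusteringEngine engine = engines.get(name);
        if (engine == null) {
            throw new IllegalArgumentException("unknown engine: " + name);
        }
        return engine;
    }
}
```

Keeping the fallback inside the registry means a configuration entry only has to name an implementation when it differs from the stock one, which mirrors the Carrot2 and mock entries shown in the comment.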
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641832#action_12641832 ]

Vaijanath N. Rao commented on SOLR-769:

Hi Grant, Till now I have worked mostly with full document clustering. Had never thought of search snippet clustering. I will definitely pitch in for clustering library. There are many libraries which have favourable/acceptable licensing terms which can be added to Solr.

--Thanks and Regards
Vaijanath
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639835#action_12639835 ]

Grant Ingersoll commented on SOLR-769:

Note, also, that even though I put in support for some of the other C2 (Carrot2) algorithms, I don't think they quite work yet. I think they require passing in more parameters to set some algorithm properties (for instance, for Fuzzy Ants, I think you need to set a depth) and I haven't figured those out yet. If you have C2 experience, insight would be appreciated. For now, stick to Lingo.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638875#action_12638875 ]

Andrzej Bialecki commented on SOLR-769:

FYI, Carrot2 does support a handful of different clustering algorithms (the ones I know of are Fuzzy Ants, KMeans and Suffix Tree, in addition to Lingo).
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638924#action_12638924 ]

Grant Ingersoll commented on SOLR-769:

Yeah, I probably will include the other jars and make it easy to include them. For now, I wanted to get something basic working for a talk I'm giving on Wednesday night ;-)
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638791#action_12638791 ]

Grant Ingersoll commented on SOLR-769:

Patch soon, as a start. I'm going to check in the basic directory structure and libs, and then provide a patch with the source that we can iterate on.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638814#action_12638814 ]

Grant Ingersoll commented on SOLR-769:

Still to do: more testing, get feedback, and implement the basics of document clustering. This last piece will take some more design work. I also need to validate further that the results make sense for search results clustering, but my first look suggests they do.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638637#action_12638637 ]

Grant Ingersoll commented on SOLR-769:

Starting docs at http://wiki.apache.org/solr/ClusteringComponent