Re: Solr nightly build failure
Hi there, The carrot2 download links are broken. I have filed a bug with them: http://issues.carrot2.org/browse/CARROT-653 It's fixed now, thanks for the report! When Solr switches to Carrot2 v3.2.0 (https://issues.apache.org/jira/browse/SOLR-1804), which is LGPL-free, the extra build dependency on the remote resource will be gone too. Cheers, Staszek
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845441#action_12845441 ] Stanislaw Osinski commented on SOLR-1804: - Hi Robert, the Lucene dependency is the only change, right? Or did you also upgrade Carrot2, e.g. from 3.1 to 3.2? If the latter is the case, the number of clusters may have changed, e.g. because we tuned stop words or other algorithm attributes. S. Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845459#action_12845459 ] Stanislaw Osinski commented on SOLR-1804: - I was about to offer advice similar to Grant's, but wanted to wait to confirm the scope of changes. If it was only a Lucene dependency update, and assuming the update didn't change the documents fed to Carrot2 in tests, the results shouldn't change. Carrot2 uses Lucene interfaces internally, but the tokenizer is not the standard Lucene one, so there should be no Version.LUCENE_* issues as far as I can tell. I haven't got the Solr code handy, but maybe the test performs clustering on summaries generated from the original test documents and Lucene 3.x introduces some changes in the way summaries are generated? If the clusters look reasonable, the problem is probably not critical, but it is still worth investigating to make sure it's not a bug of some kind. S.
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845462#action_12845462 ] Stanislaw Osinski commented on SOLR-1804: - Yeah, the clusters look good. When you're done with upgrading Lucene to 3.x, we could also upgrade Carrot2 to version 3.2.0, which is LGPL-free and could be distributed together with Solr. S.
[jira] Resolved: (SOLR-1809) Carrot2 clustering time logging
[ https://issues.apache.org/jira/browse/SOLR-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski resolved SOLR-1809. - Resolution: Invalid Hi Erik! You're right, {{debugQuery}} should be enough for most cases. Resolving as invalid. Carrot2 clustering time logging --- Key: SOLR-1809 URL: https://issues.apache.org/jira/browse/SOLR-1809 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Stanislaw Osinski Fix For: 1.5 Attachments: SOLR-1809.patch It may be useful to log the amount of time Carrot2 spent on clustering. This should be helpful when debugging performance issues.
[jira] Created: (SOLR-1809) Carrot2 clustering time logging
Carrot2 clustering time logging --- Key: SOLR-1809 URL: https://issues.apache.org/jira/browse/SOLR-1809 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Stanislaw Osinski Fix For: 1.5 It may be useful to log the amount of time Carrot2 spent on clustering. This should be helpful when debugging performance issues.
[jira] Updated: (SOLR-1809) Carrot2 clustering time logging
[ https://issues.apache.org/jira/browse/SOLR-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-1809: Attachment: SOLR-1809.patch An initial patch. I'm not sure what Solr's logging policies are, feel free to change the level as appropriate.
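For readers without the attachment at hand, the idea behind SOLR-1809 (logging how long clustering took) can be sketched roughly as follows. This is not the content of SOLR-1809.patch: the class {{ClusteringTimer}}, the method {{timed}}, and the use of java.util.logging at INFO level are all assumptions made to keep the example self-contained.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;
import java.util.logging.Logger;

// Hypothetical helper, not part of Solr or Carrot2.
final class ClusteringTimer {
    private static final Logger LOG = Logger.getLogger(ClusteringTimer.class.getName());

    // Runs the task, logs its wall-clock duration, and returns the task's result.
    static <T> T timed(String label, Supplier<T> task) {
        long start = System.nanoTime();
        try {
            return task.get();
        } finally {
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            // The log level is a guess; the patch author left the level open as well.
            LOG.info(label + " took " + elapsedMs + " ms");
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real clustering call; returns a fake cluster count.
        int clusterCount = timed("Carrot2 clustering", () -> 3);
        System.out.println("clusters: " + clusterCount);
    }
}
```

As noted in the resolution of this issue, {{debugQuery}} already covers most cases, so a helper like this would mainly be of interest outside a debug request.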
[jira] Commented: (SOLR-1692) CarrotClusteringEngine produce summary does nothing
[ https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795925#action_12795925 ] Stanislaw Osinski commented on SOLR-1692: -

{quote}
bq. Where should the configuration of the highlighter we use for clustering come from?
We have all the code hooked in for it already, we're just ignoring the output.
{quote}

To avoid confusion and questions along the lines of "why clusters don't match the (highlighted) documents I'm seeing", I'd suggest a slightly more elaborate scenario for the clustering highlighter configuration:

1. If main Solr highlighting is disabled, use the clustering component's highlighter settings.
2. If main Solr highlighting is enabled, use the main highlighter's configuration as the defaults and let the clustering-specific highlighter configuration override the defaults.

If we do it this way, we'll minimize the chances of users accidentally performing clustering on documents different (differently highlighted) than those they will see.

bq. Would be great if, Carrot2 could also just use the analysis that Lucene/Solr produces, that way it would be much easier to configure stopwords, HTML stripping, etc.

This one would require some larger changes to Carrot2 internals. We do use Lucene infrastructure for preprocessing (currently for tokenization), but I can investigate if we can extend that further. A potential problem here is that very often the set of stopwords you use for document retrieval may not work equally well for clustering. I've filed a [Carrot2-specific issue|http://issues.carrot2.org/browse/CARROT-606] for it and will try to come up with something.
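The two-step precedence proposed in the comment above (main highlighter settings as defaults, clustering-specific settings as overrides) could look roughly like this. It is only a sketch of the proposal, not code from SOLR-1692.patch: the class {{ClusteringHighlightParams}} and the method {{resolve}} are hypothetical, and plain maps stand in for Solr's parameter objects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating the proposed precedence rules.
final class ClusteringHighlightParams {

    // Returns the highlighter settings the clustering component would use.
    static Map<String, String> resolve(boolean mainHighlightingEnabled,
                                       Map<String, String> mainParams,
                                       Map<String, String> clusteringParams) {
        Map<String, String> effective = new LinkedHashMap<>();
        if (mainHighlightingEnabled) {
            // Case 2: the main highlighter's configuration provides the defaults.
            effective.putAll(mainParams);
        }
        // Cases 1 and 2: clustering-specific settings always win.
        effective.putAll(clusteringParams);
        return effective;
    }

    public static void main(String[] args) {
        // Parameter names are used for illustration only.
        Map<String, String> main = Map.of("hl.fragsize", "100", "hl.snippets", "1");
        Map<String, String> clustering = Map.of("hl.fragsize", "200");
        System.out.println(resolve(true, main, clustering));
        System.out.println(resolve(false, main, clustering));
    }
}
```

With main highlighting enabled, the user sees and clusters on fragments produced from the same defaults unless the clustering configuration deliberately overrides one of them, which is the point of the proposal.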
CarrotClusteringEngine produce summary does nothing --- Key: SOLR-1692 URL: https://issues.apache.org/jira/browse/SOLR-1692 Project: Solr Issue Type: Bug Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll Fix For: 1.5 Attachments: SOLR-1692.patch In the CarrotClusteringEngine, the produceSummary option does nothing, as the results of doing the highlighting are just ignored.
[jira] Commented: (SOLR-236) Field collapsing
[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795067#action_12795067 ] Stanislaw Osinski commented on SOLR-236: Hi Grant,

{quote}
I would note, in looking at the Carrot2 code, they actually have a ByFieldClusteringAlgorithm (what they call synthetic clustering) which does field collapsing/clustering on a value of a field. To quote the javadocs: Clusters documents into a flat structure based on the values of some field of the documents. By default the {@link Document#SOURCES} field is used and Name of the field to cluster by. Each non-null scalar field value with distinct hash code will give raise to a single cluster, named using the {@link Object#toString()} value of the field. If the field value is a collection, the document will be assigned to all clusters corresponding to the values in the collection. Note that arrays will not be 'unfolded' in this way. I don't know how it performs, but it seems like it would at least be worth investigating.
{quote}

Carrot2's {{ByFieldClusteringAlgorithm}} is very simple. It literally throws everything into a hash map based on the field value ([source code|http://fisheye3.atlassian.com/browse/carrot2/trunk/core/carrot2-algorithm-synthetic/src/org/carrot2/clustering/synthetic/ByFieldClusteringAlgorithm.java?r=trunk#l99]). This algorithm is used in our live demo to [cluster by news source|http://search.carrot2.org/stable/search?source=boss-newsquery=iphonealgorithm=source].

{quote}
Note, they also have a synthetic one for collapsing based on URL: ByUrlClusteringAlgorithm
{quote}

This one creates a [hierarchy based on the URL segments|http://search.carrot2.org/stable/search?source=boss-webquery=solralgorithm=urlresults=200] and might be useful to create by-domain collapsing if needed.
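To make the "throws everything into a hash map" description concrete, here is a minimal sketch of by-field grouping that follows the quoted javadoc: null values are skipped, scalar values produce one cluster each, collection values assign the document to several clusters, and arrays are not unfolded. It is not Carrot2's actual code. The class {{ByFieldGrouping}}, the method {{clusterByField}}, and the representation of documents as plain maps are assumptions, and clusters are keyed by the value's {{toString()}} label, which simplifies the hash-code-based behaviour described in the javadoc.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; documents are modelled as field-name-to-value maps.
final class ByFieldGrouping {

    // Groups documents by the value of the given field, keyed by the value's label.
    static Map<String, List<Map<String, Object>>> clusterByField(
            List<Map<String, Object>> docs, String field) {
        Map<String, List<Map<String, Object>>> clusters = new LinkedHashMap<>();
        for (Map<String, Object> doc : docs) {
            Object value = doc.get(field);
            if (value == null) {
                continue; // documents without the field end up in no cluster
            }
            if (value instanceof Collection) {
                // Collection values: the document joins one cluster per element.
                for (Object item : (Collection<?>) value) {
                    if (item != null) {
                        clusters.computeIfAbsent(item.toString(), k -> new ArrayList<>()).add(doc);
                    }
                }
            } else {
                // Scalars (and arrays, which are not unfolded) form a single cluster.
                clusters.computeIfAbsent(value.toString(), k -> new ArrayList<>()).add(doc);
            }
        }
        return clusters;
    }
}
```

Grouping of this kind is a single pass over the documents, which fits the remark that the algorithm is very simple.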
In general, my rough guess is that the criteria for content-based collapsing would be closer to duplicate detection than to the type of grouping Carrot2 produces. Field collapsing Key: SOLR-236 URL: https://issues.apache.org/jira/browse/SOLR-236 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Emmanuel Keller Assignee: Shalin Shekhar Mangar Fix For: 1.5 Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch, field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, quasidistributed.additional.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch This patch includes a new feature called Field collapsing. It is used to collapse a group of results with similar values for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site are collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also Duplicate detection.
http://www.fastsearch.com/glossary.aspx?m=48amid=299

The implementation adds 3 new query parameters (SolrParams):
- collapse.field: to choose the field used to group results
- collapse.type: normal (default value) or adjacent
- collapse.max: to select how many continuous results are allowed before collapsing

TODO (in progress):
- More documentation (on source code)
- Test cases

Two patches:
- field_collapsing.patch for current development version
- field_collapsing_1.1.0.patch for Solr-1.1.0

P.S.: Feedback and misspelling corrections are welcome ;-)
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12760238#action_12760238 ] Stanislaw Osinski commented on SOLR-1314: - The required change is right at the end of the big diff:

{noformat}
Index: contrib/clustering/src/test/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngineTest.java
===================================================================
--- contrib/clustering/src/test/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngineTest.java (revision 819270)
+++ contrib/clustering/src/test/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngineTest.java (working copy)
@@ -40,11 +40,11 @@
 @SuppressWarnings("unchecked")
 public class CarrotClusteringEngineTest extends AbstractClusteringTest {
   public void testCarrotLingo() throws Exception {
-    checkEngine(getClusteringEngine("default"), 9);
+    checkEngine(getClusteringEngine("default"), 10);
   }

   public void testCarrotStc() throws Exception {
-    checkEngine(getClusteringEngine("stc"), 2);
+    checkEngine(getClusteringEngine("stc"), 1);
   }

   public void testWithoutSubclusters() throws Exception {
{noformat}

Upgrade Carrot2 to version 3.1.0 Key: SOLR-1314 URL: https://issues.apache.org/jira/browse/SOLR-1314 Project: Solr Issue Type: Task Reporter: Stanislaw Osinski Assignee: Grant Ingersoll Fix For: 1.4 Attachments: SOLR-1314.patch As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes in clustering algorithms and improved clustering in Chinese. The upgrade should be a matter of upgrading {{carrot2-mini.jar}} and {{google-collections.jar}}.
[jira] Updated: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-1314: Attachment: SOLR-1314.patch Hi Grant, I've built Carrot2 3.1.0 binaries and tested them with Solr trunk. Attached is a patch that upgrades the libs to Carrot2 3.1.0 and fixes one unit test. S.
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12759667#action_12759667 ] Stanislaw Osinski commented on SOLR-1314: - Hi Grant, bq. Now that Lucene is final, can we finalize the jar for this one? Sure, over the weekend we'll be making an official Carrot2 3.1.0 release. As part of that process I'll check if the Solr plugin is working fine and will post the final JAR here. bq. Also, this final JAR will handle the license and FastVector stuff, right? Correct. The following commit removed it from trunk and hence the 3.1.0 release: http://fisheye3.atlassian.com/changelog/carrot2/?cs=3694 S.
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758843#action_12758843 ] Stanislaw Osinski commented on SOLR-1314: - Hi Grant, I've made Carrot2's dependency on Smart Chinese Analyzer optional, so no exceptions should be thrown when the big JAR is not in the classpath. As usual, download from here: http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/ S.
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12756110#action_12756110 ] Stanislaw Osinski commented on SOLR-1314: - Hi Grant, I've just dropped the patenting clause entirely. The updated license is in the repo and at: http://www.carrot2.org/carrot2.LICENSE. S.
[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer
[ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12756177#action_12756177 ] Stanislaw Osinski commented on SOLR-1336: - Keeping the Chinese analyzer JAR optional sounds good. As Carrot2 also uses it, I'd need to make sure the clustering contrib doesn't fail when the JAR is not there and clustering in Chinese is requested (I think I'd simply log a WARN saying that the Chinese analyzer JAR is required for best clustering results). Add support for lucene's SmartChineseAnalyzer - Key: SOLR-1336 URL: https://issues.apache.org/jira/browse/SOLR-1336 Project: Solr Issue Type: New Feature Components: Analysis Reporter: Robert Muir Attachments: SOLR-1336.patch, SOLR-1336.patch, SOLR-1336.patch SmartChineseAnalyzer was contributed to lucene, it indexes simplified chinese text as words. if the factories for the tokenizer and word token filter are added to solr it can be used, although there should be a sample config or wiki entry showing how to apply the built-in stopwords list. this is because it doesn't contain actual stopwords, but must be used to prevent indexing punctuation... note: we did some refactoring/cleanup on this analyzer recently, so it would be much easier to do this after the next lucene update. it has also been moved out of -analyzers.jar due to size, and now builds in its own smartcn jar file, so that would need to be added if this feature is desired. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
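The "log a WARN instead of failing" behaviour suggested in the comment above can be sketched with a plain classpath probe. This is an illustration under assumptions, not the actual Carrot2 or Solr code: the class {{OptionalAnalyzerCheck}} and its methods are hypothetical, java.util.logging stands in for whatever logging framework is really used, and the analyzer class name in {{main}} is an assumed value that may differ between Lucene versions.

```java
import java.util.logging.Logger;

// Hypothetical helper showing one way to treat a JAR as an optional dependency.
final class OptionalAnalyzerCheck {
    private static final Logger LOG = Logger.getLogger(OptionalAnalyzerCheck.class.getName());

    // Returns true if the named class can be found, without initializing it.
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, OptionalAnalyzerCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Logs a warning rather than throwing when the optional class is missing.
    static boolean checkOrWarn(String className, String featureName) {
        boolean available = isClassAvailable(className);
        if (!available) {
            LOG.warning(className + " not found on the classpath; it is required for best results with "
                    + featureName + ".");
        }
        return available;
    }

    public static void main(String[] args) {
        // Assumed class name, for illustration only.
        checkOrWarn("org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer", "clustering in Chinese");
    }
}
```

A probe like this lets the caller fall back to a simpler tokenization path when the check returns false, so a request for Chinese clustering degrades instead of failing.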
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12755657#action_12755657 ] Stanislaw Osinski commented on SOLR-1314: - As a follow-up of the discussion on legal-discuss, I've removed the dependency on {{FastVector}} from Carrot2's STC algorithm. The binaries are in the usual place: http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12754699#action_12754699 ] Stanislaw Osinski commented on SOLR-1314: - Good point, Grant. Though the classes we included are merely definitions of native methods, it's better to keep them separate. I've just reverted back to a separate {{nni.jar}}, binaries are here: http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12754593#action_12754593 ] Stanislaw Osinski commented on SOLR-1314: - Let me build C2 with Lucene 2.9 RC4, will post a download URL in a while.
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12754597#action_12754597 ] Stanislaw Osinski commented on SOLR-1314: - Hi Grant, Here's Carrot2 3.1-dev built with Lucene 2.9-rc4: http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/

Please note a few things about the dependencies:
* {{nni.jar}} is now part of {{carrot2-mini.jar}}, so no need to download it separately
* dependencies upgraded to the newer versions (http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/carrot2-mini-3.1-dev.pom), Lucene entry in the POM still needs to be upgraded for version 2.9
* Carrot2 provides experimental support for Chinese Simplified based on the smart cn analyzer -- does Solr distribute that JAR by default?

Please let me know if you have any problems upgrading. S.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736030#action_12736030 ] Stanislaw Osinski commented on SOLR-769: Hi Grant, There's one more thing: we're planning to release version 3.1.0 of Carrot2 with certain bug fixes in clustering algorithm and better support for Chinese (using the new analyzer from Lucene). Our plan is to release after Lucene 2.9 is out, but before Solr 1.4, so that the latter would have a newer version of Carrot2 on board (should be just a matter of replacing Carrot2 JAR / upgrading version of the downloaded dependency). Would that make sense? Should I create a separate issue for it, or rather reopen this one? Thanks, S. Support Document and Search Result clustering - Key: SOLR-769 URL: https://issues.apache.org/jira/browse/SOLR-769 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 1.4 Attachments: clustering-componet-shard.patch, clustering-libs.tar, clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting. Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing search results clustering. Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering. The patch I lays out a contrib module that starts off w/ an integration of a SearchComponent for doing clustering and an implementation using Carrot. In search results mode, it will use the DocList as the input for the cluster. 
While Carrot2 comes w/ a Solr input component, it is not the same as the SearchComponent that I have in that the Carrot example actually submits a query to Solr, whereas my SearchComponent is just chained into the Component list and uses the ResponseBuilder to add in the cluster results. While not fully fleshed out yet, the collection based mode will take in a list of ids or just use the whole collection and will produce clusters. Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication??) for the clusters. I _may_ push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib. It may even make sense that we split this out, such that the building piece is something like an UpdateProcessor and then the SearchComponent just acts as a lookup mechanism. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
Upgrade Carrot2 to version 3.1.0 Key: SOLR-1314 URL: https://issues.apache.org/jira/browse/SOLR-1314 Project: Solr Issue Type: Task Reporter: Stanislaw Osinski Fix For: 1.4 As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes in clustering algorithms and improved clustering in Chinese. The upgrade should be a matter of upgrading {{carrot2-mini.jar}} and {{google-collections.jar}}.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736039#action_12736039 ] Stanislaw Osinski commented on SOLR-769: Created: SOLR-1314. I'll attach a patch there as soon as Lucene 2.9 is released.
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-769:
---
Attachment: subcluster-flattening.patch

Hi,

While configuring the clustering component for an algorithm that returns hierarchical clusters, it took me a while to work out why subclusters wouldn't appear in the output. It turned out that the default value of the {{carrot.outputSubClusters}} parameter is {{false}}, which was the opposite of what I assumed :-) Would it be a problem to change the default to {{true}}, so that other users avoid the same problem?

Another improvement worth making for the {{carrot.outputSubClusters}} = {{false}} case is flattening the clusters: returning all documents of the first-level clusters, including those contained in the subclusters the user chose not to output. Without this improvement, many document-cluster assignments may be lost, because some Carrot2 algorithms assign documents only to the leaf (deepest in the hierarchy) clusters.

I'm attaching a patch that implements both changes.

Support Document and Search Result clustering

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
Fix For: 1.4
Attachments: clustering-componet-shard.patch, clustering-libs.tar, clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch

Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting. Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed library for doing search results clustering. Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering.

The patch lays out a contrib module that starts off w/ an integration of a SearchComponent for doing clustering and an implementation using Carrot. In search results mode, it will use the DocList as the input for the cluster. While Carrot2 comes w/ a Solr input component, it is not the same as the SearchComponent that I have, in that the Carrot example actually submits a query to Solr, whereas my SearchComponent is just chained into the Component list and uses the ResponseBuilder to add in the cluster results.

While not fully fleshed out yet, the collection-based mode will take in a list of ids or just use the whole collection and will produce clusters. Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication??) for the clusters. I _may_ push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib. It may even make sense that we split this out, such that the building piece is something like an UpdateProcessor and then the SearchComponent just acts as a lookup mechanism.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
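For readers who want to picture the flattening proposed in the comment on subcluster-flattening.patch, here is a minimal sketch. It is not the code from the patch and it does not use the Carrot2 Cluster class: SketchCluster and SubclusterFlattening are hypothetical names introduced only for illustration, and the sketch assumes that a document assigned to several subclusters should be listed once in the flattened first-level cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a hierarchical cluster; NOT the Carrot2 Cluster class.
final class SketchCluster {
    final String label;
    final List<String> docIds = new ArrayList<>();
    final List<SketchCluster> subclusters = new ArrayList<>();

    SketchCluster(String label, String... docIds) {
        this.label = label;
        this.docIds.addAll(Arrays.asList(docIds));
    }

    // Adds a child cluster and returns this cluster to allow chaining.
    SketchCluster add(SketchCluster child) {
        subclusters.add(child);
        return this;
    }
}

final class SubclusterFlattening {
    // Collects the ids assigned to a cluster and to all of its descendants,
    // keeping first-seen order and dropping duplicates (an assumption of this sketch).
    static Set<String> allDocIds(SketchCluster cluster) {
        Set<String> ids = new LinkedHashSet<>(cluster.docIds);
        for (SketchCluster sub : cluster.subclusters) {
            ids.addAll(allDocIds(sub));
        }
        return ids;
    }

    // Returns a copy of a first-level cluster that has no subclusters but still
    // lists every document found anywhere below it in the hierarchy.
    static SketchCluster flatten(SketchCluster cluster) {
        SketchCluster flat = new SketchCluster(cluster.label);
        flat.docIds.addAll(allDocIds(cluster));
        return flat;
    }

    public static void main(String[] args) {
        SketchCluster top = new SketchCluster("Data Mining", "d1")
            .add(new SketchCluster("Software", "d2", "d3"))
            .add(new SketchCluster("Research", "d3", "d4"));
        System.out.println(flatten(top).docIds); // [d1, d2, d3, d4]
    }
}
```

Without the recursive step, an algorithm that assigns documents only to leaf clusters would leave "Data Mining" with just d1 once its subclusters are hidden, which is the loss of assignments the comment describes.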
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725739#action_12725739 ]

Stanislaw Osinski commented on SOLR-769:

bq. Is labels needed because there could be multiple labels per cluster in the future? (I assume yes)

Correct. Currently neither of Carrot2's algorithms creates clusters with multiple labels, but it's quite likely that there are other algorithms that can do that.
Re: clustering unit test failures on ubuntu
Hi Mark,

> Weird. I don't have Ubuntu handy, but it looks like it is having problems with JUnit itself, unless I'm misreading the exception.

Weird indeed. One thing that randomly springs to mind is the reflection magic Carrot2 uses here and there, which might be a problem when you enable the security manager. For a quick test, could you check whether the Carrot2 tests pass on your machine (we're using JUnit 4 too):

  svn co https://carrot2.svn.sourceforge.net/svnroot/carrot2/trunk
  cd trunk
  echo external.api.tests.disabled=true > local.properties
  ant test

Thanks,
Staszek
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712534#action_12712534 ]

Stanislaw Osinski commented on SOLR-769:

In fact, you can set Carrot2 attributes (both init- and request-time) in the Solr config file; this should also work without the patch. Just add:

{{<str name="Tokenizer.analyzer">fully.qualified.class.Name</str>}}

to the search component element. See http://wiki.apache.org/solr/ClusteringComponent for some examples. You'll find a list of Carrot2 attributes, their ids and descriptions at: http://download.carrot2.org/stable/manual/#chapter.components.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712545#action_12712545 ]

Stanislaw Osinski commented on SOLR-769:

Ah, I should have mentioned that up front -- Carrot2 will try to convert the string into the type accepted by the attribute. In the case of class-typed attributes, it will try to load the class using the current thread's context classloader. Conversions are also available for numeric, boolean and enum attributes (see: http://download.carrot2.org/head/javadoc/org/carrot2/util/attribute/AttributeBinder.AttributeTransformerFromString.html).

Please let me know if that way works for you.
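The string-to-type conversion described in the comment above can be pictured with the sketch below. This is not Carrot2's AttributeBinder code; AttributeStringConversion and its convert method are hypothetical names, and the set of supported types is an assumption chosen to mirror the cases the comment lists (class, numeric, boolean, enum).

```java
final class AttributeStringConversion {
    // Converts a raw configuration string to the requested target type.
    // Hypothetical helper for illustration only; not part of Carrot2 or Solr.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static Object convert(String value, Class<?> targetType) {
        if (targetType == String.class) {
            return value;
        }
        if (targetType == Integer.class || targetType == int.class) {
            return Integer.valueOf(value);
        }
        if (targetType == Double.class || targetType == double.class) {
            return Double.valueOf(value);
        }
        if (targetType == Boolean.class || targetType == boolean.class) {
            return Boolean.valueOf(value);
        }
        if (targetType.isEnum()) {
            // Raw types are needed because the enum type is only known at run time.
            return Enum.valueOf((Class) targetType, value);
        }
        if (targetType == Class.class) {
            // Class-typed attributes: resolve the name through the current
            // thread's context classloader, falling back to this class's loader.
            ClassLoader loader = Thread.currentThread().getContextClassLoader();
            if (loader == null) {
                loader = AttributeStringConversion.class.getClassLoader();
            }
            try {
                return Class.forName(value, true, loader);
            } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException("Cannot load class: " + value, e);
            }
        }
        throw new IllegalArgumentException("No conversion to " + targetType.getName());
    }

    public static void main(String[] args) {
        System.out.println(convert("42", Integer.class));
        System.out.println(convert("java.util.ArrayList", Class.class));
    }
}
```

Resolving the class through the context classloader matters in a container such as Solr, where the analyzer class named in the config may live in a different classloader than the library doing the conversion.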
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712421#action_12712421 ]

Stanislaw Osinski commented on SOLR-769:

Pasting the comment I made on the list:

The catch with the analyzer is that this specific attribute is an initialization-time attribute, so you need to add it to the {{initAttributes}} map in the {{init()}} method of {{CarrotClusteringEngine}}.

Please let me know if this solves the problem. If not, I'll investigate further.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710087#action_12710087 ]

Stanislaw Osinski commented on SOLR-769:

Thanks Grant! Looking forward to seeing the code in the repo!

S.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695463#action_12695463 ]

Stanislaw Osinski commented on SOLR-769:

Hi Grant,

If you download http://download.carrot2.org/stable/carrot2-java-api-3.0.1.zip, you'll find licenses in the lib/ folder of the distribution. That distribution contains slightly more JARs than needed for Solr (which uses carrot2-mini.jar), so you'd need to pick only those that are relevant.

S.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688171#action_12688171 ]

Stanislaw Osinski commented on SOLR-769:

bq. Also, you say C2 can handle full docs, is it feasible, then to implement it for the offline mode I have in mind, whereby you cluster the whole collection offline and then store the clusters for retrieval? I haven't implemented this yet, but was thinking some people will be interested in full corpus clustering. The nice thing, then, is that as new documents come in, they can be added to existing clusters (and maybe periodically, we re-cluster). Just thinking out loud.

We have two variables here: the length of docs and the number of docs. Carrot2 is suitable for small numbers of docs (up to, say, 1000). If the docs are short (a paragraph or so), the clustering should be pretty fast, suitable for on-line processing (see: http://project.carrot2.org/algorithms.html). If the documents get longer, Carrot2 will still handle them, but will require some more time for processing; I'll try to do some measurements. But C2 is not useful for the whole-collection case -- it performs all processing in memory, and here we'd need a totally different class of algorithm, something along the lines of the Mahout developments.

bq. Hmm, that's an interesting thought. We could check to see if highlighting is done first.

To quickly summarise the pros and cons of relying on highlighting being done outside of the clustering component:

Pros:
* we avoid duplication of processing (highlighting being done twice)
* simpler code of the clustering component, less configuration

Cons:
* if someone doesn't want highlighting in the search results, the clustering is likely to take more time (because it operates on full documents, and it's controlled globally)
* depending on the highlighter, we may get some markup in the summaries, which may affect clustering (I'd need to check how Carrot2 handles that)

bq. Should the MockClusteringAlgorithm be under the test source tree and not the main one? I moved it in the patch to follow

Absolutely, it should be in the test source.

bq. I don't think we need to output the number of clusters, since that will be obvious from the list size. I dropped it in the patch to follow

Makes sense, I kept it because the original version had it.

bq. Also, on the response structure, we certainly could make it optional, although it means having to go do a lookup in the real doc list, which could be less than fun.

By lookup you mean the lookup in the XML response? Here again we have a trade-off between the length of the response and ease of processing: if we repeat document titles / snippets in the clusters structure, we at least double the response size ("at least" because the same document may belong to many clusters), but can potentially save some lookups. But if we want to get some other fields of a document (other than those we repeat in the clusters list), we'd still need a lookup.

To sum up, my intuition would be to avoid duplication and stick with document ids in the cluster list (this is what we do in Carrot2 XMLs as well). Optionally, the clustering component could have a list of configurable fields to be repeated in the cluster list, if that's really helpful in real-world use cases.
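To make the "document ids only" option from the comment above concrete, here is a small client-side sketch of the lookup it implies. It does not reflect the actual Solr response classes: ClusterDocLookup, the "id" and "title" field names, and the map-based document representation are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical client-side helper; not part of Solr or Carrot2.
final class ClusterDocLookup {
    // Builds a toy document with the two fields used in this sketch.
    static Map<String, String> doc(String id, String title) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", id);
        fields.put("title", title);
        return fields;
    }

    // Indexes the main result list by document id, so that clusters
    // only need to carry ids rather than repeated titles or snippets.
    static Map<String, Map<String, String>> indexById(List<Map<String, String>> docs) {
        Map<String, Map<String, String>> byId = new LinkedHashMap<>();
        for (Map<String, String> d : docs) {
            byId.put(d.get("id"), d);
        }
        return byId;
    }

    // Resolves one field for every document id listed in a cluster,
    // skipping ids that are not present in the result list.
    static List<String> fieldValues(List<String> clusterDocIds,
                                    Map<String, Map<String, String>> byId,
                                    String field) {
        List<String> values = new ArrayList<>();
        for (String id : clusterDocIds) {
            Map<String, String> d = byId.get(id);
            if (d != null) {
                values.add(d.get(field));
            }
        }
        return values;
    }
}
```

With this shape the index is built once per response, and any field can then be resolved for any cluster, which is why repeating selected fields in the cluster list only saves work when those are the only fields a client ever needs.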
Re: Release 1.4
Hi,

> I'd also like to get clustering in, SOLR-769. I will have some time for it after ApacheCon (or, possibly, during)

It would be great to see Carrot2 clustering in Solr! Of the remaining tasks I listed in SOLR-769, I can update the Wiki page to match the current implementation and do some more clean-ups to the code. I think I should be able to handle these before you start your work (before ApacheCon starts), but if you happen to start earlier, please let me know so that we don't overwrite each other's changes.

Cheers,
S.
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-769:
---
Attachment: (was: SOLR-769.patch)
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-769:
---
Attachment: SOLR-769.zip

Further code clean-ups, support for passing initialization-time attributes to Carrot2 algorithms, some comments in the example configuration file.
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-769:
---
Attachment: (was: SOLR-769-lib.zip)
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-769:
---
Attachment: SOLR-769-lib.zip

Libs with Carrot2 v3.0.1, which we've just released.
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-769:
---
Attachment: SOLR-769-lib.zip SOLR-769.patch

Yet another patch, this time with passing unit tests and working example. Will make some more comments in a sec. Please use SOLR-769-lib.zip libs with this patch.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680942#action_12680942 ]

Stanislaw Osinski commented on SOLR-769:

Hi All,

I've just uploaded a patch that passes unit tests and has a working example, but this is by no means a final version. A few outstanding questions / issues:

# h4. Response structure
I was wondering -- do we need to repeat the document contents in the 'clusters' response section? Assuming that each document in the index has a unique ID, we could reduce the size of the response by just referencing documents by IDs, like this:
{code}
<lst name="clusters">
  <int name="numClusters">3</int>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">GPU VPU Clocked</str>
    </lst>
    <lst name="docs">
      <str name="doc">EN7800GTX/2DHTV/256M</str>
      <str name="doc">100-435805</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Hard Drive</str>
    </lst>
    <lst name="docs">
      <str name="doc">6H500F0</str>
      <str name="doc">SP2514N</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Other Topics</str>
    </lst>
    <lst name="docs">
      <str name="doc">9885A004</str>
    </lst>
  </lst>
</lst>
{code}
Actually, this is what I've implemented in the patch. Also, in case of hierarchical clusters, I've introduced a grouping entity called clusters so that the top- and sub-levels of the response are consistent (see unit tests). Please let me know if this makes sense.

# h4. Build: compile warnings about missing SimpleXML
SimpleXML is one of the problematic dependencies as it's GPL. Luckily, it's not needed at runtime, but its absence generates warnings about missing dependencies at compile time. So the options are either to live with the warnings or to add SimpleXML (version 1.7.2) to get rid of them.

# h4. Build: copying of protowords.txt etc.
The patch includes lexical files both in contrib/clustering/src/java/test/resources/ and in the examples dir. I'm not sure how this is handled though -- do you keep copies in the repository or copy those somehow in the build?

# h4. Highlighting
This is the bit I've not yet fully analyzed. In general, Carrot2 should handle full documents fairly well (up to, say, a few hundred kB each); it's just the number of documents that must be on the order of hundreds. Therefore, highlighting is not mandatory, but it may sometimes improve the quality of clusters. I was wondering: if highlighting is performed earlier in the Solr pipeline, could this be reused during clustering? One possible approach could be that clustering uses whatever is fed from the pipeline: if highlighting is enabled, clustering will be performed on the highlighted content; if there was no highlighting, we'd cluster full documents. Not sure if that's reasonable / possible to implement though.

# h4. Documentation (wiki) updates
Once we stabilise the ideas, I'm happy to update the wiki with regard to the algorithms used (Lingo/STC) and passing additional parameters.
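To illustrate the ID-only response structure proposed in the comment above, here is a rough client-side sketch in Python. It is not part of the patch: the function names `parse_clusters` and `resolve`, and the `docs_by_id` mapping, are hypothetical, and the sample is abridged to two of the three clusters. It only shows that a client can rebuild full clusters by joining the referenced IDs against the main result list.

```python
import xml.etree.ElementTree as ET

# Abridged 'clusters' section in the ID-only format discussed above.
CLUSTERS_XML = """
<lst name="clusters">
  <int name="numClusters">2</int>
  <lst name="cluster">
    <lst name="labels"><str name="label">Hard Drive</str></lst>
    <lst name="docs">
      <str name="doc">6H500F0</str>
      <str name="doc">SP2514N</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels"><str name="label">Other Topics</str></lst>
    <lst name="docs"><str name="doc">9885A004</str></lst>
  </lst>
</lst>
"""


def parse_clusters(xml_text):
    """Return one (labels, doc_ids) tuple per top-level cluster."""
    root = ET.fromstring(xml_text.strip())
    clusters = []
    for cluster in root.findall("lst[@name='cluster']"):
        labels = [s.text for s in cluster.findall("lst[@name='labels']/str")]
        doc_ids = [s.text for s in cluster.findall("lst[@name='docs']/str")]
        clusters.append((labels, doc_ids))
    return clusters


def resolve(clusters, docs_by_id):
    """Swap document IDs for full documents taken from the main result list.

    IDs missing from docs_by_id are skipped rather than raising an error.
    """
    return [
        (labels, [docs_by_id[i] for i in ids if i in docs_by_id])
        for labels, ids in clusters
    ]


if __name__ == "__main__":
    # Placeholder documents; a real client would take these from the
    # regular result section of the same response.
    docs_by_id = {
        "6H500F0": {"id": "6H500F0"},
        "SP2514N": {"id": "SP2514N"},
        "9885A004": {"id": "9885A004"},
    }
    for labels, docs in resolve(parse_clusters(CLUSTERS_XML), docs_by_id):
        print(labels, [d["id"] for d in docs])
```

The trade-off is the one raised in the comment: the response stays small, at the cost of one extra lookup per document on the client.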
[jira] Issue Comment Edited: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680942#action_12680942 ]

Stanislaw Osinski edited comment on SOLR-769 at 3/11/09 10:46 AM:
--

Hi All,

I've just uploaded a patch that passes unit tests and has a working example, but this is by no means a final version. A few outstanding questions / issues:

1. Response structure.
I was wondering -- do we need to repeat the document contents in the 'clusters' response section? Assuming that each document in the index has a unique ID, we could reduce the size of the response by just referencing documents by IDs, like this:
{code}
<lst name="clusters">
  <int name="numClusters">3</int>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">GPU VPU Clocked</str>
    </lst>
    <lst name="docs">
      <str name="doc">EN7800GTX/2DHTV/256M</str>
      <str name="doc">100-435805</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Hard Drive</str>
    </lst>
    <lst name="docs">
      <str name="doc">6H500F0</str>
      <str name="doc">SP2514N</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Other Topics</str>
    </lst>
    <lst name="docs">
      <str name="doc">9885A004</str>
    </lst>
  </lst>
</lst>
{code}
Actually, this is what I've implemented in the patch. Also, in case of hierarchical clusters, I've introduced a grouping entity called clusters so that the top- and sub-levels of the response are consistent (see unit tests). Please let me know if this makes sense.

2. Build: compile warnings about missing SimpleXML
SimpleXML is one of the problematic dependencies as it's GPL. Luckily, it's not needed at runtime, but its absence generates warnings about missing dependencies at compile time. So the options are either to live with the warnings or to add SimpleXML (version 1.7.2) to get rid of them.

3. Build: copying of protowords.txt etc.
The patch includes lexical files both in contrib/clustering/src/java/test/resources/ and in the examples dir. I'm not sure how this is handled though -- do you keep copies in the repository or copy those somehow in the build?

4. Highlighting
This is the bit I've not yet fully analyzed. In general, Carrot2 should handle full documents fairly well (up to, say, a few hundred kB each); it's just the number of documents that must be on the order of hundreds. Therefore, highlighting is not mandatory, but it may sometimes improve the quality of clusters. I was wondering: if highlighting is performed earlier in the Solr pipeline, could this be reused during clustering? One possible approach could be that clustering uses whatever is fed from the pipeline: if highlighting is enabled, clustering will be performed on the highlighted content; if there was no highlighting, we'd cluster full documents. Not sure if that's reasonable / possible to implement though.

5. Documentation (wiki) updates
Once we stabilise the ideas, I'm happy to update the wiki with regard to the algorithms used (Lingo/STC) and passing additional parameters.
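The "cluster whatever the pipeline provides" idea from the Highlighting point can be sketched as a small selection step. This is only an illustration in Python, not how the patch works: `clustering_input` and its parameters are hypothetical names, and the `highlights` argument is assumed to have the shape doc ID -> field name -> list of snippets.

```python
def clustering_input(doc_ids, stored_fields, highlights=None, field="text"):
    """Pick, per document, the text that would be handed to the clustering algorithm.

    doc_ids       -- IDs of the documents in the current result page
    stored_fields -- doc ID -> dict of stored field values (full content)
    highlights    -- optional doc ID -> field -> list of highlighted snippets
    field         -- hypothetical name of the field being clustered
    """
    highlights = highlights or {}
    texts = {}
    for doc_id in doc_ids:
        snippets = highlights.get(doc_id, {}).get(field)
        if snippets:
            # Highlighting ran earlier in the pipeline: reuse its snippets.
            texts[doc_id] = " ... ".join(snippets)
        else:
            # No highlighting for this document: fall back to the full field.
            texts[doc_id] = stored_fields[doc_id].get(field, "")
    return texts
```

Falling back per document rather than per request means a document without any highlighted snippet still contributes its full text instead of dropping out of the clustering input.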
Re: Issue using SOLR-769 Patch for Clustering
Unfortunately, the latest patch on that issue doesn't work yet (it should compile) as I am updating it to use Carrot2 3.0. I am in the process of trying to get it working. If you are happy with using Carrot 2.x, then the previous patch on there should work.

Hi Grant,

I also had a few spare cycles to get the plugin to work -- I have a local copy (based on the current patch) that works and passes unit tests, but I'd still need some time to polish the code and integrate it with the build script. I could try to produce a first workable patch (with tests passing etc.) tomorrow evening. I hope we didn't duplicate the work too much...

S.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672281#action_12672281 ]

Stanislaw Osinski commented on SOLR-769:

Hi Grant,

I've added a Carrot2 issue referring to point 3 on your TODO list: http://issues.carrot2.org/browse/CARROT-457. I'll be looking into this over the weekend.

Staszek
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642182#action_12642182 ]

Stanislaw Osinski commented on SOLR-769:

Bruce,

For the performance of the clustering algorithm alone, please take a look at: http://project.carrot2.org/algorithms.html

Obviously, you'd need to add the overhead of fetching the snippets / documents from the index. I'm not sure how many are fetched or whether they come from Solr's cache, so I can't say whether clustering or fetching time dominates.

Cheers,
Staszek