[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845441#action_12845441
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

Hi Robert,

The Lucene dependency is the only change, right? Or did you also upgrade 
Carrot2, e.g. from 3.1 to 3.2? If the latter is the case, the number of 
clusters may have changed, e.g. because we tuned stop words or other algorithm 
attributes.

S.



 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845459#action_12845459
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

I was about to offer advice similar to Grant's, but wanted to wait to confirm 
the scope of changes.

If it was only a Lucene dependency update, and the update didn't change the 
documents fed to Carrot2 in tests, the results shouldn't change. Carrot2 uses 
Lucene interfaces internally, but the tokenizer is not the standard Lucene one, 
so there should be no Version.LUCENE_* issues as far as I can tell.

I haven't got Solr code handy, but maybe the test performs clustering on 
summaries generated from the original test documents and Lucene 3.x introduces 
some changes in the way summaries are generated?

If the clusters look reasonable, the problem is probably not critical, but 
it's still worth investigating to make sure it's not a bug of some kind.

S.





[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845462#action_12845462
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

Yeah, the clusters look good. When you're done with upgrading Lucene to 3.x, we 
could also upgrade Carrot2 to version 3.2.0, which is LGPL-free and could be 
distributed together with Solr.

S.




[jira] Resolved: (SOLR-1809) Carrot2 clustering time logging

2010-03-07 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-1809.
-

Resolution: Invalid

Hi Erik! You're right, {{debugQuery}} should be enough for most cases. 
Resolving as invalid.

 Carrot2 clustering time logging
 ---

 Key: SOLR-1809
 URL: https://issues.apache.org/jira/browse/SOLR-1809
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
 Fix For: 1.5

 Attachments: SOLR-1809.patch


 It may be useful to log the amount of time Carrot2 spent on clustering. This 
 should be helpful when debugging performance issues.




[jira] Created: (SOLR-1809) Carrot2 clustering time logging

2010-03-05 Thread Stanislaw Osinski (JIRA)
Carrot2 clustering time logging
---

 Key: SOLR-1809
 URL: https://issues.apache.org/jira/browse/SOLR-1809
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
 Fix For: 1.5


It may be useful to log the amount of time Carrot2 spent on clustering. This 
should be helpful when debugging performance issues.
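
A rough sketch of the idea, not the attached patch: wrap the clustering call in a timer and log the elapsed milliseconds. The class `TimedClustering`, the method `timeAndLog`, and the use of `java.util.logging` are assumptions made for illustration; Solr's own logging setup and the appropriate level may differ.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper for illustration; not part of Solr or Carrot2. */
final class TimedClustering {
  private static final Logger LOG = Logger.getLogger(TimedClustering.class.getName());

  /** Result of the timed call together with the elapsed wall-clock time. */
  static final class Timed<T> {
    final T result;
    final long elapsedMillis;

    Timed(T result, long elapsedMillis) {
      this.result = result;
      this.elapsedMillis = elapsedMillis;
    }
  }

  /** Runs the task, logs how long it took, and returns both the result and the timing. */
  static <T> Timed<T> timeAndLog(String label, Supplier<T> task) {
    long start = System.nanoTime();
    T result = task.get();
    long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
    // The level is a guess; adjust it to whatever the project's logging policy prefers.
    LOG.log(Level.FINE, "{0} took {1} ms", new Object[] {label, elapsedMillis});
    return new Timed<T>(result, elapsedMillis);
  }
}
```

Keeping the timer in a wrapper like this leaves the clustering code itself untouched, so the log statement can be dropped or moved without affecting results.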




[jira] Updated: (SOLR-1809) Carrot2 clustering time logging

2010-03-05 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-1809:


Attachment: SOLR-1809.patch

An initial patch. I'm not sure what Solr's logging policies are, so feel free 
to change the level as appropriate.




[jira] Commented: (SOLR-1692) CarrotClusteringEngine produce summary does nothing

2010-01-02 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795925#action_12795925
 ] 

Stanislaw Osinski commented on SOLR-1692:
-

{quote}
bq. Where should the configuration of the highlighter we use for clustering 
come from?

We have all the code hooked in for it already, we're just ignoring the output.
{quote}

To avoid confusion and questions along the lines of "why don't the clusters 
match the (highlighted) documents I'm seeing?", I'd suggest a slightly more 
elaborate scenario for the clustering highlighter configuration:

1. If main Solr highlighting is disabled, use the clustering component's 
highlighter settings.
2. If main Solr highlighting is enabled, use the main highlighter's 
configuration as the defaults and let the clustering-specific highlighter 
configuration override the defaults.

If we do it this way, we'll minimize the chances of users accidentally 
performing clustering on documents that differ (in how they are highlighted) 
from those they will see.
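
The two rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ClusteringHighlightParams` and `resolve` are hypothetical names, and plain maps stand in for whatever parameter objects Solr actually passes around.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper illustrating the two rules; not Solr's actual API. */
final class ClusteringHighlightParams {

  /**
   * Returns the highlighter settings to use for clustering.
   * Main highlighter settings act as defaults only when main highlighting is enabled;
   * clustering-specific settings always take precedence.
   */
  static Map<String, String> resolve(boolean mainHighlightingEnabled,
                                     Map<String, String> mainParams,
                                     Map<String, String> clusteringParams) {
    Map<String, String> effective = new LinkedHashMap<String, String>();
    if (mainHighlightingEnabled) {
      // Rule 2: start from the main highlighter's configuration as defaults.
      effective.putAll(mainParams);
    }
    // Rule 1 and the override part of rule 2: clustering-specific settings win.
    effective.putAll(clusteringParams);
    return effective;
  }
}
```

With main highlighting enabled, a main setting such as `hl.snippets` carries over unless the clustering configuration overrides it. With it disabled, only the clustering settings apply.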

bq. Would be great if, Carrot2 could also just use the analysis that 
Lucene/Solr produces, that way it would be much easier to configure stopwords, 
HTML stripping, etc.

This one would require some larger changes to Carrot2 internals. We do use 
Lucene infrastructure for preprocessing (currently for tokenization), but I can 
investigate if we can extend that further. A potential problem here is that 
very often the set of stopwords you use for document retrieval may not work 
equally well for clustering. I've filed a [Carrot2-specific 
issue|http://issues.carrot2.org/browse/CARROT-606] for it and will try to come 
up with something.

 CarrotClusteringEngine produce summary does nothing
 ---

 Key: SOLR-1692
 URL: https://issues.apache.org/jira/browse/SOLR-1692
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Fix For: 1.5

 Attachments: SOLR-1692.patch


 In the CarrotClusteringEngine, the produceSummary option does nothing, as the 
 results of doing the highlighting are just ignored.




[jira] Commented: (SOLR-236) Field collapsing

2009-12-29 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795067#action_12795067
 ] 

Stanislaw Osinski commented on SOLR-236:


Hi Grant,

{quote}
I would note, in looking at the Carrot2 code, they actually have a 
ByFieldClusteringAlgorithm (what they call synthetic clustering) which does 
field collapsing/clustering on a value of a field. To quote the javadocs:

Clusters documents into a flat structure based on the values of some field of 
the documents. By default the {@link Document#SOURCES} field is used and  
Name of the field to cluster by. Each non-null scalar field value with distinct 
hash code will give raise to a single cluster, named using the {@link 
Object#toString()} value of the field. If the field value is a collection, the 
document will be assigned to all clusters corresponding to the values in the 
collection. Note that arrays will not be 'unfolded' in this way.

I don't know how it performs, but it seems like it would at least be worth 
investigating.
{quote}

Carrot2's {{ByFieldClusteringAlgorithm}} is very simple. It literally throws 
everything into a hash map based on the field value ([source 
code|http://fisheye3.atlassian.com/browse/carrot2/trunk/core/carrot2-algorithm-synthetic/src/org/carrot2/clustering/synthetic/ByFieldClusteringAlgorithm.java?r=trunk#l99]).
 This algorithm is used in our live demo to [cluster by news 
source|http://search.carrot2.org/stable/search?source=boss-news&query=iphone&algorithm=source].
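
To make the "hash map based on the field value" idea concrete, here is a minimal sketch. It is not Carrot2's code: `ByFieldGrouping` and `groupBy` are hypothetical names, documents are modeled as plain maps, and only scalar field values are handled (the collection case mentioned in the quoted javadoc is left out).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of by-field grouping; not Carrot2's implementation. */
final class ByFieldGrouping {

  /**
   * Groups documents by the string form of one field.
   * Documents that lack the field are skipped; cluster order follows first appearance.
   */
  static Map<String, List<Map<String, Object>>> groupBy(List<Map<String, Object>> docs,
                                                         String field) {
    Map<String, List<Map<String, Object>>> clusters =
        new LinkedHashMap<String, List<Map<String, Object>>>();
    for (Map<String, Object> doc : docs) {
      Object value = doc.get(field);
      if (value == null) {
        continue; // no value, no cluster
      }
      String label = value.toString();
      List<Map<String, Object>> members = clusters.get(label);
      if (members == null) {
        members = new ArrayList<Map<String, Object>>();
        clusters.put(label, members);
      }
      members.add(doc);
    }
    return clusters;
  }
}
```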

{quote}
Note, they also have a synthetic one for collapsing based on URL: 
ByUrlClusteringAlgorithm
{quote}

This one creates a [hierarchy based on the URL 
segments|http://search.carrot2.org/stable/search?source=boss-web&query=solr&algorithm=url&results=200]
 and might be useful to create by-domain collapsing if needed.
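
As a rough illustration of what by-domain collapsing could look like (a sketch only, not how {{ByUrlClusteringAlgorithm}} works internally), the hypothetical helper below groups URLs by their host name using `java.net.URI`. The names `ByDomainGrouping`, `domainOf`, and `groupByDomain` are made up for this example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of by-domain grouping; not Carrot2's implementation. */
final class ByDomainGrouping {

  /** Returns the lower-cased host of the URL, or an empty string if none can be parsed. */
  static String domainOf(String url) {
    try {
      String host = URI.create(url).getHost();
      return host == null ? "" : host.toLowerCase(Locale.ROOT);
    } catch (IllegalArgumentException e) {
      return ""; // malformed URL
    }
  }

  /** Groups URLs by host, keeping the order in which hosts first appear. */
  static Map<String, List<String>> groupByDomain(List<String> urls) {
    Map<String, List<String>> groups = new LinkedHashMap<String, List<String>>();
    for (String url : urls) {
      String domain = domainOf(url);
      List<String> members = groups.get(domain);
      if (members == null) {
        members = new ArrayList<String>();
        groups.put(domain, members);
      }
      members.add(url);
    }
    return groups;
  }
}
```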

In general, my rough guess is that the criteria for content-based collapsing 
would be closer to duplicate detection than to the type of grouping Carrot2 
produces.

 Field collapsing
 

 Key: SOLR-236
 URL: https://issues.apache.org/jira/browse/SOLR-236
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Emmanuel Keller
Assignee: Shalin Shekhar Mangar
 Fix For: 1.5

 Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
 collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
 collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch, 
 field-collapse-4-with-solrj.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, 
 field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
 field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
 field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
 quasidistributed.additional.patch, SOLR-236-FieldCollapsing.patch, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
 SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, 
 SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, 
 SOLR-236_collapsing.patch


 This patch includes a new feature called field collapsing.
 It is used to collapse a group of results with similar values for a given 
 field into a single entry in the result set. Site collapsing is a special case 
 of this, where all results for a given web site are collapsed into one or two 
 entries in the result set, typically with an associated "more documents from 
 this site" link. See also duplicate detection.
 http://www.fastsearch.com/glossary.aspx?m=48&amid=299
 The implementation adds 3 new query parameters (SolrParams):
 - collapse.field: chooses the field used to group results
 - collapse.type: normal (default value) or adjacent
 - collapse.max: selects how many continuous results are allowed before 
   collapsing
 TODO (in progress):
 - More documentation (on source code)
 - Test cases
 Two patches:
 - field_collapsing.patch for current development version
 - field_collapsing_1.1.0.patch for Solr-1.1.0
 P.S.: Feedback and misspelling correction are welcome ;-)




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-28 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760238#action_12760238
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

The required change is right at the end of the big diff:

{noformat}
Index: contrib/clustering/src/test/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngineTest.java
===================================================================
--- contrib/clustering/src/test/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngineTest.java (revision 819270)
+++ contrib/clustering/src/test/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngineTest.java (working copy)
@@ -40,11 +40,11 @@
 @SuppressWarnings("unchecked")
 public class CarrotClusteringEngineTest extends AbstractClusteringTest {
   public void testCarrotLingo() throws Exception {
-    checkEngine(getClusteringEngine("default"), 9);
+    checkEngine(getClusteringEngine("default"), 10);
   }
 
   public void testCarrotStc() throws Exception {
-    checkEngine(getClusteringEngine("stc"), 2);
+    checkEngine(getClusteringEngine("stc"), 1);
   }
 
   public void testWithoutSubclusters() throws Exception {
{noformat}

 Upgrade Carrot2 to version 3.1.0
 

 Key: SOLR-1314
 URL: https://issues.apache.org/jira/browse/SOLR-1314
 Project: Solr
  Issue Type: Task
Reporter: Stanislaw Osinski
Assignee: Grant Ingersoll
 Fix For: 1.4

 Attachments: SOLR-1314.patch


 As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes 
 in clustering algorithms and improved clustering in Chinese. The upgrade 
 should be a matter of upgrading {{carrot2-mini.jar}} and 
 {{google-collections.jar}}.




[jira] Updated: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-27 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-1314:


Attachment: SOLR-1314.patch

Hi Grant,

I've built Carrot2 3.1.0 binaries and tested them with Solr trunk. Attached is 
a patch that upgrades the libs to Carrot2 3.1.0 and fixes one unit test.

S.




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-25 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759667#action_12759667
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Hi Grant,

bq. Now that Lucene is final, can we finalize the jar for this one? 

Sure, over the weekend we'll be making an official Carrot2 3.1.0 release. As 
part of that process I'll check if the Solr plugin is working fine and will 
post the final JAR here.

bq. Also, this final JAR will handle the license and FastVector stuff, right?

Correct. The following commit removed it from trunk and hence the 3.1.0 release:

http://fisheye3.atlassian.com/changelog/carrot2/?cs=3694

S.




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-23 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758843#action_12758843
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Hi Grant,

I've made Carrot2's dependency on Smart Chinese Analyzer optional, so no 
exceptions should be thrown when the big JAR is not in the classpath. As usual, 
download from here:

http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/

S.




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-16 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756110#action_12756110
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Hi Grant,

I've just dropped the patenting clause entirely. The updated license is in the 
repo and at: http://www.carrot2.org/carrot2.LICENSE.

S.




[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

2009-09-16 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756177#action_12756177
 ] 

Stanislaw Osinski commented on SOLR-1336:
-

Keeping the Chinese analyzer JAR optional sounds good. As Carrot2 also uses it, 
I'd need to make sure the clustering contrib doesn't fail when the JAR is not 
there and clustering in Chinese is requested (I think I'd simply log a WARN 
saying that the Chinese analyzer JAR is required for best clustering results).
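
One way this check could look, as a sketch under assumptions rather than the actual contrib code: probe for the analyzer class by name and log a warning if it cannot be loaded. `OptionalAnalyzerCheck` and its methods are hypothetical, and the class name to probe (for instance `org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer`) is an assumption about where the analyzer lives in the smartcn JAR.

```java
import java.util.logging.Logger;

/** Hypothetical helper for illustration; not part of Solr or Carrot2. */
final class OptionalAnalyzerCheck {
  private static final Logger LOG = Logger.getLogger(OptionalAnalyzerCheck.class.getName());

  /** Returns true if the named class can be found on the classpath, without initializing it. */
  static boolean isAvailable(String className) {
    try {
      Class.forName(className, false, OptionalAnalyzerCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    } catch (LinkageError e) {
      return false; // class present but one of its own dependencies is missing
    }
  }

  /** Logs a warning when the optional class is missing; returns whether it is available. */
  static boolean warnIfMissing(String className) {
    boolean available = isAvailable(className);
    if (!available) {
      LOG.warning(className + " not found on the classpath; "
          + "the Chinese analyzer JAR is required for best clustering results.");
    }
    return available;
  }
}
```

Probing by name avoids a compile-time dependency on the optional JAR, so the caller can fall back to a default tokenizer instead of failing.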

 Add support for lucene's SmartChineseAnalyzer
 -

 Key: SOLR-1336
 URL: https://issues.apache.org/jira/browse/SOLR-1336
 Project: Solr
  Issue Type: New Feature
  Components: Analysis
Reporter: Robert Muir
 Attachments: SOLR-1336.patch, SOLR-1336.patch, SOLR-1336.patch


 SmartChineseAnalyzer was contributed to Lucene; it indexes Simplified Chinese 
 text as words.
 If the factories for the tokenizer and word token filter are added to Solr, it 
 can be used, although there should be a sample config or wiki entry showing 
 how to apply the built-in stopwords list.
 This is because it doesn't contain actual stopwords, but must be used to 
 prevent indexing punctuation.
 Note: we did some refactoring/cleanup on this analyzer recently, so it would 
 be much easier to do this after the next Lucene update.
 It has also been moved out of -analyzers.jar due to size, and now builds in 
 its own smartcn jar file, so that would need to be added if this feature is 
 desired.




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-15 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755657#action_12755657
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

As a follow-up to the discussion on legal-discuss, I've removed the dependency 
on {{FastVector}} from Carrot2's STC algorithm. The binaries are in the usual 
place:

http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-13 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754699#action_12754699
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Good point, Grant. Though the classes we included are merely definitions of 
native methods, it's better to keep them separate. I've just reverted to a 
separate {{nni.jar}}; the binaries are here:

http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-12 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754593#action_12754593
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Let me build Carrot2 with Lucene 2.9 RC4; I'll post a download URL in a while.




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-12 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754597#action_12754597
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Hi Grant,

Here's Carrot2 3.1-dev built with Lucene 2.9-rc4:

http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/

Please note a few things about the dependencies:

* {{nni.jar}} is now part of {{carrot2-mini.jar}}, so there is no need to 
download it separately
* dependencies have been upgraded to newer versions 
(http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/carrot2-mini-3.1-dev.pom); 
the Lucene entry in the POM still needs to be upgraded to version 2.9
* Carrot2 provides experimental support for Simplified Chinese based on the 
smart cn analyzer -- does Solr distribute that JAR by default?

Please let me know if you have any problems upgrading.

S.




[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-28 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736030#action_12736030
 ] 

Stanislaw Osinski commented on SOLR-769:


Hi Grant,

There's one more thing: we're planning to release version 3.1.0 of Carrot2 with 
certain bug fixes in the clustering algorithms and better support for Chinese (using 
the new analyzer from Lucene). Our plan is to release after Lucene 2.9 is out, 
but before Solr 1.4, so that the latter would have a newer version of Carrot2 
on board (should be just a matter of replacing Carrot2 JAR / upgrading version 
of the downloaded dependency). Would that make sense? Should I create a 
separate issue for it, or rather reopen this one?

Thanks,

S.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.




[jira] Created: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-07-28 Thread Stanislaw Osinski (JIRA)
Upgrade Carrot2 to version 3.1.0


 Key: SOLR-1314
 URL: https://issues.apache.org/jira/browse/SOLR-1314
 Project: Solr
  Issue Type: Task
Reporter: Stanislaw Osinski
 Fix For: 1.4


As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes 
in clustering algorithms and improved clustering in Chinese. The upgrade should 
be a matter of upgrading {{carrot2-mini.jar}} and {{google-collections.jar}}.




[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-28 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736039#action_12736039
 ] 

Stanislaw Osinski commented on SOLR-769:


Created: SOLR-1314. I'll attach a patch there as soon as Lucene 2.9 is released.




[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-07-08 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-769:
---

Attachment: subcluster-flattening.patch

Hi,

While configuring the clustering component for an algorithm that returns 
hierarchical clusters, it took me a while to debug why subclusters wouldn't 
appear in the output. It turned out that the default value of the 
{{carrot.outputSubClusters}} parameter is {{false}}, which was the opposite of 
what I assumed :-) Would it be a problem to change the default to {{true}}, so 
that other users avoid the same problem? 

Another improvement worth making for the {{carrot.outputSubClusters}} = 
{{false}} case is flattening the clusters: returning, for each 1st level 
cluster, all of its documents, including those contained in the subclusters 
the user chose not to output. Without this improvement, many document-cluster 
assignments may be lost because some Carrot2 algorithms will assign documents 
only to the leaf (deepest in the hierarchy) clusters.

I'm attaching a patch that implements both changes.
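
The flattening idea can be sketched as follows. This is not the attached 
patch, only a minimal illustration; {{SimpleCluster}} and 
{{ClusterFlattener}} are hypothetical names introduced here, not Carrot2 or 
Solr classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical minimal cluster type; not the Carrot2 or Solr class.
final class SimpleCluster {
    final String label;
    final List<String> docIds = new ArrayList<>();
    final List<SimpleCluster> subclusters = new ArrayList<>();

    SimpleCluster(String label) {
        this.label = label;
    }
}

final class ClusterFlattener {
    // Collects the ids of all documents assigned to the cluster itself and,
    // recursively, to its subclusters. Order is kept, duplicates are dropped.
    static Set<String> allDocIds(SimpleCluster cluster) {
        Set<String> ids = new LinkedHashSet<>(cluster.docIds);
        for (SimpleCluster sub : cluster.subclusters) {
            ids.addAll(allDocIds(sub));
        }
        return ids;
    }

    // Returns a copy of a first-level cluster that has no subclusters but
    // carries every document from the whole subtree, so assignments made
    // only to leaf clusters are not lost.
    static SimpleCluster flatten(SimpleCluster cluster) {
        SimpleCluster flat = new SimpleCluster(cluster.label);
        flat.docIds.addAll(allDocIds(cluster));
        return flat;
    }
}
```

With subcluster output disabled, each first-level cluster would be passed 
through {{flatten}} before being written to the response.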

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch





[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-30 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725739#action_12725739
 ] 

Stanislaw Osinski commented on SOLR-769:


bq. Is labels needed because there could be multiple labels per cluster in 
the future? ( I assume yes)

Correct. Currently neither of Carrot2's algorithms creates clusters with 
multiple labels, but it's quite likely that there are other algorithms that can 
do that.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip





[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-24 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712534#action_12712534
 ] 

Stanislaw Osinski commented on SOLR-769:


In fact, you can set Carrot2 attributes (both init- and request-time) in the 
Solr config file; this should also work without the patch. Just add:

{{<str name="Tokenizer.analyzer">fully.qualified.class.Name</str>}}

to the search component element. See 
http://wiki.apache.org/solr/ClusteringComponent for an example. You'll find a 
list of Carrot2 attributes, their ids and descriptions at: 
http://download.carrot2.org/stable/manual/#chapter.components.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip





[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-24 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712545#action_12712545
 ] 

Stanislaw Osinski commented on SOLR-769:


Ah, I should have mentioned that up front -- Carrot2 will try to convert the 
string into the type accepted by the attribute. For class-typed attributes, 
it will try to load the class using the current thread's context 
classloader. Conversions are also available for numeric, boolean and enum 
attributes (see: 
http://download.carrot2.org/head/javadoc/org/carrot2/util/attribute/AttributeBinder.AttributeTransformerFromString.html).
Please let me know if that way works for you.
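
To illustrate the kind of conversion described above, here is a minimal 
sketch in plain Java. It is not Carrot2's actual implementation; 
{{AttributeStringConverter}} is a hypothetical name, and the set of supported 
types is an assumption chosen for the example.

```java
// Hypothetical helper; a sketch of string-to-attribute-type conversion,
// not the Carrot2 AttributeBinder code.
final class AttributeStringConverter {
    static Object convert(String value, Class<?> targetType) {
        if (targetType == String.class) {
            return value;
        }
        if (targetType == Integer.class || targetType == int.class) {
            return Integer.valueOf(value);
        }
        if (targetType == Double.class || targetType == double.class) {
            return Double.valueOf(value);
        }
        if (targetType == Boolean.class || targetType == boolean.class) {
            return Boolean.valueOf(value);
        }
        if (targetType.isEnum()) {
            // Match the string against the enum constant names.
            for (Object constant : targetType.getEnumConstants()) {
                if (((Enum<?>) constant).name().equals(value)) {
                    return constant;
                }
            }
            throw new IllegalArgumentException(
                "No constant " + value + " in " + targetType.getName());
        }
        if (targetType == Class.class) {
            // Class-typed attribute: load the class through the current
            // thread's context classloader. A null loader falls back to
            // the bootstrap loader.
            try {
                ClassLoader loader = Thread.currentThread().getContextClassLoader();
                return Class.forName(value, true, loader);
            } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException("Cannot load class: " + value, e);
            }
        }
        throw new IllegalArgumentException(
            "No conversion from String to " + targetType.getName());
    }
}
```

The classloader choice matters in a container such as Solr: a class that is 
visible to the web application may not be visible to the loader that loaded 
the clustering library itself.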

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip





[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-23 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712421#action_12712421
 ] 

Stanislaw Osinski commented on SOLR-769:


Pasting the comment I made on the list:

The catch with analyzer is that this specific attribute is an 
initialization-time attribute, so you need to add it to the {{initAttributes}} 
map in the {{init()}} method of {{CarrotClusteringEngine}}.

Please let me know if this solves the problem. If not, I'll investigate further.
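
The distinction between the two kinds of attributes can be sketched as below. 
This is an illustration with plain maps, not the {{CarrotClusteringEngine}} 
code; {{AttributeSplitter}} and the request-time key used in the usage note 
are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: routes configured attributes either to the map that
// is applied once at initialization or to the map applied on each request.
final class AttributeSplitter {
    final Map<String, Object> initAttributes = new HashMap<>();
    final Map<String, Object> requestAttributes = new HashMap<>();

    AttributeSplitter(Map<String, Object> configured, Set<String> initTimeKeys) {
        for (Map.Entry<String, Object> entry : configured.entrySet()) {
            if (initTimeKeys.contains(entry.getKey())) {
                // Initialization-time attributes must be known before the
                // algorithm instance is created.
                initAttributes.put(entry.getKey(), entry.getValue());
            } else {
                requestAttributes.put(entry.getKey(), entry.getValue());
            }
        }
    }
}
```

For example, with {{Tokenizer.analyzer}} listed as an init-time key, a 
configured analyzer class name ends up in {{initAttributes}}, while a 
made-up key such as {{example.requestAttribute}} would stay in 
{{requestAttributes}}.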

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip





[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-16 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710087#action_12710087
 ] 

Stanislaw Osinski commented on SOLR-769:


Thanks Grant! Looking forward to seeing the code in the repo!

S.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, 
 SOLR-769.zip





[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-04-03 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695463#action_12695463
 ] 

Stanislaw Osinski commented on SOLR-769:


Hi Grant,

If you download http://download.carrot2.org/stable/carrot2-java-api-3.0.1.zip, 
you'll find licenses in the lib/ folder of the distribution. That distribution 
contains slightly more JARs than needed for Solr (which uses carrot2-mini.jar), 
so you'd need to pick only those that are relevant.

S.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.zip





[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-03-22 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688171#action_12688171
 ] 

Stanislaw Osinski commented on SOLR-769:


bq. Also, you say C2 can handle full docs, is it feasible, then to implement it 
for the offline mode I have in mind, whereby you cluster the whole collection 
offline and then store the clusters for retrieval? I haven't implemented this 
yet, but was thinking some people will be interested in full corpus clustering. 
The nice thing, then, is that as new documents come in, they can be added to 
existing clusters (and maybe periodically, we re-cluster). Just thinking 
outloud.

We have two variables here: the length of docs and the number of docs. Carrot2 
is suitable for small numbers of docs (up to say 1000). If the docs are short 
(a paragraph or so), the clustering should be pretty fast, suitable for on-line 
processing (see: http://project.carrot2.org/algorithms.html). If the documents 
get longer, Carrot2 will still handle them, but will require more time for 
processing; I'll try to do some measurements. But C2 is not useful for the 
whole collection case: it performs all processing in-memory, and there we'd 
need a totally different class of algorithm, something along the lines of 
what Mahout is developing.

bq. Hmm, that's an interesting thought. We could check to see if highlighting 
is done first.

To quickly summarise the pros and cons of relying on highlighting being done 
outside of the clustering component:

Pros:

* we avoid duplication of processing (highlighting being done twice)
* simpler code of the clustering component, less configuration

Cons:

* if someone doesn't want highlighting in the search results, the clustering is 
likely to take more time (because it operates on full documents, and it's 
controlled globally)
* depending on the highlighter, we may get some markup in the summaries, which 
may affect clustering (I'd need to check how Carrot2 handles that)

bq. Should the MockClusteringAlgorithm be under the test source tree and not 
the main one? I moved it in the patch to follow 

Absolutely, it should be in the test source.

bq. I don't think we need to output the number of clusters, since that will be 
obvious from the list size. I dropped it in the patch to follow

Makes sense, I kept it because the original version had it.

bq. Also, on the response structure, we certainly could make it optional, 
although it means having to go do a lookup in the real doc list, which could be 
less than fun.

By lookup you mean the lookup in the XML response? Here again we have a 
trade-off between the length of the response and ease of processing: if we 
repeat document titles / snippets in the clusters structure, we at least double 
the response size (at least, because the same document may belong to many 
clusters), but can potentially save some lookups. But if we want to get some 
other fields of a document (other than those we repeat in the clusters list), 
we'd still need a lookup. 

To sum up, my intuition would be to avoid duplication and stick with document 
ids in the cluster list (this is what we do in Carrot2 XMLs as well). Optionally, 
the clustering component could have a list of configurable fields to be 
repeated in the cluster list if that's really helpful in real-world use cases.
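
The trade-off above can be sketched as follows: ids only by default, with an 
optional list of fields copied into each cluster entry. This is an 
illustration, not the component's actual response format; 
{{ClusterResponseSketch}} and the entry keys {{labels}} and {{docs}} are 
assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the real response structure may differ.
final class ClusterResponseSketch {
    // Builds one cluster entry. With no repeated fields the entry holds only
    // document ids, and the client looks documents up in the main doc list.
    // With repeated fields, each document carries a copy of those fields.
    static Map<String, Object> clusterEntry(
            List<String> labels,
            List<String> docIds,
            Map<String, Map<String, Object>> docsById,
            List<String> repeatedFields) {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("labels", labels);
        if (repeatedFields.isEmpty()) {
            entry.put("docs", docIds);
            return entry;
        }
        List<Map<String, Object>> docs = new ArrayList<>();
        for (String id : docIds) {
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("id", id);
            Map<String, Object> full = docsById.get(id);
            for (String field : repeatedFields) {
                if (full != null && full.containsKey(field)) {
                    doc.put(field, full.get(field));
                }
            }
            docs.add(doc);
        }
        entry.put("docs", docs);
        return entry;
    }
}
```

Any field not listed in {{repeatedFields}} still requires a lookup by id, 
which is why ids-only is the simpler default.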

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.zip



[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-03-20 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-769:
---

Attachment: (was: SOLR-769.patch)

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.zip





[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-03-20 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-769:
---

Attachment: SOLR-769.zip

Further code clean-ups, support for passing initialization-time attributes to 
Carrot2 algorithms, some comments in the example configuration file.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.zip





[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-03-18 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-769:
---

Attachment: (was: SOLR-769-lib.zip)

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch I lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.
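
A rough sketch of the build/lookup split described above, in plain Java. ClusterStore and its methods are hypothetical names used only for illustration; they are not part of the patch or of any Solr API, and a real implementation would need persistent (and possibly replicated) storage rather than an in-memory map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the "storage mechanism" mentioned above.
final class ClusterStore {
    private final Map<String, List<String>> labelsByDocId = new HashMap<>();

    // Build side (for example an offline job or update-time hook):
    // record that a document belongs to a cluster with the given label.
    void assign(String docId, String clusterLabel) {
        labelsByDocId.computeIfAbsent(docId, k -> new ArrayList<>()).add(clusterLabel);
    }

    // Lookup side (query time): return the stored labels without re-clustering.
    // Unknown documents yield an empty list.
    List<String> lookup(String docId) {
        List<String> labels = labelsByDocId.get(docId);
        return labels == null ? Collections.emptyList()
                              : Collections.unmodifiableList(labels);
    }
}
```

The point of the split is that the expensive clustering run only ever calls assign(), while the query path only ever calls lookup().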




[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-03-18 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-769:
---

Attachment: SOLR-769-lib.zip

Libs with Carrot2 v3.0.1 we've just released.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch





[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-03-11 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-769:
---

Attachment: SOLR-769-lib.zip
SOLR-769.patch

Yet another patch, this time with passing unit tests and a working example. I 
will add some more comments in a sec. Please use the SOLR-769-lib.zip libs with 
this patch.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch





[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-03-11 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680942#action_12680942
 ] 

Stanislaw Osinski commented on SOLR-769:


Hi All,

I've just uploaded a patch that passes unit tests and has a working example, but 
this is by no means a final version. A few outstanding questions / issues:

# h4. Response structure.

I was wondering -- do we need to repeat the document contents in the 'clusters' 
response section? Assuming that each document in the index has a unique ID, we 
could reduce the size of the response by just referencing documents by IDs like 
this:
\\
{code}
<lst name="clusters">
 <int name="numClusters">3</int>
 <lst name="cluster">
  <lst name="labels">
   <str name="label">GPU VPU Clocked</str>
  </lst>
  <lst name="docs">
   <str name="doc">EN7800GTX/2DHTV/256M</str>
   <str name="doc">100-435805</str>
  </lst>
 </lst>
 <lst name="cluster">
  <lst name="labels">
   <str name="label">Hard Drive</str>
  </lst>
  <lst name="docs">
   <str name="doc">6H500F0</str>
   <str name="doc">SP2514N</str>
  </lst>
 </lst>
 <lst name="cluster">
  <lst name="labels">
   <str name="label">Other Topics</str>
  </lst>
  <lst name="docs">
   <str name="doc">9885A004</str>
  </lst>
 </lst>
</lst>
{code}
Actually, this is what I've implemented in the patch.

Also, in case of hierarchical clusters I've introduced a grouping entity called 
clusters so that the top- and sub-levels of the response are consistent (see 
unit tests). Please let me know if this makes sense.
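
To illustrate the ID-referencing idea above: a client that already has the documents from the main result list could join the cluster entries back to them by unique ID. This is only a minimal sketch in plain Java; ClusterRef and ClusterResolver are hypothetical names, not classes from the patch, Solr or Carrot2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical client-side model: a cluster carries labels and document IDs only.
final class ClusterRef {
    final List<String> labels;
    final List<String> docIds;

    ClusterRef(List<String> labels, List<String> docIds) {
        this.labels = labels;
        this.docIds = docIds;
    }
}

final class ClusterResolver {
    // For each cluster (in input order), look up its document IDs in the
    // documents already returned by the main query. IDs with no matching
    // document are skipped.
    static List<List<Map<String, Object>>> resolve(
            List<ClusterRef> clusters, Map<String, Map<String, Object>> docsById) {
        List<List<Map<String, Object>>> resolved = new ArrayList<>();
        for (ClusterRef cluster : clusters) {
            List<Map<String, Object>> docs = new ArrayList<>();
            for (String id : cluster.docIds) {
                Map<String, Object> doc = docsById.get(id);
                if (doc != null) {
                    docs.add(doc);
                }
            }
            resolved.add(docs);
        }
        return resolved;
    }
}
```

The trade-off is a smaller response at the cost of one extra map lookup per clustered document on the client side.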

# h4. Build: compile warnings about missing SimpleXML

SimpleXML is one of the problematic dependencies as it's GPL. Luckily, it's not 
needed at runtime, but it generates warnings about missing dependencies at 
compile time. So the options are either to live with the warnings or to add 
SimpleXML (version 1.7.2) to get rid of them.

# h4. Build: copying of protowords.txt etc.

The patch includes lexical files both in the 
contrib/clustering/src/java/test/resources/ and in the examples dir. I'm 
not sure how this is handled though -- do you keep copies in the repository or 
copy those somehow in the build?

# h4. Highlighting

This is the bit I've not yet fully analyzed. In general, Carrot2 should handle 
full documents fairly well (up to, say, a few hundred kB each); it's just the 
number of documents that must be in the order of hundreds. Therefore, 
highlighting is not mandatory, but it may sometimes improve the quality of 
clusters.

I was wondering: if highlighting is performed earlier in the Solr pipeline, 
could this be reused during clustering? One possible approach could be that 
clustering uses whatever is fed from the pipeline: if highlighting is enabled, 
clustering will be performed on the highlighted content; if there was no 
highlighting, we'd cluster full documents. Not sure if that's reasonable / 
possible to implement though.
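
One way to read the "use whatever is fed from the pipeline" idea above is a per-document fallback: prefer highlighted snippets when some were produced, otherwise use the full stored text. The sketch below is plain Java under that assumption; ClusteringInput, textFor and the two maps are hypothetical and do not reflect how Solr's highlighting component actually exposes its output.

```java
import java.util.List;
import java.util.Map;

final class ClusteringInput {
    // Returns the text to cluster for one document: the joined highlight
    // snippets if any were produced upstream, otherwise the full field value.
    // Documents with neither yield an empty string.
    static String textFor(String docId,
                          Map<String, List<String>> snippetsByDocId,
                          Map<String, String> fullTextByDocId) {
        List<String> snippets = snippetsByDocId.get(docId);
        if (snippets != null && !snippets.isEmpty()) {
            return String.join(" ... ", snippets);
        }
        String full = fullTextByDocId.get(docId);
        return full == null ? "" : full;
    }
}
```

With this shape the clustering step does not need to know whether highlighting was enabled; it only asks for the text of each document.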

# h4. Documentation (wiki) updates

Once we stabilise the ideas, I'm happy to update the wiki with regard to the 
algorithms used (Lingo/STC) and passing additional parameters.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch



[jira] Issue Comment Edited: (SOLR-769) Support Document and Search Result clustering

2009-03-11 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680942#action_12680942
 ] 

Stanislaw Osinski edited comment on SOLR-769 at 3/11/09 10:46 AM:
--

Hi All,

I've just uploaded a patch that passes unit tests and has a working example, but 
this is by no means a final version. A few outstanding questions / issues:

1. Response structure.

I was wondering -- do we need to repeat the document contents in the 'clusters' 
response section? Assuming that each document in the index has a unique ID, we 
could reduce the size of the response by just referencing documents by IDs like 
this:
\\
{code}
<lst name="clusters">
 <int name="numClusters">3</int>
 <lst name="cluster">
  <lst name="labels">
   <str name="label">GPU VPU Clocked</str>
  </lst>
  <lst name="docs">
   <str name="doc">EN7800GTX/2DHTV/256M</str>
   <str name="doc">100-435805</str>
  </lst>
 </lst>
 <lst name="cluster">
  <lst name="labels">
   <str name="label">Hard Drive</str>
  </lst>
  <lst name="docs">
   <str name="doc">6H500F0</str>
   <str name="doc">SP2514N</str>
  </lst>
 </lst>
 <lst name="cluster">
  <lst name="labels">
   <str name="label">Other Topics</str>
  </lst>
  <lst name="docs">
   <str name="doc">9885A004</str>
  </lst>
 </lst>
</lst>
{code}
Actually, this is what I've implemented in the patch.

Also, in case of hierarchical clusters I've introduced a grouping entity called 
clusters so that the top- and sub-levels of the response are consistent (see 
unit tests). Please let me know if this makes sense.
\\
\\
\\
2. Build: compile warnings about missing SimpleXML

SimpleXML is one of the problematic dependencies as it's GPL. Luckily, it's not 
needed at runtime, but it generates warnings about missing dependencies at 
compile time. So the options are either to live with the warnings or to add 
SimpleXML (version 1.7.2) to get rid of them.
\\
\\
\\
3. Build: copying of protowords.txt etc

The patch includes lexical files both in the 
contrib/clustering/src/java/test/resources/ and in the examples dir. I'm 
not sure how this is handled though -- do you keep copies in the repository or 
copy those somehow in the build?
\\
\\
\\
4. Highlighting

This is the bit I've not yet fully analyzed. In general, Carrot2 should handle 
full documents fairly well (up to, say, a few hundred kB each); it's just the 
number of documents that must be in the order of hundreds. Therefore, 
highlighting is not mandatory, but it may sometimes improve the quality of 
clusters.

I was wondering: if highlighting is performed earlier in the Solr pipeline, 
could this be reused during clustering? One possible approach could be that 
clustering uses whatever is fed from the pipeline: if highlighting is enabled, 
clustering will be performed on the highlighted content; if there was no 
highlighting, we'd cluster full documents. Not sure if that's reasonable / 
possible to implement though.
\\
\\
\\
5. Documentation (wiki) updates

Once we stabilise the ideas, I'm happy to update the wiki with regard to the 
algorithms used (Lingo/STC) and passing additional parameters.


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-02-10 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12672281#action_12672281
 ] 

Stanislaw Osinski commented on SOLR-769:


Hi Grant,

I've added a Carrot2 issue referring to point 3 on your TODO list: 
http://issues.carrot2.org/browse/CARROT-457. I'll be looking into this over the 
weekend.

Staszek

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch





[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-23 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12642182#action_12642182
 ] 

Stanislaw Osinski commented on SOLR-769:


Bruce,

For performance of the clustering algorithm alone, please take a look at: 
http://project.carrot2.org/algorithms.html
Obviously, you'd need to add the overhead of fetching the snippets / documents 
from the index. I'm not sure how many are fetched or whether they come from 
Solr's cache, so I can't say whether clustering or fetching time dominates.

Cheers,

Staszek

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch

