from:"Stanislaw Osinski \(JIRA\)"

[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845441#action_12845441
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

Hi Robert,

Lucene dependency is the only change, right? Or did you also upgrade Carrot2, 
e.g. from 3.1 to 3.2? If the latter is the case, the number of clusters may have 
changed, e.g. because we tuned stop words or other algorithm attributes.

S.



 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845459#action_12845459
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

I was about to offer advice similar to Grant's, but wanted to wait to confirm 
the scope of changes.

If it was only a Lucene dependency update, and assuming that the update 
didn't change the documents fed to Carrot2 in tests, the results shouldn't 
change. Carrot2 uses Lucene interfaces internally, but the tokenizer is not the 
standard Lucene one, so there should be no Version.LUCENE_* issues as far as I 
can tell.

I haven't got the Solr code handy, but maybe the test performs clustering on 
summaries generated from the original test documents, and Lucene 3.x introduces 
some changes in the way summaries are generated?

If the clusters look reasonable, the problem is probably not critical, but 
still worth investigation to make sure it's not a bug of some kind.

S.



[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845462#action_12845462
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

Yeah, the clusters look good. When you're done with upgrading Lucene to 3.x, we 
could also upgrade Carrot2 to version 3.2.0, which is LGPL-free and could be 
distributed together with Solr.

S.


[jira] Resolved: (SOLR-1809) Carrot2 clustering time logging

2010-03-07 Thread Stanislaw Osinski (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-1809.
-

Resolution: Invalid

Hi Erik! You're right, {{debugQuery}} should be enough for most cases. 
Resolving as invalid.

 Carrot2 clustering time logging
 ---

 Key: SOLR-1809
 URL: https://issues.apache.org/jira/browse/SOLR-1809
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
 Fix For: 1.5

 Attachments: SOLR-1809.patch


 It may be useful to log the amount of time Carrot2 spent on clustering. This 
 should be helpful when debugging performance issues.


[jira] Created: (SOLR-1809) Carrot2 clustering time logging

2010-03-05 Thread Stanislaw Osinski (JIRA)

Carrot2 clustering time logging
---

 Key: SOLR-1809
 URL: https://issues.apache.org/jira/browse/SOLR-1809
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
 Fix For: 1.5


It may be useful to log the amount of time Carrot2 spent on clustering. This 
should be helpful when debugging performance issues.
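
A minimal sketch of the kind of timing log proposed here. It makes no claim about what the attached patch does: TimedClustering and timed are hypothetical names, the clustering step is modeled as a plain Supplier, and java.util.logging is used only to keep the example self-contained (Solr's own logging setup and preferred level may differ).

```java
import java.util.function.Supplier;
import java.util.logging.Logger;

/** Illustrative helper, not part of Solr or Carrot2. */
final class TimedClustering {
    private static final Logger LOG = Logger.getLogger(TimedClustering.class.getName());

    /** Runs the given clustering step and logs how long it took, even if it throws. */
    static <T> T timed(String engineName, Supplier<T> clustering) {
        long start = System.nanoTime();
        try {
            return clustering.get();
        } finally {
            long millis = (System.nanoTime() - start) / 1_000_000;
            // FINE is an arbitrary choice; the appropriate level is a project decision.
            LOG.fine("Clustering with " + engineName + " took " + millis + " ms");
        }
    }
}
```

The wrapper returns whatever the clustering step returns, so it could be placed around an existing call without changing its result.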


[jira] Updated: (SOLR-1809) Carrot2 clustering time logging

2010-03-05 Thread Stanislaw Osinski (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-1809:


Attachment: SOLR-1809.patch

An initial patch. I'm not sure what Solr's logging policies are, so feel free to 
change the level as appropriate.


[jira] Commented: (SOLR-1692) CarrotClusteringEngine produce summary does nothing

2010-01-02 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795925#action_12795925
]

Stanislaw Osinski commented on SOLR-1692:
-

{quote}
bq. Where should the configuration of the highlighter we use for clustering
come from?

We have all the code hooked in for it already, we're just ignoring the output.
{quote}

To avoid confusion and questions along the lines of "why don't the clusters
match the (highlighted) documents I'm seeing?", I'd suggest a slightly more
elaborate scenario for the clustering highlighter configuration:

1. If main Solr highlighting is disabled, use the clustering component's
highlighter settings.
2. If main Solr highlighting is enabled, use the main highlighter's
configuration as the defaults and let the clustering-specific highlighter
configuration override the defaults.

If we do it this way, we'll minimize the chances of users accidentally
performing clustering on documents different (differently highlighted) than
those they will see.
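
The two rules above could be sketched roughly as follows. This is an illustration rather than the clustering component's code: the class and method names are hypothetical, settings are modeled as plain string maps, and the hl.* keys in main() are only sample data.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: resolves the highlighter settings used for clustering. */
final class ClusteringHighlightConfig {

    /**
     * @param mainHighlightingEnabled whether main Solr highlighting is on for the request
     * @param mainSettings            main highlighter settings (hypothetical key/value form)
     * @param clusteringSettings      clustering-specific highlighter settings
     */
    static Map<String, String> resolve(boolean mainHighlightingEnabled,
                                       Map<String, String> mainSettings,
                                       Map<String, String> clusteringSettings) {
        Map<String, String> effective = new HashMap<>();
        if (mainHighlightingEnabled) {
            // Rule 2: the main highlighter's configuration provides the defaults.
            effective.putAll(mainSettings);
        }
        // Rule 1, and the override half of rule 2: clustering-specific settings win.
        effective.putAll(clusteringSettings);
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> mainCfg = Map.of("hl.fragsize", "100", "hl.snippets", "1");
        Map<String, String> clusteringCfg = Map.of("hl.fragsize", "200");
        System.out.println(resolve(true, mainCfg, clusteringCfg));
        System.out.println(resolve(false, mainCfg, clusteringCfg));
    }
}
```

With main highlighting enabled, any setting the clustering configuration does not mention is inherited from the main highlighter, which is what keeps the clustered text close to what users see.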

bq. It would be great if Carrot2 could also just use the analysis that
Lucene/Solr produces; that way it would be much easier to configure stopwords,
HTML stripping, etc.

This one would require some larger changes to Carrot2 internals. We do use
Lucene infrastructure for preprocessing (currently for tokenization), but I can
investigate if we can extend that further. A potential problem here is that
very often the set of stopwords you use for document retrieval may not work
equally well for clustering. I've filed a [Carrot2-specific
issue|http://issues.carrot2.org/browse/CARROT-606] for it and will try to come
up with something.

CarrotClusteringEngine produce summary does nothing
---

Key: SOLR-1692
URL: https://issues.apache.org/jira/browse/SOLR-1692
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Fix For: 1.5

Attachments: SOLR-1692.patch

In the CarrotClusteringEngine, the produceSummary option does nothing, as the
results of doing the highlighting are just ignored.


[jira] Commented: (SOLR-236) Field collapsing

2009-12-29 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795067#action_12795067
 ] 

Stanislaw Osinski commented on SOLR-236:


Hi Grant,

{quote}
I would note, in looking at the Carrot2 code, they actually have a 
ByFieldClusteringAlgorithm (what they call synthetic clustering) which does 
field collapsing/clustering on a value of a field. To quote the javadocs:

Clusters documents into a flat structure based on the values of some field of 
the documents. By default the {@link Document#SOURCES} field is used and  
Name of the field to cluster by. Each non-null scalar field value with distinct 
hash code will give rise to a single cluster, named using the {@link 
Object#toString()} value of the field. If the field value is a collection, the 
document will be assigned to all clusters corresponding to the values in the 
collection. Note that arrays will not be 'unfolded' in this way.

I don't know how it performs, but it seems like it would at least be worth 
investigating.
{quote}

Carrot2's {{ByFieldClusteringAlgorithm}} is very simple. It literally throws 
everything into a hash map based on the field value ([source 
code|http://fisheye3.atlassian.com/browse/carrot2/trunk/core/carrot2-algorithm-synthetic/src/org/carrot2/clustering/synthetic/ByFieldClusteringAlgorithm.java?r=trunk#l99]).
 This algorithm is used in our live demo to [cluster by news 
source|http://search.carrot2.org/stable/search?source=boss-news&query=iphone&algorithm=source].
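
A rough sketch of the hash-map idea described above. It is not Carrot2's actual code: documents are modeled as plain maps, ByFieldGrouping is a hypothetical name, and clusters simply hold document indices. The handling of collection values follows the quoted javadoc.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: groups documents by the value of one field. */
final class ByFieldGrouping {

    /** Maps each distinct field value to the indices of the documents that carry it. */
    static Map<Object, List<Integer>> group(List<Map<String, Object>> docs, String field) {
        Map<Object, List<Integer>> clusters = new LinkedHashMap<>();
        for (int i = 0; i < docs.size(); i++) {
            Object value = docs.get(i).get(field);
            if (value == null) {
                continue; // documents without the field join no cluster
            }
            if (value instanceof Collection) {
                // A collection value assigns the document to one cluster per element.
                for (Object element : (Collection<?>) value) {
                    if (element != null) {
                        add(clusters, element, i);
                    }
                }
            } else {
                add(clusters, value, i);
            }
        }
        return clusters;
    }

    private static void add(Map<Object, List<Integer>> clusters, Object value, int docIndex) {
        clusters.computeIfAbsent(value, k -> new ArrayList<>()).add(docIndex);
    }
}
```

A cluster label would then be the toString() of each map key. Arrays are deliberately not unfolded here, matching the note in the javadoc.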

{quote}
Note, they also have a synthetic one for collapsing based on URL: 
ByUrlClusteringAlgorithm
{quote}

This one creates a [hierarchy based on the URL 
segments|http://search.carrot2.org/stable/search?source=boss-web&query=solr&algorithm=url&results=200]
 and might be useful to create by-domain collapsing if needed.

In general, my rough guess is that the criteria for content-based 
collapsing would be closer to duplicate detection than to the type of 
grouping Carrot2 produces.

 Field collapsing
 

 Key: SOLR-236
 URL: https://issues.apache.org/jira/browse/SOLR-236
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Emmanuel Keller
Assignee: Shalin Shekhar Mangar
 Fix For: 1.5

 Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
 collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
 collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch, 
 field-collapse-4-with-solrj.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, 
 field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
 field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
 field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
 quasidistributed.additional.patch, SOLR-236-FieldCollapsing.patch, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
 SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, 
 SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, 
 SOLR-236_collapsing.patch


 This patch includes a new feature called field collapsing.
 It is used to collapse a group of results with a similar value for a given 
 field into a single entry in the result set. Site collapsing is a special case 
 of this, where all results for a given web site are collapsed into one or two 
 entries in the result set, typically with an associated "more documents from 
 this site" link. See also duplicate detection.
 http://www.fastsearch.com/glossary.aspx?m=48&amid=299
 The implementation adds 3 new query parameters (SolrParams):
 - collapse.field: the field used to group results
 - collapse.type: normal (default value) or adjacent
 - collapse.max: how many continuous results are allowed before collapsing
 TODO (in progress):
 - More documentation (on source code)
 - Test cases
 Two patches:
 - field_collapsing.patch for current development version
 - field_collapsing_1.1.0.patch for Solr-1.1.0
 P.S.: Feedback and misspelling correction are welcome ;-)
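
One possible reading of the collapse.type and collapse.max parameters from the description above, sketched under stated assumptions. This is not the patch's implementation: CollapseSketch is a hypothetical name, the input is just the collapse.field value of each result in rank order, "adjacent" is taken to count only consecutive results with the same value, and "normal" is taken to count all results with that value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: decides which ranked results survive collapsing. */
final class CollapseSketch {

    /**
     * @param fieldValues value of the collapse field for each result, in rank order (non-null)
     * @param adjacent    true for the assumed "adjacent" behavior, false for "normal"
     * @param max         how many results with the same value are kept before collapsing
     * @return indices of the results that are kept
     */
    static List<Integer> collapse(List<String> fieldValues, boolean adjacent, int max) {
        List<Integer> kept = new ArrayList<>();
        Map<String, Integer> seen = new HashMap<>();
        String previous = null;
        int run = 0;
        for (int i = 0; i < fieldValues.size(); i++) {
            String value = fieldValues.get(i);
            int count;
            if (adjacent) {
                // Count only the current run of consecutive equal values.
                run = value.equals(previous) ? run + 1 : 1;
                previous = value;
                count = run;
            } else {
                // Count every occurrence of the value seen so far.
                count = seen.merge(value, 1, Integer::sum);
            }
            if (count <= max) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```

Under this reading, adjacent mode lets a value reappear later in the list after a different value has interrupted the run, while normal mode does not.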


[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-28 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760238#action_12760238
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

The required change is right at the end of the big diff:

{noformat}
Index: contrib/clustering/src/test/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngineTest.java
===================================================================
--- contrib/clustering/src/test/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngineTest.java (revision 819270)
+++ contrib/clustering/src/test/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngineTest.java (working copy)
@@ -40,11 +40,11 @@
 @SuppressWarnings("unchecked")
 public class CarrotClusteringEngineTest extends AbstractClusteringTest {
   public void testCarrotLingo() throws Exception {
-    checkEngine(getClusteringEngine("default"), 9);
+    checkEngine(getClusteringEngine("default"), 10);
   }
 
   public void testCarrotStc() throws Exception {
-    checkEngine(getClusteringEngine("stc"), 2);
+    checkEngine(getClusteringEngine("stc"), 1);
   }
 
   public void testWithoutSubclusters() throws Exception {
{noformat}

 Upgrade Carrot2 to version 3.1.0
 

 Key: SOLR-1314
 URL: https://issues.apache.org/jira/browse/SOLR-1314
 Project: Solr
  Issue Type: Task
Reporter: Stanislaw Osinski
Assignee: Grant Ingersoll
 Fix For: 1.4

 Attachments: SOLR-1314.patch


 As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes 
 in clustering algorithms and improved clustering in Chinese. The upgrade 
 should be a matter of upgrading {{carrot2-mini.jar}} and 
 {{google-collections.jar}}.


[jira] Updated: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-27 Thread Stanislaw Osinski (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-1314:


Attachment: SOLR-1314.patch

Hi Grant,

I've built Carrot2 3.1.0 binaries and tested them with Solr trunk. Attached is 
a patch that upgrades the libs to Carrot2 3.1.0 and fixes one unit test.

S.


[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-25 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759667#action_12759667
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Hi Grant,

bq. Now that Lucene is final, can we finalize the jar for this one? 

Sure, over the weekend we'll be making an official Carrot2 3.1.0 release. As 
part of that process I'll check if the Solr plugin is working fine and will 
post the final JAR here.

bq. Also, this final JAR will handle the license and FastVector stuff, right?

Correct. The following commit removed it from trunk and hence the 3.1.0 release:

http://fisheye3.atlassian.com/changelog/carrot2/?cs=3694

S.

 Upgrade Carrot2 to version 3.1.0
 

 Key: SOLR-1314
 URL: https://issues.apache.org/jira/browse/SOLR-1314
 Project: Solr
  Issue Type: Task
Reporter: Stanislaw Osinski
Assignee: Grant Ingersoll
 Fix For: 1.4


 As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes 
 in clustering algorithms and improved clustering in Chinese. The upgrade 
 should be a matter of upgrading {{carrot2-mini.jar}} and 
 {{google-collections.jar}}.


[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-23 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758843#action_12758843
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Hi Grant,

I've made Carrot2's dependency on Smart Chinese Analyzer optional, so no 
exceptions should be thrown when the big JAR is not in the classpath. As usual, 
download from here:

http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/

S.


[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-16 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756110#action_12756110
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Hi Grant,

I've just dropped the patenting clause entirely. The updated license is in the 
repo and at: http://www.carrot2.org/carrot2.LICENSE.

S.


[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

2009-09-16 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756177#action_12756177
]

Stanislaw Osinski commented on SOLR-1336:
-

Keeping the Chinese analyzer JAR optional sounds good. As Carrot2 also uses it,
I'd need to make sure the clustering contrib doesn't fail when the JAR is not
there and clustering in Chinese is requested (I think I'd simply log a WARN
saying that the Chinese analyzer JAR is required for best clustering results).
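
A minimal sketch of such a check, assuming the presence of the optional JAR can be detected by probing for one of its classes. OptionalAnalyzerCheck is a hypothetical name, the class to probe is passed in by the caller because the exact analyzer class is not named in this thread, and java.util.logging stands in for whatever logging the clustering contrib actually uses.

```java
import java.util.logging.Logger;

/** Illustrative only: detects whether an optional class is on the classpath. */
final class OptionalAnalyzerCheck {
    private static final Logger LOG = Logger.getLogger(OptionalAnalyzerCheck.class.getName());

    /** Returns true if the named class can be loaded; logs a warning otherwise. */
    static boolean isAvailable(String className) {
        try {
            // initialize=false: only check presence, do not run static initializers.
            Class.forName(className, false, OptionalAnalyzerCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            LOG.warning(className + " not found on the classpath; "
                + "the Chinese analyzer JAR is required for best clustering results");
            return false;
        }
    }
}
```

The caller could then fall back to a simpler tokenizer for Chinese when the check returns false, instead of failing the clustering request.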

Add support for lucene's SmartChineseAnalyzer
-

Key: SOLR-1336
URL: https://issues.apache.org/jira/browse/SOLR-1336
Project: Solr
Issue Type: New Feature
Components: Analysis
Reporter: Robert Muir
Attachments: SOLR-1336.patch, SOLR-1336.patch, SOLR-1336.patch

SmartChineseAnalyzer was contributed to Lucene; it indexes Simplified Chinese
text as words.
If the factories for the tokenizer and word token filter are added to Solr, it
can be used, although there should be a sample config or wiki entry showing
how to apply the built-in stopwords list.
This is because the list doesn't contain actual stopwords, but must be used to
prevent indexing punctuation...
Note: we did some refactoring/cleanup on this analyzer recently, so it would
be much easier to do this after the next Lucene update.
It has also been moved out of -analyzers.jar due to size, and now builds in
its own smartcn jar file, so that would need to be added if this feature is
desired.


[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-15 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755657#action_12755657
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

As a follow-up of the discussion on legal-discuss, I've removed the dependency 
on {{FastVector}} from Carrot2's STC algorithm. The binaries are in the usual 
place:

http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/


[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-13 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754699#action_12754699
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Good point, Grant. Though the classes we included are merely definitions of 
native methods, it's better to keep them separate. I've just reverted to a 
separate {{nni.jar}}; binaries are here:

http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/


[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-12 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754593#action_12754593
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Let me build C2 with Lucene 2.9 RC4, will post a download URL in a while.


[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-12 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754597#action_12754597
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Hi Grant,

Here's Carrot2 3.1-dev built with Lucene 2.9-rc4:

http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/

Please note a few things about the dependencies:

* {{nni.jar}} is now part of {{carrot2-mini.jar}}, so no need to download it 
separately
* dependencies upgraded to the newer versions 
(http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/carrot2-mini-3.1-dev.pom),
 Lucene entry in the POM still needs to be upgraded for version 2.9
* Carrot2 provides experimental support for Chinese Simplified based on the 
smart cn analyzer -- does Solr distribute that JAR by default?

Please let me know if you have any problems upgrading.

S.


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-28 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736030#action_12736030
]

Stanislaw Osinski commented on SOLR-769:

Hi Grant,

There's one more thing: we're planning to release version 3.1.0 of Carrot2 with
certain bug fixes in clustering algorithms and better support for Chinese (using
the new analyzer from Lucene). Our plan is to release after Lucene 2.9 is out,
but before Solr 1.4, so that the latter would have a newer version of Carrot2
on board (should be just a matter of replacing Carrot2 JAR / upgrading version
of the downloaded dependency). Would that make sense? Should I create a
separate issue for it, or rather reopen this one?

Thanks,

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4

Attachments: clustering-componet-shard.patch, clustering-libs.tar,
clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch

Clustering is a useful tool for working with documents and search results,
similar to the notion of dynamic faceting. Carrot2
(http://project.carrot2.org/) is a nice, BSD-licensed, library for doing
search results clustering. Mahout (http://lucene.apache.org/mahout) is well
suited for whole-corpus clustering.
The patch lays out a contrib module that starts off w/ an integration of a
SearchComponent for doing clustering and an implementation using Carrot. In
search results mode, it will use the DocList as the input for the cluster.
While Carrot2 comes w/ a Solr input component, it is not the same as the
SearchComponent that I have in that the Carrot example actually submits a
query to Solr, whereas my SearchComponent is just chained into the Component
list and uses the ResponseBuilder to add in the cluster results.
While not fully fleshed out yet, the collection based mode will take in a
list of ids or just use the whole collection and will produce clusters.
Since this is a longer, typically offline task, there will need to be some
type of storage mechanism (and replication??) for the clusters. I _may_
push this off to a separate JIRA issue, but I at least want to present the
use case as part of the design of this component/contrib. It may even make
sense that we split this out, such that the building piece is something like
an UpdateProcessor and then the SearchComponent just acts as a lookup
mechanism.


[jira] Created: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-07-28 Thread Stanislaw Osinski (JIRA)

Upgrade Carrot2 to version 3.1.0


 Key: SOLR-1314
 URL: https://issues.apache.org/jira/browse/SOLR-1314
 Project: Solr
  Issue Type: Task
Reporter: Stanislaw Osinski
 Fix For: 1.4


As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes 
in clustering algorithms and improved clustering in Chinese. The upgrade should 
be a matter of upgrading {{carrot2-mini.jar}} and {{google-collections.jar}}.


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-28 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736039#action_12736039
]

Stanislaw Osinski commented on SOLR-769:

Created: SOLR-1314. I'll attach a patch there as soon as Lucene 2.9 is released.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4


[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-07-08 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Stanislaw Osinski updated SOLR-769:
---

Attachment: subcluster-flattening.patch

Hi,

While configuring the clustering component for an algorithm that returns
hierarchical clusters, it took me a while to debug why subclusters wouldn't
appear on the output. It turned out that the default value for the
{{carrot.outputSubClusters}} parameter is {{false}}, which was the opposite of
what I assumed :-) Would it be a problem to change the default to {{true}}, so
that other users avoid the same problem?

Another improvement worth making for the {{carrot.outputSubClusters}} =
{{false}} case is flattening the clusters: returning all documents of the 1st
level clusters, including those contained in the subclusters the user chose not
to output. Without this improvement, many document-cluster assignments may be
lost because some Carrot2 algorithms will assign documents only to the leaf
(deepest in the hierarchy) clusters.

I'm attaching a patch that implements both changes.
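
The flattening described above could look roughly like this. It is a sketch, not the attached patch: Node is a hypothetical stand-in for a hierarchical cluster (not Carrot2's own cluster class), and documents are represented by their ids only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: collects documents from a cluster and all its subclusters. */
final class ClusterFlattening {

    /** Minimal stand-in for a hierarchical cluster. */
    static final class Node {
        final String label;
        final List<String> docIds = new ArrayList<>();
        final List<Node> subclusters = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }
    }

    /** Returns the documents of a cluster and of all its descendants, without duplicates. */
    static Set<String> allDocuments(Node cluster) {
        Set<String> result = new LinkedHashSet<>(cluster.docIds);
        for (Node sub : cluster.subclusters) {
            // Documents assigned only to leaf clusters are pulled up to the top level.
            result.addAll(allDocuments(sub));
        }
        return result;
    }
}
```

When subclusters are not output, each first-level cluster would report allDocuments(cluster) instead of only its directly assigned documents, so leaf-only assignments are not lost.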

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
Fix For: 1.4

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-30 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725739#action_12725739
]

Stanislaw Osinski commented on SOLR-769:

bq. Is labels needed because there could be multiple labels per cluster in
the future? (I assume yes)

Correct. Currently neither of Carrot2's algorithms creates clusters with
multiple labels, but it's quite likely that there are other algorithms that can
do that.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
Fix For: 1.4

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-24 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712534#action_12712534
]

Stanislaw Osinski commented on SOLR-769:

In fact, you can set Carrot2 attributes (both init- and request-time) in the
Solr config file; this should also work without the patch. Just add:

to the search component element. See
http://wiki.apache.org/solr/ClusteringComponent for an example. You'll find a
list of Carrot2 attributes, their ids and descriptions at:
http://download.carrot2.org/stable/manual/#chapter.components.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-24 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712545#action_12712545
]

Stanislaw Osinski commented on SOLR-769:

Ah, I should have mentioned that up front -- Carrot2 will try to convert the
string into the type accepted by the attribute. For class-typed attributes,
it will try to load the class using the current thread's context classloader.
Conversions are also available for numeric, boolean and enum attributes (see:
http://download.carrot2.org/head/javadoc/org/carrot2/util/attribute/AttributeBinder.AttributeTransformerFromString.html).
Please let me know if that works for you.
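
To make the conversion rule concrete, here is a rough Python analogue. It is not Carrot2 code: the function {{convert_attribute}} and the {{Mode}} enum are hypothetical, and Python's {{importlib}} stands in for the Java context classloader.

```python
import importlib
from enum import Enum


def convert_attribute(value, target):
    """Convert a configuration string to the type an attribute expects."""
    if target is bool:
        return value.strip().lower() == "true"
    if target is type:
        # Class-typed attribute: the string is a fully qualified class name.
        module_name, _, class_name = value.rpartition(".")
        return getattr(importlib.import_module(module_name), class_name)
    if issubclass(target, Enum):
        return target[value]  # look the constant up by its name
    return target(value)  # int, float, str


class Mode(Enum):
    """Hypothetical enum-typed attribute."""
    FAST = 1
    ACCURATE = 2
```

For instance, {{convert_attribute("42", int)}} yields the integer 42 and {{convert_attribute("ACCURATE", Mode)}} yields {{Mode.ACCURATE}}, so values written as plain strings in a config file end up as typed values.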

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-23 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712421#action_12712421
]

Stanislaw Osinski commented on SOLR-769:

Pasting the comment I made on the list:

The catch with the analyzer is that this specific attribute is an
initialization-time attribute, so you need to add it to the {{initAttributes}}
map in the {{init()}} method of {{CarrotClusteringEngine}}.

Please let me know if this solves the problem. If not, I'll investigate further.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4

Attachments: clustering-componet-shard.patch, clustering-libs.tar,
clustering-libs.tar, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.tar, SOLR-769.zip

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-16 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710087#action_12710087
]

Stanislaw Osinski commented on SOLR-769:

Thanks Grant! Looking forward to seeing the code in the repo!

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4

Attachments: clustering-libs.tar, clustering-libs.tar,
SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar,
SOLR-769.zip

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-04-03 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695463#action_12695463
]

Stanislaw Osinski commented on SOLR-769:

Hi Grant,

If you download http://download.carrot2.org/stable/carrot2-java-api-3.0.1.zip,
you'll find licenses in the lib/ folder of the distribution. That distribution
contains slightly more JARs than needed for Solr (which uses carrot2-mini.jar),
so you'd need to pick only those that are relevant.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-03-22 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688171#action_12688171
]

Stanislaw Osinski commented on SOLR-769:

bq. Also, you say C2 can handle full docs, is it feasible, then to implement it
for the offline mode I have in mind, whereby you cluster the whole collection
offline and then store the clusters for retrieval? I haven't implemented this
yet, but was thinking some people will be interested in full corpus clustering.
The nice thing, then, is that as new documents come in, they can be added to
existing clusters (and maybe periodically, we re-cluster). Just thinking
out loud.

We have two variables here: the length of docs and the number of docs. Carrot2
is suitable for small numbers of docs (up to say 1000). If the docs are short
(a paragraph or so), the clustering should be pretty fast, suitable for on-line
processing (see: http://project.carrot2.org/algorithms.html). If the documents
get longer, Carrot2 will still handle them, but will need more time for
processing; I'll try to do some measurements. But C2 is not useful for the
whole-collection case: it performs all processing in memory, and here we'd
need a totally different class of algorithm, something along the lines of
Mahout developments.

bq. Hmm, that's an interesting thought. We could check to see if highlighting
is done first.

To quickly summarise the pros and cons of relying on highlighting being done
outside of the clustering component:

Pros:

* we avoid duplication of processing (highlighting being done twice)
* simpler code of the clustering component, less configuration

Cons:

* if someone doesn't want highlighting in the search results, the clustering is
likely to take more time (because it operates on full documents, and it's
controlled globally)
* depending on the highlighter, we may get some markup in the summaries, which
may affect clustering (I'd need to check how Carrot2 handles that)

bq. Should the MockClusteringAlgorithm be under the test source tree and not
the main one? I moved it in the patch to follow

Absolutely, it should be in the test source.

bq. I don't think we need to output the number of clusters, since that will be
obvious from the list size. I dropped it in the patch to follow

Makes sense, I kept it because the original version had it.

bq. Also, on the response structure, we certainly could make it optional,
although it means having to go do a lookup in the real doc list, which could be
less than fun.

By lookup you mean the lookup in the XML response? Here again we have a
trade-off between the length of the response and ease of processing: if we
repeat document titles / snippets in the clusters structure, we at least double
the response size (at least, because the same document may belong to many
clusters), but can potentially save some lookups. But if we want to get some
other fields of a document (other than those we repeat in the clusters list),
we'd still need a lookup.

To sum up, my intuition would be to avoid duplication and stick with document
ids in the cluster list (this is what we do in Carrot2 XMLs as well).
Optionally, the clustering component could have a list of configurable fields
to be repeated in the cluster list if that's really helpful in real-world use
cases.
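
The trade-off above can be illustrated with a small Python sketch. It assumes nothing about the actual component: {{expand_clusters}}, {{repeat_fields}} and the placeholder titles are hypothetical names and data; only the document ids come from the example response discussed in this issue.

```python
def expand_clusters(clusters, docs_by_id, repeat_fields=()):
    """Build the cluster list, always carrying document ids and optionally
    repeating the configured fields next to each id."""
    expanded = []
    for cluster in clusters:
        entries = []
        for doc_id in cluster["docs"]:
            entry = {"id": doc_id}
            for field in repeat_fields:
                # Repeating a field saves the client a lookup at the cost
                # of a larger response.
                entry[field] = docs_by_id[doc_id][field]
            entries.append(entry)
        expanded.append({"labels": cluster["labels"], "docs": entries})
    return expanded


# Placeholder titles; the ids are taken from the example response.
DOCS_BY_ID = {
    "6H500F0": {"title": "placeholder title A"},
    "SP2514N": {"title": "placeholder title B"},
    "9885A004": {"title": "placeholder title C"},
}
CLUSTERS = [
    {"labels": ["Hard Drive"], "docs": ["6H500F0", "SP2514N"]},
    {"labels": ["Other Topics"], "docs": ["9885A004"]},
]
```

With the default empty {{repeat_fields}} the clusters carry ids only, and any other field has to be fetched from the main document list by id.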

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4

[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-03-20 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Stanislaw Osinski updated SOLR-769:
---

Attachment: (was: SOLR-769.patch)

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4

[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-03-20 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Stanislaw Osinski updated SOLR-769:
---

Attachment: SOLR-769.zip

Further code clean-ups, support for passing initialization-time attributes to
Carrot2 algorithms, some comments in the example configuration file.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4

[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-03-18 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Stanislaw Osinski updated SOLR-769:
---

Attachment: (was: SOLR-769-lib.zip)

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Attachments: clustering-libs.tar, clustering-libs.tar,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch

[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-03-18 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Stanislaw Osinski updated SOLR-769:
---

Attachment: SOLR-769-lib.zip

Libs with Carrot2 v3.0.1, which we've just released.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Attachments: clustering-libs.tar, clustering-libs.tar,
SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch

[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-03-11 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Stanislaw Osinski updated SOLR-769:
---

Attachment: SOLR-769-lib.zip
SOLR-769.patch

Yet another patch, this time with passing unit tests and a working example.
I'll make some more comments in a sec. Please use the SOLR-769-lib.zip libs
with this patch.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Attachments: clustering-libs.tar, clustering-libs.tar,
SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-03-11 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680942#action_12680942
]

Stanislaw Osinski commented on SOLR-769:

Hi All,

I've just uploaded a patch that passes unit tests and has a working example,
but this is by no means a final version. A few outstanding questions / issues:

# h4. Response structure.

I was wondering -- do we need to repeat the document contents in the 'clusters'
response section? Assuming that each document in the index has a unique ID, we
could reduce the size of the response by just referencing documents by IDs like
this:
\\
{code}
<lst name="clusters">
  <int name="numClusters">3</int>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">GPU VPU Clocked</str>
    </lst>
    <lst name="docs">
      <str name="doc">EN7800GTX/2DHTV/256M</str>
      <str name="doc">100-435805</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Hard Drive</str>
    </lst>
    <lst name="docs">
      <str name="doc">6H500F0</str>
      <str name="doc">SP2514N</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Other Topics</str>
    </lst>
    <lst name="docs">
      <str name="doc">9885A004</str>
    </lst>
  </lst>
</lst>
{code}
Actually, this is what I've implemented in the patch.

Also, for hierarchical clusters I've introduced a grouping entity called
clusters so that the top- and sub-levels of the response are consistent (see
unit tests). Please let me know if this makes sense.
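
For anyone consuming such a response, a client-side reading of the id-only structure might look like the Python sketch below. It assumes the element and attribute names from the example above ({{lst}}, {{str}}, {{name="cluster"}}, and so on); {{parse_clusters}} is a hypothetical helper and the embedded response is shortened for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the shape of the example response above.
RESPONSE = """
<lst name="clusters">
  <int name="numClusters">2</int>
  <lst name="cluster">
    <lst name="labels"><str name="label">Hard Drive</str></lst>
    <lst name="docs">
      <str name="doc">6H500F0</str>
      <str name="doc">SP2514N</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels"><str name="label">Other Topics</str></lst>
    <lst name="docs"><str name="doc">9885A004</str></lst>
  </lst>
</lst>
"""


def parse_clusters(xml_text):
    """Return a list of {'labels': [...], 'docs': [...]} dictionaries."""
    root = ET.fromstring(xml_text.strip())
    clusters = []
    for cluster in root.findall("lst[@name='cluster']"):
        labels = [e.text for e in
                  cluster.findall("lst[@name='labels']/str[@name='label']")]
        doc_ids = [e.text for e in
                   cluster.findall("lst[@name='docs']/str[@name='doc']")]
        clusters.append({"labels": labels, "docs": doc_ids})
    return clusters
```

The returned ids can then be matched against the ids in the main result list, which is the lookup step the id-only format requires.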

# h4. Build: compile warnings about missing SimpleXML

SimpleXML is one of the problematic dependencies as it's GPL. Luckily, it's not
needed at runtime, but its absence generates warnings about missing
dependencies at compile time. So the options are either to live with the
warnings or to add SimpleXML (version 1.7.2) to get rid of them.

# h4. Build: copying of protowords.txt etc

The patch includes lexical files both in the
contrib/clustering/src/java/test/resources/ and in the examples dir. I'm
not sure how this is handled though -- do you keep copies in the repository or
copy those somehow in the build?

# h4. Highlighting

This is the bit I've not yet fully analyzed. In general, Carrot2 should handle
full documents fairly well (up to say a few hundred kB each); it's just the
number of documents that must be on the order of hundreds. Therefore,
highlighting is not mandatory, but it may sometimes improve the quality of
clusters.

I was wondering, if highlighting is performed earlier in the Solr pipeline,
could this be reused during clustering? One possible approach could be that
clustering uses whatever is fed from the pipeline: if highlighting is enabled,
clustering will be performed on the highlighted content, if there was no
highlighting, we'd cluster full documents. Not sure if that's reasonable /
possible to implement though.
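
The fallback described above could look roughly like this Python sketch. It is an assumption-laden illustration, not Solr code: {{text_for_clustering}}, the {{highlights}} mapping and the tag-stripping regular expression are all hypothetical, and stripping markup is just one possible way to keep highlighter tags out of the clustering input.

```python
import re

# Naive pattern for highlighter markup such as <em>...</em>.
TAG = re.compile(r"<[^>]+>")


def text_for_clustering(doc_id, documents, highlights):
    """Prefer highlighted snippets when the pipeline produced them,
    otherwise fall back to the full document text."""
    snippets = highlights.get(doc_id)
    if snippets:
        # Remove highlighter markup so it does not influence clustering.
        return " ".join(TAG.sub("", snippet) for snippet in snippets)
    return documents[doc_id]


documents = {"d1": "full text of document one", "d2": "full text two"}
highlights = {"d1": ["a <em>hard</em> drive"]}
```

Here {{d1}} is clustered on its snippet and {{d2}}, which has no highlighting, on its full text.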

# h4. Documentation (wiki) updates

Once we stabilise the ideas, I'm happy to update the wiki with regard to the
algorithms used (Lingo/STC) and passing additional parameters.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Attachments: clustering-libs.tar, clustering-libs.tar,
SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch

[jira] Issue Comment Edited: (SOLR-769) Support Document and Search Result clustering

2009-03-11 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680942#action_12680942
]

Stanislaw Osinski edited comment on SOLR-769 at 3/11/09 10:46 AM:
--

Hi All,

I've just uploaded a patch that passes unit tests and has a working example,
but this is by no means a final version. A few outstanding questions / issues:

1. Response structure.

Once we stabilise the ideas, I'm happy to update the wiki with regard to the
algorithms used (Lingo/STC) and passing additional parameters.

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-02-10 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12672281#action_12672281
]

Stanislaw Osinski commented on SOLR-769:

Hi Grant,

I've added a Carrot2 issue referring to point 3 on your TODO list:
http://issues.carrot2.org/browse/CARROT-457. I'll be looking into this over the
weekend.

Staszek

Support Document and Search Result clustering
-

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-23 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12642182#action_12642182
]

Stanislaw Osinski commented on SOLR-769:

Bruce,

For performance of the clustering algorithm alone, please take a look at:
http://project.carrot2.org/algorithms.html
Obviously, you'd need to add the overhead of fetching the snippets / documents
from the index. I'm not sure how many are fetched or whether they come from
Solr's cache, so I can't say whether clustering or fetching time dominates.

Cheers,

Staszek

Support Document and Search Result clustering
-
