subject: "[jira] Commented: (SOLR-769) Support Document and Search Result clustering"

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, clustering-componet-shard.patch, 
 clustering-libs.tar, clustering-libs.tar, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.
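
The "chained into the Component list" design described above can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `ResponseState`, `Component`, `QueryStep` and `ClusteringStep` are stand-ins invented for illustration and are not Solr's SearchComponent or ResponseBuilder API, and the grouping by first word merely stands in for a real clustering algorithm. The point it tries to show is that the clustering step reuses the documents already gathered earlier in the chain and adds its results to the same response, rather than submitting a second query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChainedComponentSketch {

    /** Hypothetical shared state handed down the chain (stand-in for a response builder). */
    static final class ResponseState {
        final List<String> docList = new ArrayList<>();
        final Map<String, Object> response = new LinkedHashMap<>();
    }

    /** Hypothetical component contract: each component reads and extends the shared state. */
    interface Component {
        void process(ResponseState state);
    }

    /** Stand-in for the query step; it fills the doc list with fixed titles. */
    static final class QueryStep implements Component {
        public void process(ResponseState state) {
            state.docList.add("solr clustering");
            state.docList.add("solr faceting");
            state.docList.add("carrot2 clustering");
            state.response.put("numFound", state.docList.size());
        }
    }

    /** Groups the documents already in the state; no second query is issued. */
    static final class ClusteringStep implements Component {
        public void process(ResponseState state) {
            Map<String, List<String>> clusters = new LinkedHashMap<>();
            for (String doc : state.docList) {
                // Toy grouping by first word, standing in for a real algorithm.
                String key = doc.split(" ")[0];
                clusters.computeIfAbsent(key, k -> new ArrayList<>()).add(doc);
            }
            state.response.put("clusters", clusters);
        }
    }

    /** Runs every component in order against one shared state. */
    static Map<String, Object> run(List<Component> chain) {
        ResponseState state = new ResponseState();
        for (Component c : chain) {
            c.process(state);
        }
        return state.response;
    }

    public static void main(String[] args) {
        List<Component> chain = new ArrayList<>();
        chain.add(new QueryStep());
        chain.add(new ClusteringStep());
        System.out.println(run(chain));
    }
}
```

Under these assumptions the clustering step is just one more link in the chain, which is the property the description contrasts with the Carrot2 input component that queries Solr itself.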



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2010-12-13 Thread Koji Sekiguchi (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12970758#action_12970758
]

Koji Sekiguchi commented on SOLR-769:
-

Apologies, Grant, for quoting your comment from 27/Jul/09:

bq. Also applied Stanislaw's patch.

I'm confused by this line:

{code}
List<Document> docs = outputSubClusters ? outCluster.getDocuments() :
outCluster.getAllDocuments();
{code}

According to Carrot2 Javadoc:

http://download.carrot2.org/stable/javadoc/org/carrot2/core/Cluster.html#getAllDocuments%28%29

Should it be:

{code}
List<Document> docs = outputSubClusters ? outCluster.getAllDocuments() :
outCluster.getDocuments();
{code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2010-12-13 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12970760#action_12970760
 ] 

Stanislaw Osinski commented on SOLR-769:


Hi Koji,

Actually, the current code seems right: if we don't output subclusters, we need 
to include all documents of the cluster, including those from its subclusters, 
otherwise the subclusters' documents may not appear in the response at all. But 
if we do output subclusters, we add only the documents assigned specifically to 
the cluster because the subclusters with their documents will be included in 
the response too.

S.
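
The reasoning above can be checked with a small, self-contained sketch. The `Cluster` class here is a hypothetical stand-in written for this example, not the Carrot2 class; it only assumes, as described in the comment, that `getDocuments()` returns the documents assigned directly to a cluster and `getAllDocuments()` also includes those of its subclusters. The `render` method is likewise invented to mimic a recursive response writer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SubclusterOutputSketch {

    /** Hypothetical stand-in for a cluster; not the Carrot2 class. */
    static final class Cluster {
        final String label;
        final List<String> ownDocs = new ArrayList<>();
        final List<Cluster> subclusters = new ArrayList<>();

        Cluster(String label, String... docs) {
            this.label = label;
            this.ownDocs.addAll(Arrays.asList(docs));
        }

        Cluster add(Cluster sub) {
            subclusters.add(sub);
            return this;
        }

        /** Documents assigned directly to this cluster only. */
        List<String> getDocuments() {
            return ownDocs;
        }

        /** Own documents plus those of every nested subcluster. */
        List<String> getAllDocuments() {
            List<String> all = new ArrayList<>(ownDocs);
            for (Cluster sub : subclusters) {
                all.addAll(sub.getAllDocuments());
            }
            return all;
        }
    }

    /** Renders clusters as indented text, using the same ternary as the code under discussion. */
    static String render(List<Cluster> clusters, boolean outputSubClusters, String indent) {
        StringBuilder sb = new StringBuilder();
        for (Cluster c : clusters) {
            List<String> docs = outputSubClusters ? c.getDocuments() : c.getAllDocuments();
            sb.append(indent).append(c.label).append(' ').append(docs).append('\n');
            if (outputSubClusters) {
                // The recursive call emits each subcluster with its own documents.
                sb.append(render(c.subclusters, true, indent + "  "));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Cluster top = new Cluster("Search", "d1").add(new Cluster("Clustering", "d2", "d3"));
        System.out.print(render(Arrays.asList(top), true, ""));
        System.out.print(render(Arrays.asList(top), false, ""));
    }
}
```

With subclusters enabled, `d2` and `d3` appear once, under the nested cluster; with subclusters disabled, they are folded into the parent. Swapping the two branches of the ternary would duplicate them in the first case and drop them in the second.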


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2010-12-13 Thread Koji Sekiguchi (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12970762#action_12970762
 ] 

Koji Sekiguchi commented on SOLR-769:
-

Uh, I needed to read the part with the recursive call. Thanks for the explanation!


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-28 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736030#action_12736030
]

Stanislaw Osinski commented on SOLR-769:

Hi Grant,

There's one more thing: we're planning to release version 3.1.0 of Carrot2 with
certain bug fixes in the clustering algorithm and better support for Chinese (using
the new analyzer from Lucene). Our plan is to release after Lucene 2.9 is out,
but before Solr 1.4, so that the latter would have a newer version of Carrot2
on board (should be just a matter of replacing Carrot2 JAR / upgrading version
of the downloaded dependency). Would that make sense? Should I create a
separate issue for it, or rather reopen this one?

Thanks,


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-28 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736037#action_12736037
]

Grant Ingersoll commented on SOLR-769:
--

bq. Would that make sense? Should I create a separate issue for it, or rather
reopen this one?

Yes, I think that makes sense. Separate issue would be good, this one is long
enough.


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-28 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736039#action_12736039
]

Stanislaw Osinski commented on SOLR-769:

Created: SOLR-1314. I'll attach a patch there as soon as Lucene 2.9 is released.


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-27 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12735635#action_12735635
]

Grant Ingersoll commented on SOLR-769:
--

bq. classloading issues after the handler was removed from solr.war

I think the issue is that the changes you made don't actually include the
clustering code in Solr when running the example. I think we just need to
copy over the clustering JAR from the build directory into the lib, but that is
a bit weird, IMO.

To fix, I'm going to make the example target create a proper Solr home under
contrib/clustering/example. Which, of course, isn't much different from how it
used to be. I am also going to restore the downloads directory for
packaging/release functionality.


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-27 Thread Grant Ingersoll (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12735638#action_12735638
 ] 

Grant Ingersoll commented on SOLR-769:
--

Note, I believe there is also a classloading issue when trying to load the 
carrot algorithm, b/c it does not use the SolrResourceLoader
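
The kind of problem described here can be sketched in plain Java. This is an illustration only: it does not use SolrResourceLoader or any Carrot2 class, `newInstance` is a hypothetical helper, and `java.util.ArrayList` stands in for an algorithm class name read from configuration. It assumes the general issue is that a class named in configuration must be resolved through a class loader that can see the plugin jars, rather than through whichever loader happens to have loaded the calling code.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class PluginLoadingSketch {

    /**
     * Instantiates className through the supplied loader instead of the
     * caller's default loader. Hypothetical helper, not a Solr API.
     */
    static Object newInstance(String className, ClassLoader loader) {
        try {
            Class<?> clazz = Class.forName(className, true, loader);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load " + className, e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a loader that would also see jars in a plugin lib
        // directory. It has no extra URLs here, so it delegates to its parent.
        try (URLClassLoader pluginLoader =
                 new URLClassLoader(new URL[0], PluginLoadingSketch.class.getClassLoader())) {
            Object algorithm = newInstance("java.util.ArrayList", pluginLoader);
            System.out.println(algorithm.getClass().getName());
        }
    }
}
```

In a real deployment the URL array would point at the plugin jars; a class that lives only in those jars would fail to resolve through the default loader but succeed through the plugin-aware one.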


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-27 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12735643#action_12735643
]

Grant Ingersoll commented on SOLR-769:
--

OK, I have committed my changes and believe functionality is restored and is
properly working with the SolrResourceLoader. Also applied Stanislaw's patch.

Still likely need to review how to distribute all of this. My guess is that we
should only include the source, including the build and instructions for
installing, and not even package jars at all since we can't include the LGPL
ones necessary for Carrot2.


Re: [jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-07 Thread Brad Giaccio

If you could, could my patch to handle shards be applied before you reformat,
so I don't have to piece it together again and resubmit?

Brad

On Thu, Jul 2, 2009 at 9:53 AM, Yonik Seeley (JIRA) j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726482#action_12726482]

 Yonik Seeley commented on SOLR-769:
 ---

 Anyone mind if I reformat the source files that currently use tabs?


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-07 Thread Brad Giaccio (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12728270#action_12728270
 ] 

Brad Giaccio commented on SOLR-769:
---

If you could, could my patch to handle shards be applied before you reformat, 
so I don't have to piece it together again and resubmit?

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-07 Thread Yonik Seeley (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12728271#action_12728271
]

Yonik Seeley commented on SOLR-769:
---

Apologies Brad - I didn't realize there were pending patches or I would have
not done the reformat.


Re: [jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-07 Thread Brad Giaccio

No problem, but now that you are the assignee perhaps you can apply it for
me; it's attached to the ticket as 'clustering-component-shard.patch' and it
includes updated JUnit tests.

If it needs some work now that you have made some output changes I'll be
glad to update.

Thanks,
Brad

On Tue, Jul 7, 2009 at 2:16 PM, Yonik Seeley (JIRA) j...@apache.org wrote:

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12728271#action_12728271]

Yonik Seeley commented on SOLR-769:
---

Apologies Brad - I didn't realize there were pending patches or I would
have not done the reformat.


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-04 Thread Yonik Seeley (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12727247#action_12727247
]

Yonik Seeley commented on SOLR-769:
---

Of course, now that I've removed the clustering libs from the solr.war, the
example no longer works for some reason... looks like all the jars are in
example/clustering/solr/lib, so it's classloading issues I imagine.

On a related note, I'm not sure how useful it is to have a clustering component
with multiple plugins itself... the extra level of plugins seems to just add
more complexity. Different plugins could always share utility classes, perhaps
even base classes, and could strive for a common output format - all without
going to an additional plugin model.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-02 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726482#action_12726482
 ] 

Yonik Seeley commented on SOLR-769:
---

Anyone mind if I reformat the source files that currently use tabs?

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-02 Thread Mark Miller (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726485#action_12726485
]

Mark Miller commented on SOLR-769:
--

bq. Anyone mind if I reformat the source files that currently use tabs?

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-30 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725694#action_12725694
 ] 

Yonik Seeley commented on SOLR-769:
---

Now that I'm looking at some of the code, is there a reason why clustering 
doesn't use a SolrQueryRequest, but instead grabs a searcher directly from the 
core?

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-30 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725702#action_12725702
]

Grant Ingersoll commented on SOLR-769:
--

bq. Now that I'm looking at some of the code, is there a reason why clustering
doesn't use a SolrQueryRequest, but instead grabs a searcher directly from the
core?

Because the clustering engine gets initialized during core initialization and
thus doesn't have a SolrQueryRequest at that time. Is there harm in the way
it's being done? I suppose it adds an extra reference, right, meaning it could
keep a core open longer?

In the case of document clustering, I think it could be a long running job.
It's not clear yet how that should work, but it is something to keep in mind.
I expect to implement that sometime this summer, likely after 1.4.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-30 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725703#action_12725703
]

Grant Ingersoll commented on SOLR-769:
--

Also, some implementations may need lower-level interfaces than Searcher, so it
just seems easier to have core access.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-30 Thread Yonik Seeley (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725730#action_12725730
]

Yonik Seeley commented on SOLR-769:
---

I'm talking about the search results clustering, which is per-request.
RequestHandlers should pretty much always use the core/searcher associated with
the SolrQueryRequest. newSearcher/firstSearcher hooks set this themselves,
hence it's a different searcher than one would get from getSearcher() (and
could possibly even cause a deadlock). Architecturally, there could be any
number of reasons to use a different searcher in the future... the
SolrQueryRequest says which searcher to use.
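
To illustrate the point about warming hooks, here is a toy model in plain Java. Core, Request and Searcher below are hypothetical stand-ins, not the Solr classes of similar name; the only thing the sketch assumes is that a request is bound to one searcher while the core tracks whichever searcher is currently registered.

```java
public class RequestSearcherSketch {

    /** Hypothetical stand-in for an index searcher; it only carries a name. */
    static final class Searcher {
        final String name;
        Searcher(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for the core: tracks the currently registered searcher. */
    static final class Core {
        private Searcher registered;
        Core(Searcher initial) { this.registered = initial; }
        Searcher getRegisteredSearcher() { return registered; }
        void register(Searcher searcher) { this.registered = searcher; }
    }

    /** Hypothetical stand-in for a request: bound to one searcher for its lifetime. */
    static final class Request {
        private final Core core;
        private final Searcher searcher;
        Request(Core core, Searcher searcher) {
            this.core = core;
            this.searcher = searcher;
        }
        Core getCore() { return core; }
        Searcher getSearcher() { return searcher; }
    }

    /**
     * Simulates a warming request that carries a searcher which is not yet registered.
     * Returns { searcher seen via the request, searcher seen via the core }.
     */
    public static String[] warmingScenario() {
        Core core = new Core(new Searcher("old"));
        Searcher warming = new Searcher("new");
        Request warmingRequest = new Request(core, warming);

        // What a per-request component should use.
        String viaRequest = warmingRequest.getSearcher().name;
        // What it would get by going around the request and asking the core.
        String viaCore = warmingRequest.getCore().getRegisteredSearcher().name;

        core.register(warming); // registration only happens after warming finishes
        return new String[] { viaRequest, viaCore };
    }

    public static void main(String[] args) {
        String[] seen = warmingScenario();
        System.out.println("via request: " + seen[0] + ", via core: " + seen[1]);
    }
}
```

In this model a component that asks the core sees a different searcher than the one the request was issued against, which is the mismatch described above. The sketch does not model reference counting or the possible deadlock.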

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-30 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725731#action_12725731
]

Grant Ingersoll commented on SOLR-769:
--

Makes sense, might need to refactor some of the initialization code and the
abstract clustering engine, but no big deal.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-30 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725739#action_12725739
]

Stanislaw Osinski commented on SOLR-769:

bq. Is labels needed because there could be multiple labels per cluster in
the future? (I assume yes)

Correct. Currently neither of Carrot2's algorithms creates clusters with
multiple labels, but it's quite likely that there are other algorithms that can
do that.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-27 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724847#action_12724847
 ] 

Yonik Seeley commented on SOLR-769:
---

The response structure is a bit funny (it's like normal XML, which we don't 
really use in Solr-land), and certainly not optimal for JSON responses:

{code}
"clusters":[
  "cluster",[
    "labels",[
      "label","DDR"],
    "docs",[
      "doc","TWINX2048-3200PRO",
      "doc","VS1GB400C3",
      "doc","VDBDB1A16"]],
  "cluster",[
    "labels",[
      "label","Car Power Adapter"],
    "docs",[
      "doc","F8V7067-APL-KIT",
      "doc","IW-02"]],
  [...]
{code}

Is labels needed because there could be multiple labels per cluster in
the future? (I assume yes)
Do we need more per-doc information than just the id? (I assume no)
Could we want other per-cluster information in the future? (I assume yes)
What other possible information could be added in the future?

Given the assumptions above, clusters, docs, and labels should all be
arrays instead of NamedLists (the names are just repeated, redundant info).
Each of the remaining NamedLists (just each cluster) should be a
SimpleOrderedMap, since access by key is more important than order... that will
give us something along the lines of:

{code}
"clusters" : [
  { "labels" : ["DDR"],
    "docs" : ["TWINX2048-3200PRO","VS1GB400C3","VDBDB1A16"]
  },
  { "labels" : ["Car Power Adapter"],
    "docs" : ["F8V7067-APL-KIT","IW-02"]
  }
]
{code}

Make sense?
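
For what it's worth, the proposed shape can be sketched with plain java.util collections. This is only an illustration: the class and method names are made up, and LinkedHashMap is used here as a stand-in for SimpleOrderedMap so the example runs without Solr on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusterResponseSketch {

    /** Builds one cluster entry: a keyed map holding a label array and a doc-id array. */
    public static Map<String, Object> cluster(List<String> labels, List<String> docIds) {
        // LinkedHashMap stands in for SimpleOrderedMap: access by key, stable order.
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("labels", new ArrayList<>(labels));
        entry.put("docs", new ArrayList<>(docIds));
        return entry;
    }

    /** Builds the top-level "clusters" value as a plain list, one map per cluster. */
    public static List<Map<String, Object>> exampleClusters() {
        List<Map<String, Object>> clusters = new ArrayList<>();
        clusters.add(cluster(Arrays.asList("DDR"),
                Arrays.asList("TWINX2048-3200PRO", "VS1GB400C3", "VDBDB1A16")));
        clusters.add(cluster(Arrays.asList("Car Power Adapter"),
                Arrays.asList("F8V7067-APL-KIT", "IW-02")));
        return clusters;
    }

    public static void main(String[] args) {
        System.out.println(exampleClusters());
    }
}
```

Because each cluster is a keyed map, further per-cluster information could be added later as another key without changing how labels and docs are read.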

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-27 Thread Yonik Seeley (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724854#action_12724854
]

Yonik Seeley commented on SOLR-769:
---

I hit an error trying to cluster some documents I added with solr cell - 400
unknown field Author.
Seems like it would be nice if we could handle unknown field types gracefully?

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-24 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712534#action_12712534
]

Stanislaw Osinski commented on SOLR-769:

In fact, you can set Carrot2 attributes (both init- and request-time) in the
solr config file; this should also work without the patch. Just add:

<str name="Tokenizer.analyzer">fully.qualified.class.Name</str>

to the search component element. See
http://wiki.apache.org/solr/ClusteringComponent for an example. You'll find a
list of Carrot2 attributes, their ids and descriptions at:
http://download.carrot2.org/stable/manual/#chapter.components.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-24 Thread Koji Sekiguchi (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712544#action_12712544
]

Koji Sekiguchi commented on SOLR-769:
-

{quote}
In fact, you can set Carrot2 attributes (both init- and request-time) in the
solr config file, this should work also without the patch. Just add:

<str name="Tokenizer.analyzer">fully.qualified.class.Name</str>
{quote}

Hmm, I thought I needed to pass a Class<?> value (rather than a String) as the
second argument of the attribute. I'll try it.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-24 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712545#action_12712545
]

Stanislaw Osinski commented on SOLR-769:

Ah, I should have mentioned that up front -- Carrot2 will try to convert the
string into the type accepted by the attribute. In the case of class-typed
attributes, it will try to load the class using the current thread's context
classloader. Conversions are also available for numeric, boolean and enum
attributes (see:
http://download.carrot2.org/head/javadoc/org/carrot2/util/attribute/AttributeBinder.AttributeTransformerFromString.html).
Please let me know if that way works for you.
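
As a rough illustration of that kind of conversion, here is a minimal stand-alone sketch. It is not Carrot2's AttributeBinder code; the class and method names are hypothetical and the set of supported types is just the ones mentioned above.

```java
import java.util.concurrent.TimeUnit;

public class AttributeFromStringSketch {

    /** Converts a config string to the type an attribute declares (hypothetical helper). */
    public static Object convert(String value, Class<?> targetType) {
        if (targetType == String.class) {
            return value;
        }
        if (targetType == Class.class) {
            return loadClass(value);
        }
        if (targetType == Integer.class || targetType == int.class) {
            return Integer.valueOf(value);
        }
        if (targetType == Double.class || targetType == double.class) {
            return Double.valueOf(value);
        }
        if (targetType == Boolean.class || targetType == boolean.class) {
            return Boolean.valueOf(value);
        }
        if (targetType.isEnum()) {
            return enumConstant(targetType, value);
        }
        throw new IllegalArgumentException("No conversion for " + targetType.getName());
    }

    /** Class-typed attributes: resolve the name via the thread's context classloader. */
    private static Class<?> loadClass(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fallback for threads without a context classloader (an assumption of this sketch).
            loader = AttributeFromStringSketch.class.getClassLoader();
        }
        try {
            return Class.forName(className, true, loader);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Cannot load class " + className, e);
        }
    }

    /** Enum attributes: look the constant up by its exact name. */
    @SuppressWarnings({"unchecked", "rawtypes"})
    private static Object enumConstant(Class<?> enumType, String name) {
        return Enum.valueOf((Class) enumType, name);
    }

    public static void main(String[] args) {
        System.out.println(convert("java.util.ArrayList", Class.class));
        System.out.println(convert("42", Integer.class));
        System.out.println(convert("SECONDS", TimeUnit.class));
    }
}
```

The practical consequence for the config file is that the class named in the str element has to be visible to whichever classloader does the lookup.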

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-24 Thread Koji Sekiguchi (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712557#action_12712557
]

Koji Sekiguchi commented on SOLR-769:
-

{code}
<str name="Tokenizer.analyzer">fully.qualified.class.Name</str>
{code}

This works as expected w/o my patch. Thank you, Stanislaw!

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-23 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712421#action_12712421
]

Stanislaw Osinski commented on SOLR-769:

Pasting the comment I made on the list:

The catch with analyzer is that this specific attribute is an
initialization-time attribute, so you need to add it to the {{initAttributes}}
map in the {{init()}} method of {{CarrotClusteringEngine}}.

Please let me know if this solves the problem. If not, I'll investigate further.
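
A toy model of the init-time versus request-time distinction may help here. The Engine class below is hypothetical and is not CarrotClusteringEngine; in particular, silently ignoring an init-only key that arrives with a request is an assumption of this sketch, not a statement about how Carrot2 behaves.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class InitAttributesSketch {

    /** Hypothetical engine: init-time attributes are fixed once init() has run. */
    public static final class Engine {
        private final Set<String> initOnlyKeys;
        private final Map<String, Object> initAttributes = new HashMap<>();

        public Engine(Set<String> initOnlyKeys) {
            this.initOnlyKeys = new HashSet<>(initOnlyKeys);
        }

        /** Called once at startup; init-time attributes have to be supplied here. */
        public void init(Map<String, ?> attributes) {
            initAttributes.putAll(attributes);
        }

        /** Attributes in effect for one request: init values plus request-time values. */
        public Map<String, Object> effectiveAttributes(Map<String, ?> requestAttributes) {
            Map<String, Object> effective = new HashMap<>(initAttributes);
            for (Map.Entry<String, ?> entry : requestAttributes.entrySet()) {
                // Assumption of this sketch: init-only keys passed per request are ignored.
                if (!initOnlyKeys.contains(entry.getKey())) {
                    effective.put(entry.getKey(), entry.getValue());
                }
            }
            return effective;
        }
    }

    public static void main(String[] args) {
        Engine engine = new Engine(Collections.singleton("Tokenizer.analyzer"));
        Map<String, Object> initAttributes = new HashMap<>();
        initAttributes.put("Tokenizer.analyzer", "fully.qualified.class.Name");
        engine.init(initAttributes);
        System.out.println(engine.effectiveAttributes(new HashMap<String, Object>()));
    }
}
```

In this model the analyzer only takes effect when it is part of the map handed to init(), which mirrors the advice above.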

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4

Attachments: clustering-componet-shard.patch, clustering-libs.tar,
clustering-libs.tar, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.tar, SOLR-769.zip


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-23 Thread Koji Sekiguchi (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712442#action_12712442
]

Koji Sekiguchi commented on SOLR-769:
-

bq. The catch with analyzer is that this specific attribute is an
initialization-time attribute, so you need to add it to the initAttributes map
in the init() method of CarrotClusteringEngine.

This solves the problem. Thank you!

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-20 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12711137#action_12711137
]

Grant Ingersoll commented on SOLR-769:
--

Committed revision 776692.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4

Attachments: clustering-libs.tar, clustering-libs.tar,
SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar,
SOLR-769.zip


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-16 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710087#action_12710087
]

Stanislaw Osinski commented on SOLR-769:

Thanks Grant! Looking forward to seeing the code in the repo!

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-16 Thread Allahbaksh Mohammedali (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710093#action_12710093
]

Allahbaksh Mohammedali commented on SOLR-769:
-

Hi Grant,
I am keenly looking forward to seeing this feature. I want to see it in action as
soon as possible. When will the code be committed to the repo?

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-04-19 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12700628#action_12700628
]

Grant Ingersoll commented on SOLR-769:
--

Where can we download nni.jar from?

Seems like if you only need two classes it would be easy enough to replace them
with your own code.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-04-16 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699908#action_12699908
]

Grant Ingersoll commented on SOLR-769:
--

Looks like we need to make the NNI JAR be a download, too, right? It appears
to be LGPL. Where does that library come from, anyway? I don't see it on
Carrot trunk, but it is in the zip. And a search for it doesn't reveal much.

-Grant

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-04-03 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695401#action_12695401
]

Grant Ingersoll commented on SOLR-769:
--

Hi Stanislaw,

I'm going to commit soon and I was wondering if Carrot2 has a handy place where
they keep all the licenses and notices so that I can fill out Solr's NOTICE.txt
and LICENSE.txt. If not, I will go collate them.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-04-03 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695463#action_12695463
]

Stanislaw Osinski commented on SOLR-769:

Hi Grant,

If you download http://download.carrot2.org/stable/carrot2-java-api-3.0.1.zip,
you'll find licenses in the lib/ folder of the distribution. That distribution
contains slightly more JARs than needed for Solr (which uses carrot2-mini.jar),
so you'd need to pick only those that are relevant.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-03-22 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688132#action_12688132
]

Grant Ingersoll commented on SOLR-769:
--

bq. Highlighting:

Hmm, that's an interesting thought. We could check to see if highlighting is
done first.

Also, you say C2 can handle full docs. Is it feasible, then, to implement it for
the offline mode I have in mind, whereby you cluster the whole collection
offline and then store the clusters for retrieval? I haven't implemented this
yet, but was thinking some people will be interested in full corpus clustering.
The nice thing, then, is that as new documents come in, they can be added to
existing clusters (and maybe periodically, we re-cluster). Just thinking
out loud.

Rest of the stuff in that comment sounds good. I will try out the patch.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-03-22 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688141#action_12688141
]

Grant Ingersoll commented on SOLR-769:
--

Should the MockClusteringAlgorithm be under the test source tree and not the
main one? I moved it in the patch to follow.

I don't think we need to output the number of clusters, since that will be
obvious from the list size. I dropped it in the patch to follow.

Also, on the response structure, we certainly could make it optional, although
it means having to go do a lookup in the real doc list, which could be less
than fun.

Patch to follow

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-03-22 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688171#action_12688171
]

Stanislaw Osinski commented on SOLR-769:

bq. Also, you say C2 can handle full docs. Is it feasible, then, to implement it
for the offline mode I have in mind, whereby you cluster the whole collection
offline and then store the clusters for retrieval? I haven't implemented this
yet, but was thinking some people will be interested in full corpus clustering.
The nice thing, then, is that as new documents come in, they can be added to
existing clusters (and maybe periodically, we re-cluster). Just thinking
out loud.

We have two variables here: the length of docs and the number of docs. Carrot2
is suitable for small numbers of docs (up to, say, 1000). If the docs are short
(a paragraph or so), the clustering should be pretty fast and suitable for
on-line processing (see: http://project.carrot2.org/algorithms.html). If the
documents get longer, Carrot2 will still handle them but will require more time
for processing; I'll try to do some measurements. But C2 is not useful for the
whole-collection case -- it performs all processing in memory, and here we'd
need a totally different class of algorithm, something along the lines of
Mahout developments.

bq. Hmm, that's an interesting thought. We could check to see if highlighting
is done first.

To quickly summarise the pros and cons of relying on highlighting being done
outside of the clustering component:

Pros:

* we avoid duplication of processing (highlighting being done twice)
* simpler code of the clustering component, less configuration

Cons:

* if someone doesn't want highlighting in the search results, the clustering is
likely to take more time (because it operates on full documents, and it's
controlled globally)
* depending on the highlighter, we may get some markup in the summaries, which
may affect clustering (I'd need to check how Carrot2 handles that)
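
On the markup concern: one option, if it turns out the clustering side does not deal with it, would be to strip the tags from the snippets before clustering. A minimal sketch, assuming the highlighter wraps matches in simple tags such as `<em>...</em>` and escapes entities; the function name is made up and nothing like this is in the patch:

```python
import html
import re

# Matches any simple tag such as <em> or </em>.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_highlight_markup(snippet):
    """Remove highlighter tags and decode HTML entities in a snippet."""
    return html.unescape(TAG_PATTERN.sub("", snippet))
```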

bq. Should the MockClusteringAlgorithm be under the test source tree and not
the main one? I moved it in the patch to follow

Absolutely, it should be in the test source.

bq. I don't think we need to output the number of clusters, since that will be
obvious from the list size. I dropped it in the patch to follow

Makes sense, I kept it because the original version had it.

bq. Also, on the response structure, we certainly could make it optional,
although it means having to go do a lookup in the real doc list, which could be
less than fun.

By lookup you mean the lookup in the XML response? Here again we have a
trade-off between the length of the response and ease of processing: if we
repeat document titles / snippets in the clusters structure, we at least double
the response size (at least, because the same document may belong to many
clusters), but can potentially save some lookups. But if we want to get some
other fields of a document (other than those we repeat in the clusters list),
we'd still need a lookup.

To sum up, my intuition would be to avoid duplication and stick with document
ids in the cluster list (this is what we do in Carrot2 XMLs as well).
Optionally, the clustering component could have a list of configurable fields
to be repeated in the cluster list if that's really helpful in real-world use
cases.
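
For what it's worth, the client-side lookup is cheap once the doc list is indexed by id. A sketch under the assumption that the client has already parsed the response into plain dicts; the function names, the `id` field name and the sample values are all made up:

```python
def index_docs_by_id(docs, id_field="id"):
    """Build an id -> document map from the regular doc list."""
    return {doc[id_field]: doc for doc in docs}


def expand_cluster(cluster, docs_by_id, fields=()):
    """Replace each doc id in a cluster with the id plus the requested fields."""
    expanded = []
    for doc_id in cluster["docs"]:
        doc = docs_by_id.get(doc_id, {})
        entry = {"id": doc_id}
        entry.update({f: doc[f] for f in fields if f in doc})
        expanded.append(entry)
    return {"labels": cluster["labels"], "docs": expanded}


# Made-up sample data: two documents and one cluster that references them by id.
sample_docs = [
    {"id": "6H500F0", "name": "drive A"},
    {"id": "SP2514N", "name": "drive B"},
]
sample_cluster = {"labels": ["Hard Drive"], "docs": ["6H500F0", "SP2514N"]}
```

The `fields` argument plays the role of the optional "configurable fields to be repeated", just applied on the client instead of in the component.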

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-03-11 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680942#action_12680942
]

Stanislaw Osinski commented on SOLR-769:

Hi All,

I've just uploaded a patch that passes unit tests and has a working example, but
this is by no means a final version. A few outstanding questions / issues:

# h4. Response structure.

I was wondering -- do we need to repeat the document contents in the 'clusters'
response section? Assuming that each document in the index has a unique ID, we
could reduce the size of the response by just referencing documents by IDs like
this:
\\
{code}
<lst name="clusters">
  <int name="numClusters">3</int>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">GPU VPU Clocked</str>
    </lst>
    <lst name="docs">
      <str name="doc">EN7800GTX/2DHTV/256M</str>
      <str name="doc">100-435805</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Hard Drive</str>
    </lst>
    <lst name="docs">
      <str name="doc">6H500F0</str>
      <str name="doc">SP2514N</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Other Topics</str>
    </lst>
    <lst name="docs">
      <str name="doc">9885A004</str>
    </lst>
  </lst>
</lst>
{code}
Actually, this is what I've implemented in the patch.
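
To give an idea of what consuming this structure looks like on the client, here is a small sketch using Python's standard `xml.etree.ElementTree`. The `parse_clusters` helper and the trimmed sample are only illustrative; the element and attribute names follow the response above:

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the same shape as the response above.
SAMPLE = """
<lst name="clusters">
  <lst name="cluster">
    <lst name="labels"><str name="label">Hard Drive</str></lst>
    <lst name="docs"><str name="doc">6H500F0</str><str name="doc">SP2514N</str></lst>
  </lst>
  <lst name="cluster">
    <lst name="labels"><str name="label">Other Topics</str></lst>
    <lst name="docs"><str name="doc">9885A004</str></lst>
  </lst>
</lst>
"""


def parse_clusters(xml_text):
    """Return a list of {'labels': [...], 'docs': [...]} dicts, one per cluster."""
    root = ET.fromstring(xml_text.strip())
    clusters = []
    for cluster in root.findall("lst[@name='cluster']"):
        labels = [e.text for e in cluster.findall("lst[@name='labels']/str[@name='label']")]
        doc_ids = [e.text for e in cluster.findall("lst[@name='docs']/str[@name='doc']")]
        clusters.append({"labels": labels, "docs": doc_ids})
    return clusters
```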

Also, in case of hierarchical clusters I've introduced a grouping entity called
clusters so that the top- and sub-levels of the response are consistent (see
unit tests). Please let me know if this makes sense.

# h4. Build: compile warnings about missing SimpleXML

SimpleXML is one of the problematic dependencies as it's GPL. Luckily, it's not
needed at runtime, but generates warnings about missing dependencies during
compile time. So the option is either to live with the warnings or to add
SimpleXML (version 1.7.2) to get rid of the warnings.

# h4. Build: copying of protowords.txt etc

The patch includes lexical files both in the
contrib/clustering/src/java/test/resources/ and in the examples dir. I'm
not sure how this is handled though -- do you keep copies in the repository or
copy those somehow in the build?

# h4. Highlighting

This is the bit I've not yet fully analyzed. In general, Carrot2 should handle
full documents fairly well (up to say a few hundred kB each); it's just the
number of documents that must be on the order of hundreds. Therefore,
highlighting is not mandatory, but it may sometimes improve the quality of
clusters.

I was wondering, if highlighting is performed earlier in the Solr pipeline,
could this be reused during clustering? One possible approach could be that
clustering uses whatever is fed from the pipeline: if highlighting is enabled,
clustering will be performed on the highlighted content, if there was no
highlighting, we'd cluster full documents. Not sure if that's reasonable /
possible to implement though.
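
In pseudo-ish Python, the "use whatever is fed from the pipeline" idea might look like the sketch below. It assumes the highlighting results are available as a doc id -> field -> list of snippets map; the function name, the `text` field and the snippet separator are made up, and this is not what the patch currently does:

```python
def clustering_input(doc_id, stored_fields, highlighting=None, field="text"):
    """Prefer highlighted snippets for the field; fall back to the full stored value."""
    snippets = (highlighting or {}).get(doc_id, {}).get(field)
    if snippets:
        # Highlighting was done earlier in the pipeline: reuse it.
        return " ... ".join(snippets)
    # No highlighting for this document: cluster the full content.
    return stored_fields.get(field, "")
```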

# h4. Documentation (wiki) updates

Once we stabilise the ideas, I'm happy to update the wiki with regard to the
algorithms used (Lingo/STC) and passing additional parameters.

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Attachments: clustering-libs.tar, clustering-libs.tar,
SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch

[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-02-10 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12672281#action_12672281
]

Stanislaw Osinski commented on SOLR-769:

Hi Grant,

I've added a Carrot2 issue referring to point 3 on your TODO list:
http://issues.carrot2.org/browse/CARROT-457. I'll be looking into this over the
weekend.

Staszek

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Attachments: clustering-libs.tar, clustering-libs.tar,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-23 Thread Bruce Ritchie (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12642179#action_12642179
]

Bruce Ritchie commented on SOLR-769:

Grant,

This patch looks very promising, I can't wait to give it a try and find a way
to incorporate it into a project I'm working on (when it's ready of course ...
likely not till after Carrot2 3 is released though)

Can you give a quick estimate as to the performance impact of enabling
clustering in search results mode? In the example @
http://wiki.apache.org/solr/ClusteringFullResultsExample the query time seems
pretty high and I was wondering if that was a result of this patch or something
else?

Thanks,

Bruce Ritchie

Support Document and Search Result clustering
-

Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Attachments: clustering-libs.tar, clustering-libs.tar,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-23 Thread Stanislaw Osinski (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12642182#action_12642182
]

Stanislaw Osinski commented on SOLR-769:

Bruce,

For performance of the clustering algorithm alone, please take a look at:
http://project.carrot2.org/algorithms.html
Obviously, you'd need to add the overhead of fetching the snippets / documents
from the index. I'm not sure how many are fetched or whether they come from
Solr's cache, so I can't say whether clustering or fetching time dominates.
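
One way to find out would be to time the two phases separately. A generic sketch; `fetch_docs` and `cluster_docs` are stand-ins for whatever actually fetches the documents and runs the clustering, not functions from the patch:

```python
import time


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def profile_request(fetch_docs, cluster_docs, doc_ids):
    """Measure document fetching and clustering as two separate phases."""
    docs, fetch_seconds = timed(fetch_docs, doc_ids)
    clusters, cluster_seconds = timed(cluster_docs, docs)
    return {
        "fetch_seconds": fetch_seconds,
        "cluster_seconds": cluster_seconds,
        "clusters": clusters,
    }
```

Running it on both a cold and a warmed searcher would also show how much of the fetch time is down to caching.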

Cheers,

Staszek


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-23 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12642187#action_12642187
]

Grant Ingersoll commented on SOLR-769:
--

Hi Bruce,

I haven't done any perf. testing, as I've been focused on functionality first.
However, I'm not sure whether that query was the first one run, or not, so I
don't know the status of the searcher, etc. I'm pretty sure I don't have any
warming queries, etc.


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-22 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641823#action_12641823
]

Grant Ingersoll commented on SOLR-769:
--

{quote}So what would be the procedure to add some clustering code beyond carrot
or other available libraries.
{quote}

Essentially, you need to implement either a SearchClusteringEngine or a
DocumentClusteringEngine and then declare it in the SearchComponent
configuration, as is done with the Carrot2 example here:
{code}
<lst name="engine">
  <!-- The name, only one can be named default -->
  <str name="name">default</str>
  <!-- Carrot2 specific parameters. See the Carrot2 site for details on
       setting. -->
  <!-- carrot.algorithm: Optional. Currently only
       lingo is supported pending the release of Carrot2 3.0.
  -->
  <str name="carrot.algorithm">lingo</str>
  <!-- Lingo specific -->
  <float name="carrot.lingo.threshold.clusterAssignment">0.150</float>
  <float name="carrot.lingo.threshold.candidateClusterThreshold">0.775</float>
</lst>
{code}
or, in the mock setup:
{code}
<lst name="engine">
  <!-- The name, only one can be named default -->
  <str name="name">docEngine</str>
  <str name="classname">org.apache.solr.handler.clustering.MockDocumentClusteringEngine</str>
</lst>
{code}

If you don't declare the classname value, then it assumes the Carrot
implementation.
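
To make the defaulting rule concrete, here is a small illustration of how an engine entry could be read, written in Python purely for brevity (the component itself is Java). The `read_engine` helper and the `DEFAULT_ENGINE_CLASS` placeholder are made up; only the element and parameter names come from the snippets above:

```python
import xml.etree.ElementTree as ET

# Placeholder only: stands in for whatever the Carrot-based engine class is called.
DEFAULT_ENGINE_CLASS = "CarrotClusteringEngine"


def read_engine(xml_text):
    """Read one <lst name="engine"> entry; fall back to the Carrot engine if no classname is given."""
    root = ET.fromstring(xml_text.strip())
    # XML comments are dropped by the default parser, so only real parameters remain.
    params = {child.get("name"): child.text for child in root}
    return {
        "name": params.pop("name", None),
        "classname": params.pop("classname", DEFAULT_ENGINE_CLASS),
        "params": params,  # whatever is left is passed through as engine-specific settings
    }


CARROT_ENGINE = """
<lst name="engine">
  <!-- The name, only one can be named default -->
  <str name="name">default</str>
  <str name="carrot.algorithm">lingo</str>
</lst>
"""

MOCK_ENGINE = """
<lst name="engine">
  <str name="name">docEngine</str>
  <str name="classname">org.apache.solr.handler.clustering.MockDocumentClusteringEngine</str>
</lst>
"""
```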

Naturally, you need to take care of all the libraries being available to Solr,
etc. just as you would for any plugin.

Since you are interested in clustering, Vaijanath, it would be good to get your
feedback on the APIs. Are you doing full document clustering or just search
snippet clustering? Also, if you are using an open source clustering library
that has acceptable licensing terms (i.e. not GPL or similar), perhaps consider
contributing an implementation of the engine and then we can make it available
to everyone.


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-22 Thread Vaijanath N. Rao (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641832#action_12641832
]

Vaijanath N. Rao commented on SOLR-769:
---

Hi Grant,

Till now I have worked mostly with full document clustering and had never
thought of search snippet clustering. I will definitely pitch in on the
clustering library. There are many libraries with favourable/acceptable
licensing terms that can be added to Solr.

--Thanks and Regards
Vaijanath


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-15 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12639835#action_12639835
]

Grant Ingersoll commented on SOLR-769:
--

Note, also, that even though I put in support for some of the other C2
(Carrot2) algorithms, I don't think they quite work yet. I think they require
passing in more parameters to set some algorithm properties (for instance, for
Fuzzy Ants, I think you need to set a depth) and I haven't figured those out
yet. If you have C2 experience, insight would be appreciated.

For now, stick to Lingo.


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-12 Thread Andrzej Bialecki (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638875#action_12638875
]

Andrzej Bialecki commented on SOLR-769:

FYI, Carrot2 does support a handful of different clustering algorithms (the
ones I know of are Fuzzy Ants, KMeans and Suffix Tree, in addition to Lingo).


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-12 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638924#action_12638924
]

Grant Ingersoll commented on SOLR-769:
--

Yeah, I probably will include the other jars and make it easy to include them.
For now, I wanted to get something basic working for a talk I'm giving on
Wednesday night ;-)


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-11 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638791#action_12638791
]

Grant Ingersoll commented on SOLR-769:
--

Patch soon, as a start. I'm going to check in the basic directory structure
and libs, and then provide a patch with the source that we can iterate on.


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-11 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638814#action_12638814
]

Grant Ingersoll commented on SOLR-769:
--

Still to do: more testing, getting feedback, and implementing the basics of
doc. clustering. This last piece will take some more design work. I also need
to validate further that the results make sense for search results clustering,
but my first look suggests they do.


[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-10 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638637#action_12638637
]

Grant Ingersoll commented on SOLR-769:
--

Starting docs at http://wiki.apache.org/solr/ClusteringComponent

