Re: Number of clustering labels to show

2015-05-29 Thread Stanislaw Osinski
Hi,

The number of clusters primarily depends on the parameters of the specific
clustering algorithm. If you're using the default Lingo algorithm, the
number of clusters is governed by
the LingoClusteringAlgorithm.desiredClusterCountBase parameter. Take a look
at the documentation (
https://cwiki.apache.org/confluence/display/solr/Result+Clustering#ResultClustering-TweakingAlgorithmSettings)
for some more details (the "Tweaking at Query-Time" section shows how to
pass the specific parameters at request time). A complete overview of the
Lingo clustering algorithm parameters is here:
http://doc.carrot2.org/#section.component.lingo.
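For example, a request along these lines (the host, port and handler path here
are just the ones from the example configuration) should produce more clusters:

  http://localhost:8983/solr/clustering?q=*:*&rows=100&LingoClusteringAlgorithm.desiredClusterCountBase=30

Note that carrot.fragSize only controls the length of the text fragments passed
to the algorithm when carrot.produceSummary is enabled; it does not control the
number of clusters.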

Stanislaw

--
Stanislaw Osinski, stanislaw.osin...@carrotsearch.com
http://carrotsearch.com

On Fri, May 29, 2015 at 4:29 AM, Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> I'm trying to increase the number of cluster results shown during the
> search. I tried to set carrot.fragSize=20, but only 15 cluster labels are
> shown. Even when I tried to set carrot.fragSize=5, 15 labels were still
> shown.
>
> Is this the correct way to do this? I understand that setting it to 20
> might not necessarily mean 20 labels will be shown, as the setting is a
> maximum. But when I set this to 5, shouldn't it reduce the number of
> labels to 5?
>
> I'm using Solr 5.1.
>
>
> Regards,
> Edwin
>


Re: Parsing cluster result's docs

2015-03-09 Thread Stanislaw Osinski
Hi,


> I have a Solr instance using the clustering component (with the Lingo
> algorithm) working perfectly. However when I get back the cluster results
> only the ID's of these come back with it. What is the easiest way to
> retrieve full documents instead? Should I parse these IDs into a new query
> to Solr, or is there some configuration I am missing to return full docs
> instead of IDs?
>
> If it matters, I am using Solr 4.10.
>

Clustering results are attached to the regular Solr response (which also
contains the text of the documents), as shown in the docs:
https://cwiki.apache.org/confluence/display/solr/Result+Clustering, so with
the default configuration you should be getting both clusters and document
content. If that's not the case, please post your solrconfig.xml and the
URL you're using to initiate the search/clustering.
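For reference, with the default configuration a (heavily trimmed) JSON response
looks roughly like the sketch below -- the full documents are in response/docs,
and the clusters refer to them by their unique key (field names here are
illustrative):

  {
    "response": { "numFound": 2, "docs": [
      { "id": "1", "title": "...", "content": "..." },
      { "id": "2", "title": "...", "content": "..." } ] },
    "clusters": [
      { "labels": ["Example Label"], "docs": ["1", "2"] } ]
  }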

Staszek


Re: Is it possible to cluster on search results but return only clusters?

2014-05-06 Thread Stanislaw Osinski
Hi Sebastián,

Looking quickly through the code of the clustering component, there's
currently no way to output only clusters. Let me see if this can be easily
implemented.

Stanislaw

--
Stanislaw Osinski, stanislaw.osin...@carrotsearch.com
http://carrotsearch.com


On Tue, May 6, 2014 at 6:48 PM, Paul Libbrecht  wrote:

> put rows to zero?
> Exploit the facets as "clusters" ?
>
> paul
>
>
> On 6 May 2014 at 16:42, Sebastián Ramírez
> wrote:
>
> > I have this query / URL
> >
> >
> http://example.com:8983/solr/collection1/clustering?q=%28title:%22+Atlantis%22+~100+OR+content:%22+Atlantis%22+~100%29&rows=3001&carrot.snippet=content&carrot.title=title&wt=xml&indent=true&sort=date+DESC&;
> >
> > With that, I get the results and also the clustering of those results.
> What
> > I want is just the clusters of the results, not the results, because
> > returning the results is consuming too much bandwidth.
> >
> > I know I can write a "proxy" script that gets the response from Solr and
> > then filters out the results and returns the clusters, but I first wanna
> > check if it's possible with just the parameters of Solr or Carrot.
> >
> > Thanks in advance,
> >
> >
> > *Sebastián Ramírez*
> > Diseñador de Algoritmos
> >
> > <http://www.senseta.com>
> > 
> > Tel: (+571) 795 7950 ext: 1012
> > Cel: (+57) 300 370 77 10
> > Calle 99 No. 14 - 76 Piso 5
> > Email: sebastian.rami...@senseta.com
> > www.senseta.com
> >
>
>


Re: [Clustering] Full-Index Offline cluster

2014-03-11 Thread Stanislaw Osinski
> Thank you Ahmet, Staszek and Tommaso ;)
> so the only way to obtain offline clustering is to move to a custom
> implementation!
> I will take a look at the interface of the API (if you can give me a link
> to the class, it will be appreciated; if not, I will find it by myself).
>

The API stub is
the org.apache.solr.handler.clustering.DocumentClusteringEngine class in
contrib/clustering. The API has not yet been implemented, so you may want
to tune the API to suit the way you'd like to arrange your full-index
clustering code.

S.


Re: [Clustering] Full-Index Offline cluster

2014-03-10 Thread Stanislaw Osinski
>
> That's weird. As far as I know there is no such thing. There is
> classification stuff but I haven't heard of clustering.
>
> http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html


I think the wording on the wiki page needs some clarification -- Solr
contains an internal API interface for full index clustering, but the
interface is not yet implemented, so the only clustering mode available out
of the box is currently search results clustering (based on the Carrot2
library).

Staszek


Re: solrconfig.xml carrot2 params

2013-10-21 Thread Stanislaw Osinski
> Thanks, I'm new to the clustering libraries.  I finally made this
> connection when I started browsing through the carrot2 source.  I had
> pulled down a smaller MM document collection from our test environment.  It
> was not ideal as it was mostly structured, but small.  I foolishly thought
> I could cluster on the text copy field before realizing that it was index
> only.  Doh!
>

That is correct -- for the time being the clustering can only be applied to
stored Solr fields.



> Our documents are indexed in SolrCloud, but stored in HBase.  I want to
> allow users to page through Solr hits, but would like to cluster on all (or
> at least several thousand) of the top search hits.  Now I'm puzzling over
> how to efficiently cluster over possibly several thousand Solr hits when
> the documents are in HBase.  I thought an HBase coprocessor, but carrot2
> isn't designed for distributed computation.  Mahout, in the Hadoop M/R
> context, seems slow and heavy handed for this scale; maybe, I just need to
> dig deeper into their library.  Or I could just be missing something
> fundamental?  :)
>

Carrot2 algorithms were not designed to be distributed, but you can still
use them in a single-threaded scenario. To do this, you'd probably need to
write a bit of code that gets the text of your documents from your HBase
and runs Carrot2 clustering on it. If you use the STC clustering algorithm,
you should be able to process several thousand documents in a
reasonable time (order of seconds). The clustering side of the code should
be a matter of a few lines of code (
http://download.carrot2.org/stable/javadoc/overview-summary.html#clustering-documents).
The tricky bit of the setup may be efficiently getting the text for
clustering -- it can happen that fetching can take longer than the actual
clustering.
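A rough sketch of the clustering side, assuming you have already fetched the
titles and contents from HBase (the class below is not part of Solr or Carrot2,
just an illustration of the Carrot2 Java API):

  import java.util.ArrayList;
  import java.util.List;

  import org.carrot2.clustering.stc.STCClusteringAlgorithm;
  import org.carrot2.core.Cluster;
  import org.carrot2.core.Controller;
  import org.carrot2.core.ControllerFactory;
  import org.carrot2.core.Document;
  import org.carrot2.core.ProcessingResult;

  public class HBaseClusteringSketch {
    // One controller can be reused for the lifetime of the application.
    private static final Controller CONTROLLER = ControllerFactory.createSimple();

    /** titles.get(i) / contents.get(i) would come from your HBase fetch. */
    public static void clusterAndPrint(List<String> titles, List<String> contents) {
      final List<Document> documents = new ArrayList<Document>();
      for (int i = 0; i < titles.size(); i++) {
        documents.add(new Document(titles.get(i), contents.get(i)));
      }

      // Run STC; the second argument is an optional query string, not needed here.
      final ProcessingResult result =
          CONTROLLER.process(documents, null, STCClusteringAlgorithm.class);

      for (Cluster cluster : result.getClusters()) {
        System.out.println(cluster.getLabel() + " ("
            + cluster.getAllDocuments().size() + " documents)");
      }
    }
  }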

S.


Re: solrconfig.xml carrot2 params

2013-10-18 Thread Stanislaw Osinski
Hi,

Out of curiosity -- what would you like to achieve by changing
Tokenizer.documentFields?
If you want to have clustering applied to more than one document field, you
can provide a comma-separated list of fields in the carrot.title and/or
carrot.snippet parameters.
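For example, something like this in the engine's defaults (field names are just
placeholders for your own schema):

  <str name="carrot.title">title</str>
  <str name="carrot.snippet">summary,body</str>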

Thanks,

Staszek

--
Stanislaw Osinski, stanislaw.osin...@carrotsearch.com
http://carrotsearch.com


On Thu, Oct 17, 2013 at 11:49 PM, youknow...@heroicefforts.net <
youknow...@heroicefforts.net> wrote:

> Would someone help me out with the syntax for setting
> Tokenizer.documentFields in the ClusteringComponent engine definition in
> solrconfig.xml?  Carrot2 is expecting a Collection of Strings.  There's no
> schema definition for this XML file and a big TODO on the Wiki wrt init
> params.  Every permutation I have tried results in an error stating:
>  Cannot set java.util.Collection field ... to java.lang.String.
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.


Re: News clustering

2012-12-03 Thread Stanislaw Osinski
> I mean measuring the similarity between the documents in each cluster.
> Also, the difference between documents in one cluster and in another cluster.
>
> I saw the sample code ClusteringQualityBenchmark.java.
> However, I do not know how to make use of it for assessing my Solr
> clustering performance.
>

You'd need to write your own code for this; here's an overview of the most
common clustering quality measures of the kind you mentioned:

http://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_of_clustering_results

These are meant for the general case (numeric attributes); to apply them to
texts, you'd need to use the vector representation of the documents.
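For instance, representing each document as a tf-idf vector d, the similarity
between two documents within a cluster is typically taken as the cosine of the
angle between their vectors:

  sim(d_i, d_j) = (d_i . d_j) / (||d_i|| * ||d_j||)

and the within-cluster similarity is then the average of sim over all pairs of
documents in the cluster.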

On a more general note, synthetic measures only test the document-cluster
assignments; none of them take the quality of the labels into account (which is
really hard to measure objectively).

Staszek


Re: News clustering

2012-12-03 Thread Stanislaw Osinski
> Was the picture generated using the Lingo 3G algorithm?
> I saw some sub-clusters inside it.
> Nice pic :)
>

That is correct.


I am interested to learn it.
> How long is the Lingo 3G trial period?
>

I'll send you the details in a private e-mail in a second.



> Is there any way to programmatically measure the performance of Carrot2
> clustering algorithm?
>

I'm not sure what you mean by performance. Measuring clustering time is
pretty straightforward; measuring the quality of clusters is not -- a lot
depends on your specific data and application.

Staszek


Re: News clustering

2012-12-03 Thread Stanislaw Osinski
One of our clients uses Solr's search results clustering for grouping news.
Instead of the default Carrot2 algorithm that ships with Solr they use a
commercial one, but Carrot2 should give you decent clusters too. Here's an
example clustering result:

http://imagebin.org/238001

Staszek

--
Stanislaw Osinski
http://carrotsearch.com

On Fri, Nov 30, 2012 at 4:44 PM, Jorge Luis Betancourt Gonzalez <
jlbetanco...@uci.cu> wrote:

> Hi all:
>
> I'm thinking of using Nutch combined with Solr to index some news sites in
> an intranet, and I was wondering how effective the clustering component
> would be for clustering the search results. Any success stories of using
> the Solr clustering component for news clustering? Any existing solution
> for clustering/classification at index time?
>
> Greetings!
>
> http://www.uci.cu
> http://www.facebook.com/universidad.uci
> http://www.flickr.com/photos/universidad_uci
>


Re: document clustering or tagging

2012-11-18 Thread Stanislaw Osinski
Stanislaw Osinski, stanislaw.osin...@carrotsearch.com
http://carrotsearch.com

I have a very huge Solr index. I want to tag all documents with terms that
> better represent each document, like this:
> <
> http://search.carrotsearch.com/carrot2-webapp/search?source=web&view=folders&skin=fancy-compact&query=rugby+in+london&results=100&algorithm=lingo3g&EToolsDocumentSource.country=ALL&EToolsDocumentSource.language=ALL&EToolsDocumentSource.safeSearch=false
> >
> . Does this type of clustering also come under document tagging?
>

No, this type of clustering will not solve your problem because it's suited
for small/medium collections of documents (search results) rather than the
whole index. For your specific problem I'd recommend some keyword /
keyphrase extractor, which would generate tags for each document separately.
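As a very rough illustration of the idea (real keyphrase extractors do a lot
more -- candidate phrase detection, part-of-speech filtering, etc.), a toy
tf-idf based tagger could look like the sketch below; all names here are made
up:

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class TfIdfTagger {
    /**
     * Picks the top-scoring terms of one document as its "tags".
     * docFrequency maps each term to the number of index documents containing it.
     */
    public static List<String> topTerms(List<String> docTokens,
        Map<String, Integer> docFrequency, int totalDocs, int howMany) {
      // Term frequency within this document.
      final Map<String, Integer> tf = new HashMap<String, Integer>();
      for (String t : docTokens) {
        tf.merge(t, 1, Integer::sum);
      }
      // Score each term by tf * idf.
      final Map<String, Double> score = new HashMap<String, Double>();
      for (Map.Entry<String, Integer> e : tf.entrySet()) {
        final int df = docFrequency.getOrDefault(e.getKey(), 1);
        score.put(e.getKey(), e.getValue() * Math.log((double) totalDocs / df));
      }
      // Sort terms by descending score and return the top few.
      final List<String> terms = new ArrayList<String>(score.keySet());
      terms.sort((a, b) -> Double.compare(score.get(b), score.get(a)));
      return terms.subList(0, Math.min(howMany, terms.size()));
    }
  }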

Staszek


Re: Carrot2 using rawtext of field for clustering

2012-06-08 Thread Stanislaw Osinski
>
> Is there any workaround in Solr/Carrot2 so that we could pass tokens that
> have been filtered with custom tokenizers/filters instead of the raw text
> that it currently uses for clustering?
>
> I read an issue at the following link too:
>
> https://issues.apache.org/jira/browse/SOLR-2917
>
>
> Is writing our own parsers to filter text documents before indexing to Solr
> currently the only right approach? Please let me know if anyone has come
> across this issue and has other, better suggestions.
>

Until SOLR-2917 is resolved, this solution seems the easiest to implement.
Alternatively, you could provide a custom implementation of Carrot2's
tokenizer (
http://download.carrot2.org/stable/javadoc/org/carrot2/text/analysis/ITokenizer.html)
through the appropriate factory attribute (
http://doc.carrot2.org/#section.attribute.lingo.PreprocessingPipeline.tokenizerFactory).
The custom implementation would need to apply the required filtering.
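For reference, the factory would be set as an engine attribute in
solrconfig.xml, along these lines (the class name is a placeholder for your own
implementation):

  <lst name="engine">
    <str name="name">default</str>
    <str name="PreprocessingPipeline.tokenizerFactory">com.example.FilteringTokenizerFactory</str>
  </lst>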

Regardless of the approach, one thing to keep in mind is that Carrot2 draws
labels from the input text, so if your filtered stream omits e.g.
prepositions, the labels will be less readable.

Staszek


Re: System requirements in my case?

2012-05-22 Thread Stanislaw Osinski
>
> 3) Measure the size of the index folder, multiply by 8 to get an idea of the
>> total index size
>>
> With 12,000 docs my index folder size is 33 MB.
> PS: I use "solr.clustering.enabled=true"


Clustering is performed at search time, it doesn't affect the size of the
index (but obviously it does affect the search response times).

Staszek


Re: Newbie with Carrot2?

2012-05-22 Thread Stanislaw Osinski
Hi Bruno,

Just to confirm -- are you seeing the clusters array in the result at all
(<arr name="clusters">)? To get reasonable clusters, you should request at
least 30-50 documents (rows), but even with smaller values, you should see
an empty clusters array.

Staszek

On Sun, May 20, 2012 at 9:20 PM, Bruno Mannina  wrote:

> On 20/05/2012 11:43, Stanislaw Osinski wrote:
>
>  Hi Bruno,
>>
>> Here's the wiki documentation for Solr's clustering component:
>>
>> http://wiki.apache.org/solr/ClusteringComponent
>>
>> For configuration examples, take a look at the Configuration section:
>> http://wiki.apache.org/solr/ClusteringComponent#Configuration.
>>
>> If you hit any problems, let me know.
>>
>> Staszek
>>
>> On Sun, May 20, 2012 at 11:38 AM, Bruno Mannina  wrote:
>>
>>  Dear all,
>>>
>>> I use Solr 3.6.0 and I indexed some documents (around 12000).
>>> Each document contains an Abstract-en field (and some other fields).
>>>
>>> Is it possible to use Carrot2 to create clusters (classes) from the
>>> Abstract-en field?
>>>
>>> What must I configure in the schema.xml ? or in other files?
>>>
>>> Sorry for my newbie question, but I found only documentation for
>>> Workbench
>>> tool.
>>>
>>> Bruno
>>>
>>>  Thanks for this link, but I have a problem configuring my solrconfig.xml
> in the section:
> (note I run java -Dsolr.clustering.enabled=true)
>
> I have a field named abstract-en, and I would like to use only this field.
>
> I would like to know if my requestHandler is good?
> I have a doubt with the content of  : carrot.title, carrot.url
>
> and also the latest field
> abstract-en
> edismax
> 
>  abstract-en^1.0
> 
> *:*
> 10
> *,score
>
> because the result when I do a request is exactly like a search request
> (without more information)
>
>
> My entire requestHandler is:
>
>  enable="${solr.clustering.enabled:false}" class="solr.SearchHandler">
> 
> true
> default
> true
> 
> name
> id
> 
> abstract-en
> 
> true
> 
> 
> 
> false
> abstract-en
> edismax
> 
>  abstract-en^1.0
> 
> *:*
> 10
> *,score
> 
> 
> clustering
> 
> 
>
>


Re: using Carrot2 custom ITokenizerFactory

2012-05-21 Thread Stanislaw Osinski
process(Controller.java:333)
>at org.carrot2.core.Controller.process(Controller.java:240)
>at org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:220)
>... 24 more
> Caused by: org.carrot2.util.attribute.AttributeBindingException: Could not assign field
> org.carrot2.text.preprocessing.pipeline.CompletePreprocessingPipeline#tokenizerFactory with value
> org.apache.solr.handler.clustering.carrot2.LuceneCarrot2TokenizerFactory
>at org.carrot2.util.attribute.AttributeBinder$AttributeBinderActionBind.performAction(AttributeBinder.java:614)
>at org.carrot2.util.attribute.AttributeBinder.bind(AttributeBinder.java:311)
>at org.carrot2.util.attribute.AttributeBinder.bind(AttributeBinder.java:349)
>at org.carrot2.util.attribute.AttributeBinder.bind(AttributeBinder.java:219)
>at org.carrot2.util.attribute.AttributeBinder.set(AttributeBinder.java:149)
>at org.carrot2.util.attribute.AttributeBinder.set(AttributeBinder.java:129)
>at org.carrot2.core.ControllerUtils.init(ControllerUtils.java:50)
>at org.carrot2.core.PoolingProcessingComponentManager$ComponentInstantiationListener.objectInstantiated(PoolingProcessingComponentManager.java:189)
>... 30 more
> Caused by: java.lang.IllegalArgumentException: Can not set
> org.carrot2.text.linguistic.ITokenizerFactory field
> org.carrot2.text.preprocessing.pipeline.BasicPreprocessingPipeline.tokenizerFactory to java.lang.String
>at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:146)
>at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:150)
>at sun.reflect.UnsafeObjectFieldAccessorImpl.set(UnsafeObjectFieldAccessorImpl.java:63)
>at java.lang.reflect.Field.set(Field.java:657)
>at org.carrot2.util.attribute.AttributeBinder$AttributeBinderActionBind.performAction(AttributeBinder.java:610)
>... 37 more
>
>
> I should dig in, but if you have any clue, it would be appreciated. I'm
> using 3.6 branch.
>
>
> koji
> --
> Query Log Visualizer for Apache Solr
> http://soleami.com/
>
> (12/05/20 21:11), Stanislaw Osinski wrote:
>
>> Hi Koji,
>>
>> It's fixed in trunk and 3.6.1 branch now. If you hit any other issues with
>> this, let me know.
>>
>> Staszek
>>
>> On Sun, May 20, 2012 at 1:02 PM, Koji Sekiguchi
>>  wrote:
>>
>>  Hi Staszek,
>>>
>>> I'll wait your fix. Thank you!
>>>
>>> Koji Sekiguchi from iPad2
>>>
>>> On 2012/05/20, at 18:18, Stanislaw Osinski
>>>  wrote:
>>>
>>>  Hi Koji,
>>>>
>>>> You're right, the current code overwrites the custom tokenizer though it
>>>> shouldn't. LuceneCarrot2TokenizerFactory is there to avoid circular
>>>> dependencies (Carrot2 default tokenizer depends on Lucene), but it
>>>> shouldn't be an issue with custom tokenizers.
>>>>
>>>> I'll try to commit a fix later today. Meanwhile, if you have a chance to
>>>> recompile the code, a temporary solution would be to hardcode your
>>>> tokenizer class into the fragment you pasted:
>>>>
>>>>   BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes)
>>>>   .stemmerFactory(LuceneCarrot2StemmerFactory.class)
>>>>   .tokenizerFactory(YourCustomTokenizer.class)
>>>>   .lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class);
>>>>
>>>> Staszek
>>>>
>>>> On Sun, May 20, 2012 at 9:40 AM, Koji Sekiguchi
>>>>
>>> wrote:
>>>
>>>>
>>>>  Hello,
>>>>>
>>>>> As I'd like to use custom ITokenizerFactory, I set the following
>>>>> Carrot2
>>>>> key
>>>>> in solrconfig.xml:
>>>>>
>>>>>  enable="${solr.clustering.enabled:true}"
>>>>>  class="solr.clustering.ClusteringComponent">
>>>>>   
>>>>> default
>>>>>:
>>>>>  name="PreprocessingPipeline.tokenizerFactory">my.own.TokenizerFactory
>>>>>   
>>>>> 
>>>>>
>>>>> But seems that CarrotClusteringEngine overwrites it with
>>>>> LuceneCarrot2TokenizerFactory
>>>>> in init() method:
>>>>>
>>>>>   BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes)
>>>>>   .stemmerFactory(LuceneCarrot2StemmerFactory.class)
>>>>>   .tokenizerFactory(LuceneCarrot2TokenizerFactory.class)
>>>>>   .lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class);
>>>>>
>>>>> Am I missing something?
>>>>>
>>>>> koji
>>>>> --
>>>>> Query Log Visualizer for Apache Solr
>>>>> http://soleami.com/
>>>>>
>>>>>
>>>
>>
>


Re: using Carrot2 custom ITokenizerFactory

2012-05-20 Thread Stanislaw Osinski
After a bit of digging: the error message in the exception is a bit
misleading, but what really happens is that the code cannot load
the org.apache.solr.handler.clustering.carrot2.LuceneCarrot2TokenizerFactory
class. The class is being loaded by Carrot2 code (
https://github.com/carrot2/carrot2/blob/master/core/carrot2-util-common/src/org/carrot2/util/ReflectionUtils.java#L47),
which doesn't seem to play well with how Solr loads classes. We'll be
looking for ways to properly fix it, any hints would be helpful.

Meanwhile, a quick and dirty way of fixing the config would be to make the
clustering component and Carrot2 JARs available to the context classloader
by copying them to WEB-INF/lib of the WAR.

Staszek

On Sun, May 20, 2012 at 6:16 PM, Stanislaw Osinski <
stanislaw.osin...@carrotsearch.com> wrote:

> Interesting... let me investigate.
>
> S.
>
>
> On Sun, May 20, 2012 at 5:15 PM, Koji Sekiguchi wrote:
>
>> Hi Staszek,
>>
>> Thank you for the fix so quickly!
>>
>> As a trial, I set:
>>
>> org.apache.solr.handler.clustering.carrot2.LuceneCarrot2TokenizerFactory</str>
>>
>> then I could start Solr without error. But when I make a request:
>>
>> http://localhost:8983/solr/clustering?q=*%3A*&version=2.2&start=0&rows=10&indent=on&wt=json&fl=id&carrot.produceSummary=false
>>
>> I got an exception:
>>
>> org.apache.solr.common.SolrException: Carrot2 clustering failed
>>at org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:224)
>>at org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
>>at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:186)
>>at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>>at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
>>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
>>at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
>>at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
>>at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>>at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>>at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>>at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>>at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>>at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>>at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>>at org.mortbay.jetty.Server.handle(Server.java:326)
>>at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>>at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>>at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>>at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>>at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>>at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
>>at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
>> Caused by: org.carrot2.core.ComponentInitializationException:
>> org.carrot2.util.attribute.AttributeBindingException: Could not assign
>> field org.carrot2.text.preprocessing.pipeline.CompletePreprocessingPipeline#tokenizerFactory with value
>> org.apache.solr.handler.clustering.carrot2.LuceneCarrot2TokenizerFactory
>>at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
&

Re: using Carrot2 custom ITokenizerFactory

2012-05-20 Thread Stanislaw Osinski
.carrot2.core.Controller.process(Controller.java:240)
>at org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:220)
>... 24 more
> Caused by: org.carrot2.util.attribute.AttributeBindingException: Could not assign field
> org.carrot2.text.preprocessing.pipeline.CompletePreprocessingPipeline#tokenizerFactory with value
> org.apache.solr.handler.clustering.carrot2.LuceneCarrot2TokenizerFactory
>at org.carrot2.util.attribute.AttributeBinder$AttributeBinderActionBind.performAction(AttributeBinder.java:614)
>at org.carrot2.util.attribute.AttributeBinder.bind(AttributeBinder.java:311)
>at org.carrot2.util.attribute.AttributeBinder.bind(AttributeBinder.java:349)
>at org.carrot2.util.attribute.AttributeBinder.bind(AttributeBinder.java:219)
>at org.carrot2.util.attribute.AttributeBinder.set(AttributeBinder.java:149)
>at org.carrot2.util.attribute.AttributeBinder.set(AttributeBinder.java:129)
>at org.carrot2.core.ControllerUtils.init(ControllerUtils.java:50)
>at org.carrot2.core.PoolingProcessingComponentManager$ComponentInstantiationListener.objectInstantiated(PoolingProcessingComponentManager.java:189)
>... 30 more
> Caused by: java.lang.IllegalArgumentException: Can not set
> org.carrot2.text.linguistic.ITokenizerFactory field
> org.carrot2.text.preprocessing.pipeline.BasicPreprocessingPipeline.tokenizerFactory to java.lang.String
>at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:146)
>at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:150)
>at sun.reflect.UnsafeObjectFieldAccessorImpl.set(UnsafeObjectFieldAccessorImpl.java:63)
>at java.lang.reflect.Field.set(Field.java:657)
>at org.carrot2.util.attribute.AttributeBinder$AttributeBinderActionBind.performAction(AttributeBinder.java:610)
>... 37 more
>
>
> I should dig in, but if you have any clue, it would be appreciated. I'm
> using 3.6 branch.
>
>
> koji
> --
> Query Log Visualizer for Apache Solr
> http://soleami.com/
>
> (12/05/20 21:11), Stanislaw Osinski wrote:
>
>> Hi Koji,
>>
>> It's fixed in trunk and 3.6.1 branch now. If you hit any other issues with
>> this, let me know.
>>
>> Staszek
>>
>> On Sun, May 20, 2012 at 1:02 PM, Koji Sekiguchi
>>  wrote:
>>
>>  Hi Staszek,
>>>
>>> I'll wait your fix. Thank you!
>>>
>>> Koji Sekiguchi from iPad2
>>>
>>> On 2012/05/20, at 18:18, Stanislaw Osinski
>>>  wrote:
>>>
>>>  Hi Koji,
>>>>
>>>> You're right, the current code overwrites the custom tokenizer though it
>>>> shouldn't. LuceneCarrot2TokenizerFactory is there to avoid circular
>>>> dependencies (Carrot2 default tokenizer depends on Lucene), but it
>>>> shouldn't be an issue with custom tokenizers.
>>>>
>>>> I'll try to commit a fix later today. Meanwhile, if you have a chance to
>>>> recompile the code, a temporary solution would be to hardcode your
>>>> tokenizer class into the fragment you pasted:
>>>>
>>>>   BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes)
>>>>   .stemmerFactory(LuceneCarrot2StemmerFactory.class)
>>>>   .tokenizerFactory(YourCustomTokenizer.class)
>>>>   .lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class);
>>>>
>>>> Staszek
>>>>
>>>> On Sun, May 20, 2012 at 9:40 AM, Koji Sekiguchi
>>>>
>>> wrote:
>>>
>>>>
>>>>  Hello,
>>>>>
>>>>> As I'd like to use custom ITokenizerFactory, I set the following
>>>>> Carrot2
>>>>> key
>>>>> in solrconfig.xml:
>>>>>
>>>>>  enable="${solr.clustering.enabled:true}"
>>>>>  class="solr.clustering.ClusteringComponent">
>>>>>   
>>>>> default
>>>>>:
>>>>>  name="PreprocessingPipeline.tokenizerFactory">my.own.TokenizerFactory
>>>>>   
>>>>> 
>>>>>
>>>>> But seems that CarrotClusteringEngine overwrites it with
>>>>> LuceneCarrot2TokenizerFactory
>>>>> in init() method:
>>>>>
>>>>>   BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes)
>>>>>   .stemmerFactory(LuceneCarrot2StemmerFactory.class)
>>>>>   .tokenizerFactory(LuceneCarrot2TokenizerFactory.class)
>>>>>   .lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class);
>>>>>
>>>>> Am I missing something?
>>>>>
>>>>> koji
>>>>> --
>>>>> Query Log Visualizer for Apache Solr
>>>>> http://soleami.com/
>>>>>
>>>>>
>>>
>>
>


Re: using Carrot2 custom ITokenizerFactory

2012-05-20 Thread Stanislaw Osinski
Hi Koji,

It's fixed in trunk and 3.6.1 branch now. If you hit any other issues with
this, let me know.

Staszek

On Sun, May 20, 2012 at 1:02 PM, Koji Sekiguchi  wrote:

> Hi Staszek,
>
> I'll wait your fix. Thank you!
>
> Koji Sekiguchi from iPad2
>
> On 2012/05/20, at 18:18, Stanislaw Osinski  wrote:
>
> > Hi Koji,
> >
> > You're right, the current code overwrites the custom tokenizer though it
> > shouldn't. LuceneCarrot2TokenizerFactory is there to avoid circular
> > dependencies (Carrot2 default tokenizer depends on Lucene), but it
> > shouldn't be an issue with custom tokenizers.
> >
> > I'll try to commit a fix later today. Meanwhile, if you have a chance to
> > recompile the code, a temporary solution would be to hardcode your
> > tokenizer class into the fragment you pasted:
> >
> >   BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes)
> >   .stemmerFactory(LuceneCarrot2StemmerFactory.class)
> >   .tokenizerFactory(YourCustomTokenizer.class)
> >   .lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class);
> >
> > Staszek
> >
> > On Sun, May 20, 2012 at 9:40 AM, Koji Sekiguchi 
> wrote:
> >
> >> Hello,
> >>
> >> As I'd like to use custom ITokenizerFactory, I set the following Carrot2
> >> key
> >> in solrconfig.xml:
> >>
> >>  >>  enable="${solr.clustering.enabled:true}"
> >>  class="solr.clustering.ClusteringComponent" >
> >>   
> >> default
> >>:
> >>  >>
> name="PreprocessingPipeline.tokenizerFactory">my.own.TokenizerFactory
> >>   
> >> 
> >>
> >> But seems that CarrotClusteringEngine overwrites it with
> >> LuceneCarrot2TokenizerFactory
> >> in init() method:
> >>
> >>   BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes)
> >>   .stemmerFactory(LuceneCarrot2StemmerFactory.class)
> >>   .tokenizerFactory(LuceneCarrot2TokenizerFactory.class)
> >>   .lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class);
> >>
> >> Am I missing something?
> >>
> >> koji
> >> --
> >> Query Log Visualizer for Apache Solr
> >> http://soleami.com/
> >>
>


Re: Newbie with Carrot2?

2012-05-20 Thread Stanislaw Osinski
Hi Bruno,

Here's the wiki documentation for Solr's clustering component:

http://wiki.apache.org/solr/ClusteringComponent

For configuration examples, take a look at the Configuration section:
http://wiki.apache.org/solr/ClusteringComponent#Configuration.

If you hit any problems, let me know.

Staszek

On Sun, May 20, 2012 at 11:38 AM, Bruno Mannina  wrote:

> Dear all,
>
> I use Solr 3.6.0 and I indexed some documents (around 12000).
> Each document contains an Abstract-en field (and some other fields).
>
> Is it possible to use Carrot2 to create clusters (classes) from the
> Abstract-en field?
>
> What must I configure in the schema.xml ? or in other files?
>
> Sorry for my newbie question, but I found only documentation for Workbench
> tool.
>
> Bruno
>


Re: using Carrot2 custom ITokenizerFactory

2012-05-20 Thread Stanislaw Osinski
Hi Koji,

You're right, the current code overwrites the custom tokenizer though it
shouldn't. LuceneCarrot2TokenizerFactory is there to avoid circular
dependencies (Carrot2 default tokenizer depends on Lucene), but it
shouldn't be an issue with custom tokenizers.

I'll try to commit a fix later today. Meanwhile, if you have a chance to
recompile the code, a temporary solution would be to hardcode your
tokenizer class into the fragment you pasted:

   BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes)
   .stemmerFactory(LuceneCarrot2StemmerFactory.class)
   .tokenizerFactory(YourCustomTokenizer.class)
   .lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class);

Staszek

On Sun, May 20, 2012 at 9:40 AM, Koji Sekiguchi  wrote:

> Hello,
>
> As I'd like to use custom ITokenizerFactory, I set the following Carrot2
> key
> in solrconfig.xml:
>
> enable="${solr.clustering.enabled:true}"
>   class="solr.clustering.ClusteringComponent" >
>
>  default
> :
>   name="PreprocessingPipeline.tokenizerFactory">my.own.TokenizerFactory
>
>  
>
> But seems that CarrotClusteringEngine overwrites it with
> LuceneCarrot2TokenizerFactory
> in init() method:
>
>BasicPreprocessingPipelineDescriptor.attributeBuilder(initAttributes)
>.stemmerFactory(LuceneCarrot2StemmerFactory.class)
>.tokenizerFactory(LuceneCarrot2TokenizerFactory.class)
>.lexicalDataFactory(SolrStopwordsCarrot2LexicalDataFactory.class);
>
> Am I missing something?
>
> koji
> --
> Query Log Visualizer for Apache Solr
> http://soleami.com/
>


Re: Old Google Guava library needs updating (r05)

2012-03-26 Thread Stanislaw Osinski
I've filed an issue for myself as a reminder. Guava r05 is pretty old
indeed, time to upgrade.

S.

On Mon, Mar 26, 2012 at 23:12, Nicholas Ball wrote:

>
> Hey Staszek,
>
> Thanks for the reply. Yep using 4.x and that was exactly what I ended up
> doing, a quick replace :)
> Just thought I'd document it somewhere for a proper fix to be done in the
> 4.0 release.
>
> No issues arose for me but then again Erick mentions it's only used in
> Carrot2 contrib which I'm not using in my deployment.
>
> Thanks for the help!
> Nick
>
> On Mon, 26 Mar 2012 22:40:14 +0200, Stanislaw Osinski
>  wrote:
> > Hi Nick,
> >
> > Which version of Solr do you have in mind? The official 3.x line or 4.0?
> >
> > The quick and dirty fix to try would be to just replace Guava r05 with
> the
> > latest version, chances are it will work (we did that in the past though
> > the version number difference was smaller).
> >
> > The proper fix would be for us to make a point release of Carrot2 with
> > dependencies updated and update Carrot2 in Solr. And this brings us to
> the
> > question about the version of Solr you use. Upgrading Carrot2 in 4.0
> > shouldn't be an issue, but when it comes to 3.x I'd need to check.
> >
> > Staszek
> >
> > On Mon, Mar 26, 2012 at 13:10, Erick Erickson
> > wrote:
> >
> >> Hmmm, near as I can tell, guava is only used in the Carrot2 contrib, so
> >> maybe
> >> ask over at: http://project.carrot2.org/?
> >>
> >> Best
> >> Erick
> >>
> >> On Sat, Mar 24, 2012 at 3:31 PM, Nicholas Ball
> >>  wrote:
> >> >
> >> > Hey all,
> >> >
> >> > Working on a plugin, which uses the Curator library (ZooKeeper
> client).
> >> > Curator depends on the very latest Google Guava library which
> >> unfortunately
> >> > clashes with Solr's outdated r05 of Guava.
> >> > Think it's safe to say that Solr should be using the very latest
> Guava
> >> > library (11.0.1) too right?
> >> > Shall I open up a JIRA issue for someone to update it?
> >> >
> >> > Cheers,
> >> > Nick
> >>
>


Re: Old Google Guava library needs updating (r05)

2012-03-26 Thread Stanislaw Osinski
Hi Nick,

Which version of Solr do you have in mind? The official 3.x line or 4.0?

The quick and dirty fix to try would be to just replace Guava r05 with the
latest version, chances are it will work (we did that in the past though
the version number difference was smaller).

The proper fix would be for us to make a point release of Carrot2 with
dependencies updated and update Carrot2 in Solr. And this brings us to the
question about the version of Solr you use. Upgrading Carrot2 in 4.0
shouldn't be an issue, but when it comes to 3.x I'd need to check.

Staszek

On Mon, Mar 26, 2012 at 13:10, Erick Erickson wrote:

> Hmmm, near as I can tell, guava is only used in the Carrot2 contrib, so
> maybe
> ask over at: http://project.carrot2.org/?
>
> Best
> Erick
>
> On Sat, Mar 24, 2012 at 3:31 PM, Nicholas Ball
>  wrote:
> >
> > Hey all,
> >
> > Working on a plugin, which uses the Curator library (ZooKeeper client).
> > Curator depends on the very latest Google Guava library which
> unfortunately
> > clashes with Solr's outdated r05 of Guava.
> > Think it's safe to say that Solr should be using the very latest Guava
> > library (11.0.1) too right?
> > Shall I open up a JIRA issue for someone to update it?
> >
> > Cheers,
> > Nick
>


Re: Solr 3.5.0 can't find Carrot classes

2012-01-26 Thread Stanislaw Osinski
Hi,

Can you paste the logs from the second run?

Thanks,

Staszek

On Wed, Jan 25, 2012 at 00:12, Christopher J. Bottaro  wrote:

> On Tuesday, January 24, 2012 at 3:07 PM, Christopher J. Bottaro wrote:
> > SEVERE: java.lang.NoClassDefFoundError:
> org/carrot2/core/ControllerFactory
> > at
> org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.<init>(CarrotClusteringEngine.java:102)
> > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> > at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
> Source)
> > at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
> > at java.lang.reflect.Constructor.newInstance(Unknown Source)
> > at java.lang.Class.newInstance0(Unknown Source)
> > at java.lang.Class.newInstance(Unknown Source)
> >
> > …
> >
> > I'm starting Solr with -Dsolr.clustering.enabled=true and I can see that
> the Carrot jars in contrib are getting loaded.
> >
> > Full log file is here:
> http://onespot-development.s3.amazonaws.com/solr.log
> >
> > Any ideas?  Thanks for the help.
> >
> Ok, got a little further.  Seems that Solr doesn't like it if you include
> jars more than once (I had a lib dir and also <lib> directives in the
> solrconfig which ended up loading the same jars twice).
>
> But now I'm getting these errors:  java.lang.NoClassDefFoundError:
> org/apache/solr/handler/clustering/SearchClusteringEngine
>
> Any help?  Thanks.


Re: Weird docs-id clustering output in Solr 1.4.1

2011-12-01 Thread Stanislaw Osinski
Hi Vadim,

I've had limited connectivity, so I couldn't check out the complete 1.4.1
code and test the changes. Here's what you can try:

In this file:

http://svn.apache.org/viewvc/lucene/solr/tags/release-1.4.1/contrib/clustering/src/main/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngine.java?revision=957515&view=markup

around line 216 you will see:

for (Document doc : docs) {
  docList.add(doc.getField("solrId"));
}

You need to change this to:

for (Document doc : docs) {
  docList.add(doc.getField("solrId").toString());
}

Let me know if this did the trick.

Cheers,

S.

On Thu, Dec 1, 2011 at 10:43, Vadim Kisselmann
wrote:

> Hi Stanislaw,
> did you already have time to create a patch?
> If not, can you tell me please which lines in which class in source code
> are relevant?
> Thanks and regards
> Vadim Kisselmann
>
>
>
> 2011/11/29 Vadim Kisselmann 
>
> > Hi,
> > the quick and dirty way sounds good :)
> > It would be great if you can send me a patch for 1.4.1.
> >
> >
> > By the way, I tested Solr 3.5 with my 1.4.1 test index.
> > I can search and optimize, but clustering doesn't work (java.lang.Integer
> > cannot be cast to java.lang.String).
> > The uniqueKey for my docs is the "id" field (sint).
> > This was the error message:
> >
> >
> > Problem accessing /solr/select/. Reason:
> >
> >Carrot2 clustering failed
> >
> > org.apache.solr.common.SolrException: Carrot2 clustering failed
> >at
> >
> org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:217)
> >at
> >
> org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
> >at
> >
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
> >at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> >at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
> >at
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
> >at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
> >at
> >
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> >at
> > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> >at
> >
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >at
> > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> >at
> > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> >at
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> >at
> >
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> >at
> >
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> >at
> > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> >at org.mortbay.jetty.Server.handle(Server.java:326)
> >at
> > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> >at
> >
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> >at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> >at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> >at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> >at
> >
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
> >at
> >
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> > Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast
> > to java.lang.String
> >at
> >
> org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.getDocuments(CarrotClusteringEngine.java:364)
> >at
> >
> org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:201)
> >... 23 more
> >
> > It this case it's better for me to upgrade/patch the 1.4.1 version.
> >
> > Best regards
> > Vadim
> >
> >
> >
> >
> > 2011/11/29 Stanislaw Osinski 
> >
> >> >
> >> > But my actual live system works on Solr 1.4.1. I can only change my
> >> > solrconfig.xml and integrate new packages...
> >> > I'm checking the possibility of upgrading from 1.4.1 to 3.5 with the same
> index
> >> > (without reindexing) with luceneMatchVersion 2.9.
> >> > I hope it works...
> >> >
> >>
> >> Another option would be to check out Solr 1.4.1 source code, fix the
> issue
> >> and recompile the clustering component. The quick and dirty way would be
> >> to
> >> convert all identifiers to strings in the clustering component, before
> >> they are returned for serialization (I can send you a patch that does
> >> this). The proper way would be to fix the root cause of the problem, but
> >> I'd need to dig deeper into the code to find this.
> >>
> >> Staszek
> >>
> >
> >
>


Re: Weird docs-id clustering output in Solr 1.4.1

2011-11-29 Thread Stanislaw Osinski
>
> But my actual live system works on Solr 1.4.1. I can only change my
> solrconfig.xml and integrate new packages...
> I'm checking the possibility of upgrading from 1.4.1 to 3.5 with the same index
> (without reindexing) with luceneMatchVersion 2.9.
> I hope it works...
>

Another option would be to check out Solr 1.4.1 source code, fix the issue
and recompile the clustering component. The quick and dirty way would be to
convert all identifiers to strings in the clustering component, before
they are returned for serialization (I can send you a patch that does
this). The proper way would be to fix the root cause of the problem, but
I'd need to dig deeper into the code to find this.

Staszek


Re: Weird docs-id clustering output in Solr 1.4.1

2011-11-29 Thread Stanislaw Osinski
Hi,

It looks like some serialization issue related to writing integer ids to
the output. I've just tried a similar configuration on Solr 3.5 and the
integer identifiers looked fine. Can you try the same configuration on Solr
3.5?

Thanks,

Staszek

On Tue, Nov 29, 2011 at 12:03, Vadim Kisselmann  wrote:

> Hi folks,
> i've installed the clustering component in solr 1.4.1 and it works, but not
> really:)
>
> You can see that the doc ids are corrupt.
>
> 
> 
> Euro-Krise
> 
> ½Íџ
> ¾౥ͽ
> ¿)ై
> ˆ࡯׸
> 
>
> my fields:
> 
>  required="true"/>
>  required="true"/>
>  multiValued="true" compressed="true"/>
>
> and my config-snippets:
> title
>  id
>  
>  text
>
> I changed my config snippets (carrot.url=id, url, title, ...) but the
> result is the same.
> Does anyone have an idea?
>
> best regards and thanks
> vadim
>


Re: Clustering and FieldType

2011-11-25 Thread Stanislaw Osinski
Hi,

You're right -- currently Carrot2 clustering ignores the Solr analysis
chain and uses its own pipeline. It is possible to integrate with Solr's
analysis components to some extent, see the discussion here:
https://issues.apache.org/jira/browse/SOLR-2917.

Staszek


> > Hi
> > Trying to use carrot2 for clustering search results. I have it setup
> except it seems to treat the field as regular text instead of applying some
> custom filters I have.
> >
> > So my schema says something like
> >  omitNorms="true"/>
> >  compressed="true"/>
> >
> > ic_text is our internal fieldtype with some custom analysers that strip
> out certain special characters from the text.
> >
> > My solrconfig has something like this setup in our default search
> handler.
> > true
> > default
> > true
> > 
> > title
> > 
> > content
> >
> > In my search results, I see clusters but the labels on these clusters
> have the special characters in them - which means that the clustering must
> be running on raw text and not on the "ic_text" field.
> > Can someone let me know if this is the default setup and if there is a
> way to fix this ?
> > Thanks !
> > Geetu
> >
>


Re: Clustering not working when using 'text' field as snippet.

2011-08-12 Thread Stanislaw Osinski
Hi Pablo,

The reason clustering doesn't work with the "text" field is that the field
is not stored:

  <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

For clustering to work, you'll need to keep your documents' titles and
content in stored fields.
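For example, something along these lines in schema.xml (field and type names
are illustrative):

  <field name="title"   type="text_general" indexed="true" stored="true"/>
  <field name="content" type="text_general" indexed="true" stored="true"/>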

Staszek


On Fri, Aug 12, 2011 at 10:28, Pablo Queixalos  wrote:

> Hi,
>
>
>
>
>
> I am using solr-3.3.0 and carrot² clustering which works fine out of the
> box with the examples doc and default solr configuration (the 'features'
> Field is used as snippet).
>
>
>
> I indexed my own documents using the embedded ExtractingRequestHandler which by
> default stores contents in the 'text' Field. When configuring clustering on
> 'text' as snippet, carrot doesn't work fine and only shows 'Other topics'
> with all the documents within. It looks like carrot doesn't get the 'text'
> Field stored content.
>
>
>
>
>
> If I store the documents content in the 'features' field and get back to
> the original configuration clustering works fine.
>
>
>
> The only difference I see between 'text' and 'features' Fields in
> schema.xml is that some CopyFields are defined for 'text'.
>
>
>
>
>
> I didn't debug solr.clustering.ClusteringComponent nor
> CarrotClusteringEngine yet, but am I misunderstanding something about the
> 'text' Field ?
>
>
>
>
>
> Thanks,
>
>
>
> Pablo.
>
>


Re: How to use solr clustering to show in search results

2011-07-01 Thread Stanislaw Osinski
The "docs" array contained in each cluster contains ids of documents
belonging to the cluster, so for each id you need to look up the document's
content, which comes earlier in the response (in the response/docs array).
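If you access Solr through SolrJ, a minimal sketch of that lookup could look
like this (assuming the unique key field is "id" and the display fields are
"name" and "description"):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.util.NamedList;

  public class ClusterRenderer {
    @SuppressWarnings("unchecked")
    static void printClusters(QueryResponse rsp) {
      // Index the documents returned in response/docs by their unique key.
      final Map<String, SolrDocument> byId = new HashMap<String, SolrDocument>();
      for (SolrDocument doc : rsp.getResults()) {
        byId.put(String.valueOf(doc.getFieldValue("id")), doc);
      }

      // Resolve each cluster's "docs" ids against that map.
      final List<NamedList<Object>> clusters =
          (List<NamedList<Object>>) rsp.getResponse().get("clusters");
      for (NamedList<Object> cluster : clusters) {
        System.out.println("Cluster: " + cluster.get("labels"));
        for (Object id : (List<Object>) cluster.get("docs")) {
          final SolrDocument doc = byId.get(String.valueOf(id));
          System.out.println("  " + doc.getFieldValue("name") + " - "
              + doc.getFieldValue("description"));
        }
      }
    }
  }

The same mapping applies in any other client: build an id-to-document map from
response/docs, then walk the clusters.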

Cheers,

Staszek

On Thu, Jun 30, 2011 at 11:50, Romi  wrote:

> I wanted to use clustering in my search results. I configured Solr for
> clustering and I got the following JSON for the clusters, but I am not getting
> how to use it to show in search results. Corresponding to one doc I have a
> number of fields, and up till now I am showing name, description and id. Now
> in clusters I have labels and doc ids, so how do I use my docs in clusters?
> I am really confused what to do. Please reply.
>
>
> "clusters":[
>
>{
>   "labels":[
>   "Complement any Business Casual or Semi-formal
> Attire"
>],
>   "docs":[
>"7799",
>"7801"
>]
>  },
>{
>   "labels":[
>"Design"
>],
>   "docs":[
>"8252",
>"7885"
>]
>  },
>{
>   "labels":[
>"Elegant Ring has an Akoya Cultured Pearl"
>],
>   "docs":[
>"8142",
>"8139"
>]
>  },
>{
>   "labels":[
>"Feel Amazing in these Scintillating Earrings
> Perfect"
>],
>   "docs":[
>"12250",
>"12254"
>]
>  },
>{
>   "labels":[
>"Formal Evening Attire"
>],
>   "docs":[
>"8151",
>"8004"
>]
>  },
>{
>   "labels":[
>"Pave Set"
>],
>   "docs":[
>"7788",
>"8169"
>]
>  },
>{
>   "labels":[
>"Subtle Look or Layer it or Attach"
>],
>   "docs":[
>"8014",
>"8012"
>]
>  },
>   {
>   "labels":[
>"Three-stone Setting is Elegant and Fun"
>],
>   "docs":[
>"8335",
>"8337"
>]
>  },
>{
>   "labels":[
>"Other Topics"
>],
>   "docs":[
>"8038",
>"7850",
>"7795",
>"7989",
>"7797"
>]
>  }
> ]
>
>
> -
> Thanks & Regards
> Romi
>


Re: Multicore clustering setup problem

2011-07-01 Thread Stanislaw Osinski
Hi Walter,

That makes sense, but this has always been a multi-core setup, so the paths
> have not changed, and the clustering component worked fine for core0. The
> only thing new is I have fine tuned core1 (to begin implementing it).
> Previously the solrconfig.xml file was very basic. I replaced it with
> core0's solrconfig.xml and made very minor changes to it (unrelated to
> clustering) - it's a nearly identical solrconfig.xml file so I'm surprised
> it doesn't work for core1.
>

I'd probably need to take a look at the whole Solr dir you're working with,
clearly there's something wrong with the classpath of core1.

Again, I'm wondering if perhaps since both cores have the clustering
> component, if it should have a shared configuration in a different file
> used
> by both cores(?). Perhaps the duplicate clusteringComponent configuration
> for both cores is the problem?
>

I'm not an expert on Solr's internals related to core management, but I once
did configure two cores with search results clustering, where the clustering
configuration and <lib> directives were specified for each core separately, so
this is unlikely to be a problem. Another approach would be to put all the JARs
required for clustering in a common directory and point Solr to that lib
using the sharedLib attribute in the <solr> tag:
http://wiki.apache.org/solr/CoreAdmin#solr. But it really should work both
ways.
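For the sharedLib variant, a minimal (legacy-style) solr.xml sketch would be
something like this, with the clustering JARs copied into a lib/ directory under
the Solr home:

  <solr persistent="true" sharedLib="lib">
    <cores adminPath="/admin/cores">
      <core name="core0" instanceDir="core0"/>
      <core name="core1" instanceDir="core1"/>
    </cores>
  </solr>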

If you can somehow e-mail (off-list) the listing of your Solr directory and
contents of your configuration XMLs, I may be able to trace the problem for
you.

Cheers,

Staszek


Re: Solr Clustering For Multiple Pages

2011-07-01 Thread Stanislaw Osinski
>
> I am asking about the filter after clustering. Faceting is based on a
> single field, so if we need to filter we can search in the related field. But
> a cluster is created from multiple fields, so how can we create a
> filter for that?
>
> Example
>
> after clustering you get the following
>
> Model(20)
> System(15)
> Other Topics(5)
>
> if I click on Model then I should get the records associated with Model
>

I'm not sure what you mean by "filter" -- ids of documents belonging to each
cluster are part of the response, see the "docs" array inside the cluster
(see http://wiki.apache.org/solr/ClusteringComponent#Quick_Start for example
output). When the user clicks a cluster, you just need to show the documents
with ids specified inside the cluster the user clicked.

Cheers,

Staszek


Re: Multicore clustering setup problem

2011-06-30 Thread Stanislaw Osinski
It looks like the whole clustering component JAR is not on the classpath. I
remember that I once dealt with a similar issue in Solr 1.4 and the cause
was the relative path of the <lib> tag being resolved against the core's
instanceDir, which made the path incorrect when directly copying and pasting
from the single-core configuration. Try correcting the relative <lib> paths
or replacing them with absolute ones, it should solve the problem.
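For example, with the stock example layout the per-core solrconfig.xml entries
look like the lines below; the number of ../ segments has to match where the
contrib directory actually sits relative to each core's instanceDir:

  <lib dir="../../contrib/clustering/lib/" regex=".*\.jar" />
  <!-- a core nested one directory deeper needs one more ../ -->
  <lib dir="../../../contrib/clustering/lib/" regex=".*\.jar" />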

Cheers,

Staszek


Re: Multicore clustering setup problem

2011-06-29 Thread Stanislaw Osinski
Hi,

Can you post the full stack trace? I'd need to know if it's
really org.apache.solr.handler.clustering.ClusteringComponent that's missing
or some other class ClusteringComponent depends on.

Cheers,

Staszek

On Thu, Jun 30, 2011 at 04:19, Walter Closenfleight <
walter.p.closenflei...@gmail.com> wrote:

> I had set up the clusteringComponent in solrconfig.xml for my first core.
> It
> has been working fine and now I want to get my next core working. I set up
> the second core with the clustering component so that I could use it, use
> solritas properly, etc. but Solr did not like the solrconfig.xml changes
> for
> the second core. I'm getting this error when Solr is started or when I hit
> a
> Solr related URL:
>
> SEVERE: org.apache.solr.common.SolrException: Error loading class
> 'org.apache.solr.handler.clustering.ClusteringComponent'
>
> Should the clusteringComponent be set up in a shared configuration file
> somehow or is there something else I am doing wrong?
>
> Thanks in advance!
>


Re: what is solr clustering component

2011-06-29 Thread Stanislaw Osinski
>
> and my second question is: does clustering affect indexes?
>

No, it doesn't. Clustering is performed only on the search results produced
by Solr, it doesn't change anything in the index.

Cheers,

Staszek


Re: Solr Clustering For Multiple Pages

2011-06-22 Thread Stanislaw Osinski
I don't quite follow, I must admit. Maybe it's faceting you're after?

http://wiki.apache.org/solr/SolrFacetingOverview

Staszek

On Wed, Jun 22, 2011 at 08:40, nilay@gmail.com wrote:

> Can you please tell me how I can apply a filter on cluster data in Solr?
>
> Currently I store the docid and topic name in a Map, get the ids by topic
> from the Map, and then pass them into Solr separated by OR conditions.
>
> Is there any other way to do this?
>
>
>
> -
> Regards
> Nilay Tiwari
>


Re: Solr Clustering For Multiple Pages

2011-06-21 Thread Stanislaw Osinski
Hi,

Currently, only the clustering of search results is implemented in Solr,
clustering of the whole index is not possible out of the box. In other
words, clustering applies only to the records you fetch during searching.
For example, if you set rows=10, only the 10 returned documents will be
clustered. You can try setting larger rows values (e.g. 100, 200, 500) to
get more clusters.
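For example (host, port and handler path here are just the usual defaults):

  http://localhost:8983/solr/select?q=*:*&rows=200&clustering=true&clustering.results=true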

Staszek

On Mon, Jun 20, 2011 at 11:36, nilay@gmail.com wrote:

> Hi
>
> How can I create clusters for all records?
> Currently I am sending the clustering=true param to Solr and it gives the
> clusters in the response,
> but it gives them for 10 rows because rows=10. So please suggest how I can
> get the clusters for all records.
>
> How can I search within a cluster?
>
>  e.g. clusters created:
>   Model(20)
>   Test(10)
>
> if I click on Model then I should get the 20 records by filter, so please give
> me
> an idea about this.
>
>
> Please help me  to resolve this problem
>
> Regards
> Nilay Tiwari
>
>


Re: Mahout & Solr

2011-06-15 Thread Stanislaw Osinski
>
> Is it possible to use the clustering component to use predefined clusters
> generated by Mahout?


Actually, the existing Solr ClusteringComponent's API has been designed to
deal with both search results clustering (implemented by Carrot2) and
off-line clustering of the whole index. The latter has not yet been
implemented, so the API is very likely to change depending on the specific
design decisions (should clustering be triggered through Solr or
externally?, should the clusters be stored in Solr?, how to handle new
documents?, how to use the clusters at search time?).

I can also imagine a simpler approach based on a search results clustering
"algorithm" that would simply fetch Mahout's predefined clusters for each
document being returned in the search result. Getting this to work is a
matter of implementing a dedicated SearchClusteringEngine
(http://lucene.apache.org/solr/api/org/apache/solr/handler/clustering/SearchClusteringEngine.html),
and should be fairly straightforward, at least in terms of interaction with
Solr.

Staszek


Re: solr 3.1 java.lang.NoClassDEfFoundError org/carrot2/core/ControllerFactory

2011-06-08 Thread Stanislaw Osinski
Hi Bryan,

You'll also need to make sure that your Solr distribution's
contrib/clustering/lib directory is in the classpath; that
directory contains the Carrot2 JARs that
provide the classes you're missing. I think the example solrconfig.xml
has the relevant <lib> declarations.

Cheers,

S.

On Tue, Jun 7, 2011 at 13:48, bryan rasmussen wrote:

> As per the subject I am getting java.lang.NoClassDEfFoundError
> org/carrot2/core/ControllerFactory
> when I try to run clustering.
>
> I am using Solr 3.1:
>
> I get the following error:
>
> java.lang.NoClassDefFoundError: org/carrot2/core/ControllerFactory
>at
> org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.<init>(CarrotClusteringEngine.java:74)
>at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
> Source)
>at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> Source)
>at java.lang.reflect.Constructor.newInstance(Unknown Source)
>at java.lang.Class.newInstance0(Unknown Source)
>at java.lang.Class.newInstance(Unknown Source)
>at
> org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:412)
>at
> org.apache.solr.handler.clustering.ClusteringComponent.inform(ClusteringComponent.java:203)
>at
> org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:522)
>at org.apache.solr.core.SolrCore.<init>(SolrCore.java:594)
>at org.apache.solr.core.CoreContainer.create(CoreContainer.java:458)
>at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
>at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
>at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:130)
>at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:94)
>at
> org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>at
> org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
>at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
>at
> org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
>at
> org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
>at
> org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>at
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
>at
> org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>at
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>at
> org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>at org.mortbay.jetty.Server.doStart(Server.java:224)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>at java.lang.reflect.Method.invoke(Unknown Source)
>at org.mortbay.start.Main.invokeMain(Main.java:194)
>at org.mortbay.start.Main.start(Main.java:534)
>at org.mortbay.start.Main.start(Main.java:441)
>at org.mortbay.start.Main.main(Main.java:119)
> Caused by: java.lang.ClassNotFoundException:
> org.carrot2.core.ControllerFactory
>at java.net.URLClassLoader$1.run(Unknown Source)
>at java.security.AccessController.doPrivileged(Native Method)
>at java.net.URLClassLoader.findClass(Unknown Source)
>at java.lang.ClassLoader.loadClass(Unknown Source)
>at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
>
> using the following configuration
>
>
>   class="org.apache.solr.handler.clustering.ClusteringComponent"
> name="clustering">
>  
>default
> name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm
>
>
>20
>  
> 
>   class="org.apache.solr.handler.component.SearchHandler">
>
>  explicit
>
>
>title
>all_text
>all_text title
> 
> 150
>  
>  
>clustering
>  
> 
>
>
>
> with the following command  to start solr
> java -Dsolr.clustering.enabled=true
> -Dsolr.solr.home="C:\projects\solrexample\solr" -jar start.jar
>
> Any idea as to why crusty is not working?
>
> Thanks,
> Bryan Rasmussen
>



Re: assit with the Clustering component in Solr/Lucene

2011-05-16 Thread Stanislaw Osinski
>
> Both of the clustering algorithms that ship with Solr (Lingo and STC) are
>> designed to allow one document to appear in more than one cluster, which
>> actually does make sense in many scenarios. There's no easy way to force
>> them to produce hard clusterings because this would require a complete
>> change in the way the algorithms work. If you need each document to belong
>> to exactly one cluster, you'd have to post-process the clusters to remove
>> the redundant document assignments.
>>
>
> On the second thought, I have a simple implementation of k-means clustering
> that could do hard clustering for you. It's not available yet, it will most
> probably be part of the next major release of Carrot2 (the package that does
> the clustering). Please watch this issue
> http://issues.carrot2.org/browse/CARROT-791 to get updates on this.
>

Just to let you know: Carrot2 3.5.0 has landed in Solr trunk and branch_3x,
so you can use the bisecting k-means clustering algorithm
(org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm) which
will produce non-overlapping clusters for you. The downside of this simple
implementation of k-means is that, for the time being, it produces one-word
cluster labels rather than the phrases produced by Lingo and STC.
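If you want to try it, a rough sketch of selecting it per request from SolrJ
(this assumes you first declare an extra engine in solrconfig.xml -- the name
"kmeans" below is hypothetical -- with carrot.algorithm set to
org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class KMeansEngineSketch {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr"); // hypothetical URL
    SolrQuery query = new SolrQuery("test");
    query.setRows(100);
    query.set("clustering", "true");
    query.set("clustering.results", "true");
    // "kmeans" must match the engine name you gave the
    // BisectingKMeansClusteringAlgorithm entry in solrconfig.xml
    query.set("clustering.engine", "kmeans");
    QueryResponse rsp = server.query(query);
    System.out.println(rsp.getResponse().get("clusters"));
  }
}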

Cheers,

S.


Re: assit with the Clustering component in Solr/Lucene

2011-03-31 Thread Stanislaw Osinski
Thanks for the confirmation, I'll take a look at the issue.

S.

On Thu, Mar 31, 2011 at 17:24,  wrote:

> That did make a difference; I now see the exact number of clusters I see
> from the workbench.
> I am of course interested in why the config changes did not have much
> effect. However, I am happy that adding the threshold to my request URL
> produces the desired results.
>
> let me know if I can do any more tests and I will do so. Thanks much
>
> Ramdev
>
>
>
> On Mar 31, 2011, at 10:18 AM, Stanislaw Osinski wrote:
>
>
>  I added the parameter as you suggested.
>> (LingoClusteringAlgorithm.clusterMergingThreshold) into the searchComponent
>> section that describes the Clustering module
>> Changing the value of the parameter  did not have any effect on my search
>> results.
>>
>> However, when I used the Carrot2 workbench, I could see the effect of
>> changing the value. (from 6 clusters it went down to 2 clusters)
>>
>
> Interesting... Can you, for the sake of debugging, append
> &LingoClusteringAlgorithm.clusterMergingThreshold=0.0 to your request URL?
>
> S.
>
>
>


Re: assit with the Clustering component in Solr/Lucene

2011-03-31 Thread Stanislaw Osinski
>  I added the parameter as you suggested.
> (LingoClusteringAlgorithm.clusterMergingThreshold) into the searchComponent
> section that describes the Clustering module
> Changing the value of the parameter  did not have any effect on my search
> results.
>
> However, when I used the Carrot2 workbench, I could see the effect of
> changing the value. (from 6 clusters it went down to 2 clusters)
>

Interesting... Can you, for the sake of debugging, append
&LingoClusteringAlgorithm.clusterMergingThreshold=0.0 to your request URL?

S.


Re: assit with the Clustering component in Solr/Lucene

2011-03-30 Thread Stanislaw Osinski
> Both of the clustering algorithms that ship with Solr (Lingo and STC) are
> designed to allow one document to appear in more than one cluster, which
> actually does make sense in many scenarios. There's no easy way to force
> them to produce hard clusterings because this would require a complete
> change in the way the algorithms work. If you need each document to belong
> to exactly one cluster, you'd have to post-process the clusters to remove
> the redundant document assignments.
>

On the second thought, I have a simple implementation of k-means clustering
that could do hard clustering for you. It's not available yet, it will most
probably be part of the next major release of Carrot2 (the package that does
the clustering). Please watch this issue
http://issues.carrot2.org/browse/CARROT-791 to get updates on this.

Cheers,

S.


Re: assit with the Clustering component in Solr/Lucene

2011-03-30 Thread Stanislaw Osinski
Hi Ramdev,

Both of the clustering algorithms that ship with Solr (Lingo and STC) are
designed to allow one document to appear in more than one cluster, which
actually does make sense in many scenarios. There's no easy way to force
them to produce hard clusterings because this would require a complete
change in the way the algorithms work. If you need each document to belong
to exactly one cluster, you'd have to post-process the clusters to remove
the redundant document assignments. Alternatively, in case of the Lingo
algorithm, you can try lowering the
"LingoClusteringAlgorithm.clusterMergingThreshold" to some value in the
range of 0.2--0.5. If you do that, clusters containing overlapping documents
will get merged. For more information about this attribute, see here:
http://download.carrot2.org/stable/manual/#section.attribute.LingoClusteringAlgorithm.clusterMergingThreshold
.
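The attribute doesn't have to be set in solrconfig.xml; it can also be passed
as a plain request parameter. A minimal SolrJ sketch (hypothetical URL and
query):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MergingThresholdSketch {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery query = new SolrQuery("test");
    query.setRows(100);
    query.set("clustering", "true");
    query.set("clustering.results", "true");
    // lower values merge clusters that share many documents; try 0.2--0.5
    query.set("LingoClusteringAlgorithm.clusterMergingThreshold", "0.3");
    QueryResponse rsp = server.query(query);
    System.out.println(rsp.getResponse().get("clusters"));
  }
}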

Cheers,

Staszek

On Wed, Mar 30, 2011 at 18:21, Markus Jelsma wrote:

> Yes, you can set engine-specific parameters. Check the comments in your
> snippet.
>
> > Hi:
> >   I recently included the Clustering component into Solr and updated the
> > requestHandler accordingly (in solrconfig.xml). Snippet of the config for
> > the clustering:
> >
> >> name="clusteringComponent"
> > enable="${solr.clustering.enabled:false}"
> > class="org.apache.solr.handler.clustering.ClusteringComponent" >
> > 
> > 
> >   
> >   default
> >   
> >>
> name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgori
> > thm 
> >name="LingoClusteringAlgorithm.desiredClusterCountBase">20
> > 
> > 
> >   stc
> >>
> name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm<
> > /str> 
> >   
> >
> > snippet of the Config for requestHandler
> >> default="true"> 
> >  
> >explicit
> >
> >true
> >default
> >true
> >
> >headline
> >pi
> >
> >headline
> >
> >true
> >
> >
> >
> >false
> >  
> > 
> >   clusteringComponent
> > 
> >   
> >
> >
> > When I perform a search, I see that the Cluster section within the Solr
> > results shows me results that are not quite consistent. There are two
> > documents that are reported in two different documents
> >
> > Are there parameters that can be set that will prevent this from
> happening
> > ?
> >
> >
> > Thanks much
> >
> > Ramdev
>


Re: Carrot2 clustering component

2011-01-18 Thread Stanislaw Osinski
Hi,

I think the exception is caused by the fact that you're trying to use the
latest version of Carrot2 with Solr 1.4.x. There are two alternative
solutions here:

* as described in http://wiki.apache.org/solr/ClusteringComponent,
invoke "ant get-libraries"
to get the compatible JAR files.

or

* use the latest version of Carrot2 with Solr 1.4.x by installing the
compatibility package, documentation is here:
http://download.carrot2.org/stable/manual/#section.solr

Cheers,

Staszek


On Tue, Jan 18, 2011 at 13:36, Isha Garg  wrote:

> Hi,
>Can anyone help me to solve the error:
> Class org.carrot2.util.pool.SoftUnboundedPool does not implement the
> requested interface org.carrot2.util.pool.IParameterizedPool
>at
> org.carrot2.core.PoolingProcessingComponentManager.<init>(PoolingProcessingComponentManager.java:77)
>at
> org.carrot2.core.PoolingProcessingComponentManager.<init>(PoolingProcessingComponentManager.java:62)
>at org.carrot2.core.ControllerFactory.create(ControllerFactory.java:158)
>at
> org.carrot2.core.ControllerFactory.createPooling(ControllerFactory.java:71)
>at
> org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.<init>(CarrotClusteringEngine.java:61)
>at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>at java.lang.Class.newInstance0(Class.java:355)
>at java.lang.Class.newInstance(Class.java:308)
>at
> org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:396)
>at
> org.apache.solr.handler.clustering.ClusteringComponent.inform(ClusteringComponent.java:121)
>at
> org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:486)
>at org.apache.solr.core.SolrCore.<init>(SolrCore.java:588)
>at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
>at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
>at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
>at
> org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
>at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
>at
> org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
>at
> org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
>at
> org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
>at
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
>at
> org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
>at
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
>at
> org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
>at org.mortbay.jetty.Server.doStart(Server.java:210)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
>at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at org.mortbay.start.Main.invokeMain(Main.java:183)
>at org.mortbay.start.Main.start(Main.java:497)
>at org.mortbay.start.Main.main(Main.java:115)
> 18 Jan, 2011 6:03:30 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.IncompatibleClassChangeError: Class
> org.carrot2.util.pool.SoftUnboundedPool does not implement the requested
> interface org.carrot2.util.pool.IParameterizedPool
>at
> org.carrot2.core.PoolingProcessingComponentManager.<init>(PoolingProcessingComponentManager.java:77)
>at
> org.carrot2.core.PoolingProcessingComponentManager.<init>(PoolingProcessingComponentManager.java:62)
>at org.carrot2.core.ControllerFactory.create(ControllerFactory.java:158)
>at
> org.carrot2.core.ControllerFactory.createPooling(ControllerFactory.java:71)
>at
> org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.<init>(CarrotClusteringEngine.java:61)
>at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>at
> sun.refl

Re: specifying the doc id in clustering component

2010-08-19 Thread Stanislaw Osinski
> The solr schema has the fields, id,  name and desc.
>
>  I would like to get docs:["name Field here" ] instead of the doc Id
> field as in
> "docs":["200066", "195650",
>

The idea behind using the document ids was that based on them you could
access the individual documents' content, including the other fields, right
from the "response" field. Using ids limits duplication in the response text
as a whole. Is it possible to use this approach in your application?
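For illustration, a client-side join could look roughly like this in SolrJ
(just a sketch: the exact container types of the "clusters" section can differ
between Solr versions, and the URL and field names are hypothetical):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.util.NamedList;

public class ClusterDocJoin {
  @SuppressWarnings("unchecked")
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr"); // hypothetical URL
    SolrQuery query = new SolrQuery("test");
    query.setRows(100);
    query.set("clustering", "true");
    query.set("clustering.results", "true");
    query.setFields("id", "name");             // make sure the fields you want come back
    QueryResponse rsp = server.query(query);

    // index the regular results by their id
    Map<String, SolrDocument> byId = new HashMap<String, SolrDocument>();
    for (SolrDocument doc : rsp.getResults()) {
      byId.put(String.valueOf(doc.getFieldValue("id")), doc);
    }

    // resolve each cluster's document ids against the response
    List<NamedList<Object>> clusters = (List<NamedList<Object>>) rsp.getResponse().get("clusters");
    for (NamedList<Object> cluster : clusters) {
      System.out.println("Cluster: " + cluster.get("labels"));
      for (Object id : (List<Object>) cluster.get("docs")) {
        SolrDocument doc = byId.get(String.valueOf(id));
        System.out.println("  " + (doc != null ? doc.getFieldValue("name") : id));
      }
    }
  }
}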

Staszek


Re: specifying the doc id in clustering component

2010-08-18 Thread Stanislaw Osinski
Hi Tommy,

 I'm using the clustering component with solr 1.4.
>
> The response is given by the id field in the doc array like:
>"labels":["Devices"],
>"docs":["200066",
> "195650",
> "204850",
> Is there a way to change the doc label to be another field?
>
> I couldn't find this option in http://wiki.apache.org/solr/ClusteringComponent


I'm not sure if I get you right. The "labels" field is generated by the
clustering engine, it's a description of the group (cluster) of documents.
The description is usually a phrase or a number of phrases. The "docs" field
lists the ids of documents that the algorithm assigned to the cluster.

Can you give an example of the input and output you'd expect?

Thanks!

Stanislaw


Re: clustering component

2010-07-28 Thread Stanislaw Osinski
> The patch should also work with trunk, but I haven't verified it yet.
>

I've just added a patch against solr trunk to
https://issues.apache.org/jira/browse/SOLR-1804.

S.


Re: clustering component

2010-07-27 Thread Stanislaw Osinski
Hi Matt,

I'm attempting to get the carrot based clustering component (in trunk) to
> work. I see that the clustering contrib has been disabled for the time
> being. Does anyone know if this will be re-enabled soon, or even better,
> know how I could get it working as it is?
>

I've recently created a patch to update the clustering algorithms in
branch_3x:

https://issues.apache.org/jira/browse/SOLR-1804

The patch should also work with trunk, but I haven't verified it yet.

S.


Re: Clustering results limit?

2010-07-22 Thread Stanislaw Osinski
Hi,

In my SolrJ, I used ModifiableSolrParams and I set ("rows",50) but it
> still returns less than 10 for each cluster.
>

Oh, the number of documents per cluster very much depends on the
characteristics of your documents; it often happens that the algorithms
create larger numbers of smaller clusters. However, all returned documents
should get assigned to some cluster(s), the Other Topics one in the worst
case. Does that hold in your case?
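If you'd like to check that quickly from SolrJ, a small helper along these
lines prints the per-cluster sizes and the overall coverage (a sketch,
assuming the usual "labels"/"docs" structure of the clusters section):

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;

public class ClusterCoverage {
  /** Prints cluster sizes and how many of the returned docs landed in at least one cluster. */
  @SuppressWarnings("unchecked")
  public static void report(QueryResponse rsp) {
    Set<Object> assigned = new HashSet<Object>();
    List<NamedList<Object>> clusters = (List<NamedList<Object>>) rsp.getResponse().get("clusters");
    for (NamedList<Object> cluster : clusters) {
      List<Object> docs = (List<Object>) cluster.get("docs");
      System.out.println(cluster.get("labels") + ": " + docs.size() + " docs");
      assigned.addAll(docs);
    }
    System.out.println("Distinct documents assigned: " + assigned.size()
        + " of " + rsp.getResults().size() + " returned");
  }
}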

If you'd like to tune clustering a bit, you can try Carrot2 tools:

http://download.carrot2.org/stable/manual/#section.getting-started.solr

and then:

http://download.carrot2.org/stable/manual/#chapter.tuning

Cheers,

S.


Re: Clustering results limit?

2010-07-22 Thread Stanislaw Osinski
Hi,

 I am attempting to cluster a query. It kinda works, but where my
> (regular) query returns 500 results the cluster only shows 1-10 hits for
> each cluster (5 clusters). Never more than 10 docs and I know its not
> right. What could be happening here? It should be showing dozens of
> documents per cluster.
>

Just to clarify -- how many documents do you see in the response (the <result> section)? Clustering is performed on the search results
(in real time), so if you request 10 results, clustering will apply only to
those 10 results. To get a larger number of clusters you'd need to request
more results, e.g. 50, 100, 200 etc. Obviously, the trade-off here is that
it will take longer to fetch the documents from the index, clustering time
will also increase. For some guidance on choosing the clustering algorithm,
you can take a look at the following section of Carrot2 manual:
http://download.carrot2.org/stable/manual/#section.advanced-topics.fine-tuning.choosing-algorithm
.

Cheers,

Staszek


[ANN] Carrot2 3.3.0 released

2010-04-19 Thread Stanislaw Osinski
Dear All,

We're pleased to announce the 3.3.0 release of Carrot2 which significantly
improves the scalability of the clustering algorithms (up to 7x faster
clustering in case of the STC algorithm) and fixes a number of minor issues.

Release notes:
http://project.carrot2.org/release-3.3.0-notes.html

Download:
http://download.carrot2.org

JIRA issues:
http://issues.carrot2.org/secure/IssueNavigator.jspa?jqlQuery=project+%3D+CARROT+AND+fixVersion+%3D+%223.3.0%22+ORDER+BY+priority+DESC%2C+key+DESC


Similar improvements are available in Lingo3G, the real-time document
clustering engine from Carrot Search.


Thanks!

Dawid Weiss, Stanislaw Osinski
Carrot Search, i...@carrot-search.com


Re: Clustering Search taking 4sec for 100 results

2010-03-05 Thread Stanislaw Osinski
Hi,

It might also be interesting to add some logging of clustering time (just
filed: https://issues.apache.org/jira/browse/SOLR-1809) to see what the
index search vs clustering proportions are.

Cheers,

S.

On Fri, Mar 5, 2010 at 03:26, Erick Erickson wrote:

> Search time is only partially dependent on the
> number of results returned. Far more important
> is the number of docs in the index, the
> complexity of the query, any sorting you do, etc.
>
> So your question isn't really very answerable,
> you need to provide many more details. Things
> like your index size, the machine you're operating
> on etc.
>
> Are you firing off warmup queries? Also, using
> debugQuery=on on your URL will provide
> significant timing output, that would help us
> diagnose your issues.
>
> HTH
> Erick
>
>
>
> On Thu, Mar 4, 2010 at 9:02 PM, Allahbaksh Asadullah <
> allahbaks...@gmail.com
> > wrote:
>
> > Hi,
> > I am using Solr for clustering. I have set the number of rows to 100 and I
> am
> > using the clustering handler. The problem is that I am getting a search
> time
> > for the clustering search of roughly 4 sec. I have set -Xmx1024m. What is the
> best
> > way to reduce the time.
> > Regards,
> > allahbaksh
> >
>


Re: Clustering from anlayzed text instead of raw input

2010-03-05 Thread Stanislaw Osinski
>  I'll give the stopwords treatment a try, but the problem is that we
> perform
> POS tagging and then use payloads to keep only nouns and adjectives, and we
> thought it could be interesting to perform clustering only with these
> elements, to avoid senseless words.
>

POS tagging could help a lot in clustering (not yet implemented in Carrot2
though), but ideally, we'd need to have POS tags attached to the original
tokenized text (so each token would be a tuple along the lines of: raw_text
+ stemmed + POS). If we have just nouns and adjectives, cluster labels will
be most likely harder to read (e.g. because of missing prepositions). I'm
not too familiar with Solr internals, but I'm assuming this type of
representation should be possible to implement using payloads? Then, we
could refactor Carrot2 a bit to work either on raw text or on the
tokenized/augmented representation.

Cheers,

S.


Re: Clustering from anlayzed text instead of raw input

2010-03-03 Thread Stanislaw Osinski
Hi Joan,

I'm trying to use  carrot2 (now I started with the workbench) and I can
> cluster any field, but, the text used for clustering is the original raw
> text, the one that was indexed, without any of the processing performed by
> the tokenizer or filters.
> So I get stop words.
>

The easiest way to fix this is to update the stop words list used by
Carrot2, see http://wiki.apache.org/solr/ClusteringComponent, "Tuning
Carrot2 clustering" section at the bottom. If you want to get readable
cluster labels, it's best to feed the raw text for clustering (cluster
labels are phrases taken from the input text, if you remove stopwords and
stem everything, the phrases will become unreadable).

Cheers,

Staszek


[ANN] Carrot2 3.2.0 released

2010-03-03 Thread Stanislaw Osinski
Dear All,

I'm happy to announce three releases from the Carrot Search team: Carrot2
v3.2.0, Lingo3G v1.3.1 and Carrot Search Labs.


Carrot2 is an open source search results clustering engine. Version v3.2.0
introduces:

* experimental support for clustering Korean and Arabic content,
* a command-line batch processing application,
* significant updates to the Flash-based cluster visualization.

As of version 3.2.0, Carrot2 is free of LGPL-licensed dependencies.

Release notes:
http://project.carrot2.org/release-3.2.0-notes.html

Download:
http://project.carrot2.org/download.html



Lingo3G is a real-time document clustering engine from Carrot Search.
Version 1.3.1 introduces support for clustering Arabic, Danish, Finnish,
Hungarian, Korean, Romanian, Swedish and Turkish content, a command-line
application and a number of minor improvements. Please contact us at
i...@carrotsearch.com for details.



Carrot Search Labs shares some small pieces of software we created when
working on Carrot2 and Lingo3G. Please see http://labs.carrotsearch.com for
details and downloads.



Thanks!

Dawid Weiss, Stanislaw Osinski
Carrot Search, i...@carrot-search.com


Re: Has anyone got Carrot2 working with Solr without using ant?

2010-01-02 Thread Stanislaw Osinski
> You need, in addition to the ones shipped:
> http://repo1.maven.org/maven2/colt/colt/1.2.0/colt-1.2.0.jar
> http://download.carrot2.org/maven2/org/carrot2/nni/1.0.0/nni-1.0.0.jar
>
> http://mirrors.ibiblio.org/pub/mirrors/maven2/org/simpleframework/simple-xml/1.7.3/simple-xml-1.7.3.jar
> http://repo1.maven.org/maven2/pcj/pcj/1.2/pcj-1.2.jar
>
> These all go in the contrib/clustering/lib/downloads directory.  From
> there, the example should just work.
>

A quick heads up: our development server is down for maintenance today, so
the temporary location of the NNI JAR is here:

http://www.carrot2.org/download/maven2/org/carrot2/nni/1.0.0/nni-1.0.0.jar

Apologies for the problem,

S.


[ANN] Carrot2 version 3.1.0 released

2009-09-29 Thread Stanislaw Osinski
Dear All,

[Apologies for cross-posting.]

This is just to let you know that we've released version 3.1.0 of Carrot2
Search Results Clustering Engine.

The 3.1.0 release comes with:

* Experimental support for clustering Chinese Simplified content (based on
Lucene's Smart Chinese Analyzer)
* Document Clustering Workbench usability improvements
* Suffix Tree Clustering algorithm rewritten for better performance and
clustering quality
* Apache Solr clustering plugin (to be available in Solr 1.4, Grant's blog
post:
http://www.lucidimagination.com/blog/2009/09/28/solrs-new-clustering-capabilities/
)

Release notes:
http://project.carrot2.org/release-3.1.0-notes.html

On-line demo:
http://search.carrot2.org

Download:
http://download.carrot2.org

Project website:
http://project.carrot2.org


Thanks,

Staszek

--
Stanislaw Osinski, http://carrot2.org


Re: SOLR-769 clustering

2009-09-08 Thread Stanislaw Osinski
Hi,

It seems like the problem can be on two layers: 1) getting the right
contents of stop* files for Carrot2, 2) making sure Solr picks up the
changes.

> I tried your quick and dirty hack too. It didn't work either. Phrases like
> "Carbon Atoms in the Group" with "in" still appear in my clustering labels.
>

Here most probably layer 1) applies: if you add "in" to stopwords, the Lingo
algorithm (Carrot2's default) will still create labels with "in" inside, but
will not create labels starting / ending in "in". If you'd like to eliminate
"in" completely, you'd need to put an appropriate regexp in stoplabels.*.

For more details, please see Carrot2 manual:

http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words
http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-regexps

The easiest way to tune the stopwords and see their impact on clusters is to
use Carrot2 Document Clustering Workbench (see
http://wiki.apache.org/solr/ClusteringComponent).


> What i did is,
>
> 1. use "java uf carrot2-mini.jar stoplabels.en" command to replace the
> stoplabel.en file.
> 2. apply clustering patch. re-complie the solr with the new
> carrot2-mini.jar.
> 3. deploy the new apache-solr-1.4-dev.war to tomcat.
>

Once you make sure the changes to stopwords.* and stoplabels.* have the
desired effect on clusters, the above procedure should do the trick. You can
also put the modified files in WEB-INF/classes of the WAR, if that's any
easier.

For your reference, I've updated
http://wiki.apache.org/solr/ClusteringComponent to contain a procedure
working with the Jetty starter distributed in Solr's examples folder.


> <searchComponent class="org.apache.solr.handler.clustering.ClusteringComponent"
>   name="clustering">
>   <lst name="engine">
>     <str name="name">default</str>
>     <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
>     <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
>     <str name="carrot.lingo.threshold.clusterAssignment">0.150</str>
>     <str name="carrot.lingo.threshold.candidateClusterThreshold">0.775</str>
>

Not really related to your issue, but the above file looks a little outdated
-- the two parameters: "carrot.lingo.threshold.clusterAssignment" and
"carrot.lingo.threshold.candidateClusterThreshold" are not there anymore
(but there are many others:
http://download.carrot2.org/stable/manual/#section.component.lingo). For
most up to date examples, please see
http://wiki.apache.org/solr/ClusteringComponent and solrconfig.xml in
contrib\clustering\example\conf.

Cheers,

Staszek


Re: SOLR-769 clustering

2009-09-08 Thread Stanislaw Osinski
Hi there,

I try to apply the stoplabels with the instructions that you given in the
> solr clustering Wiki. But it didn't work.
>
> I am running the patched Solr on Tomcat. So to enable the stop labels, I added
> "-cp <directory containing the stop label files>" to my system's CATALINA_OPTS. I also
> tried to change the file name from stoplabels.txt to stoplabel.en. It
> didn't work either.
>
> Then I also find out that in carrot manual page
> (
>
> http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words
> ).
> It suggested editing the stopwords files inside the carrot2-core.jar. I
> tried this but it didn't work either.
>
> I am not sure what is wrong with my set up. will it be caused by any sort
> of
> caching?
>

A quick and dirty hack would be to simply replace the corresponding files
(stoplabels.*) in carrot2-mini.jar.

I know the packaging of the clustering contrib has changed a bit, so let me
see how it currently works and correct the wiki if needed.

Thanks,

Staszek


Re: Solr 1.4 Clustering / mlt AS search?

2009-08-15 Thread Stanislaw Osinski
Hi,

On Thu, Aug 13, 2009 at 19:29, Mark Bennett  wrote:

There are comments in the Solr materials about having an option to cluster
> based on the entire document set, and some warning about this being
> atypical
> and possibly slow.  And from what you're saying, for a big enough docset,
> it
> might go from "slow" to "impossible", I'm not sure.


For Carrot2, it would go to "impossible" I'd say. But as Grant mentioned
earlier, Mahout is developing clustering algorithms that should be able to
handle the whole-index types of docsets.

And so my question was, *if* you were willing to spend that much time and
> effort to cluster all the text of all the documents (and if it were even
> possible), would the result perform better than the standard TF/IDF
> techniques?


Depends on the algorithm, really. In case of Carrot2, we don't do re-ranking
of documents within clusters, we simply use whatever document order we got
on input. As far as I'm aware, most clustering algorithms do pretty much the
same: they concentrate on finding groups of documents and don't delve much
into the issues of ranking documents within clusters.


> In the application I'm considering, the queries tend to be longer than
> average, more like full sentences or more.  And they tend to be of a
> question and answer nature.  I've seen references in several search engines
> that QandA search sometimes benefits from alternative search techniques.
> And, from a separate email, the IDF part of the standard similarity may be
> causing a problem, so I'm casting a wide net for other ideas.  Just
> brainstorming here... :-)


Because of what I described above, clustering the whole index may not give
you the best results. But you can try something different. You could try
fetching a bunch (100--500) of more or less relevant documents for the
question (MLT should be fine to start with), add your question as an extra
document, perform clustering and see where the question-document ends up. If
it doesn't end up in the Other Topics cluster, you could examine if the
other documents from the cluster give an answer to the question. In this
scenario, Carrot2 should be fine, at least performance-wise. I've not
followed the QA literature very closely, so it's hard to say what the
results would be quality-wise, but it should be very quick to try. Carrot2
Clustering Workbench [1][2] may come in handy for the experiments too.
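If you'd like to prototype this outside Solr first, a bare-bones sketch
against the Carrot2 Java API could look like the following (it assumes the
Carrot2 3.x Controller/Document classes; the question text is made up and the
titles/snippets would come from your MLT results):

import java.util.ArrayList;
import java.util.List;
import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;
import org.carrot2.core.Cluster;
import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.Document;
import org.carrot2.core.ProcessingResult;

public class QuestionClusteringSketch {
  public static void main(String[] args) {
    String questionText = "How do I tune clustering in Solr?";   // hypothetical question

    List<Document> documents = new ArrayList<Document>();
    // ... add the 100--500 candidate documents fetched via MLT here ...
    documents.add(new Document("Some result title", "Some result snippet..."));

    // add the question itself as one more "document"
    Document question = new Document("Question", questionText);
    documents.add(question);

    Controller controller = ControllerFactory.createSimple();
    ProcessingResult result = controller.process(documents, questionText,
        LingoClusteringAlgorithm.class);

    // see which cluster(s) the question ended up in
    for (Cluster cluster : result.getClusters()) {
      if (cluster.getAllDocuments().contains(question)) {
        System.out.println(cluster.getLabel() + " ("
            + cluster.getAllDocuments().size() + " documents)");
      }
    }
  }
}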

S.

[1] http://download.carrot2.org/head/manual/#section.workbench
[2]
http://download.carrot2.org/head/manual/#section.getting-started.xml-files


Re: Solr 1.4 Clustering / mlt AS search?

2009-08-13 Thread Stanislaw Osinski
Hi,

On Tue, Aug 11, 2009 at 22:19, Mark Bennett  wrote:

Carrot2 has several pluggable algorithms to choose from, though I have no
> evidence that they're "better" than Lucene's.  Where TF/IDF is sort of a
> one
> step algebraic calculation, some clustering algorithms use iterative
> approaches, etc.


I'm not sure if I completely follow the way in which you'd like to use
Carrot2 for scoring -- would you cluster the whole index? Carrot2 was
designed to be a post-retrieval clustering algorithm and optimized to
cluster small sets of documents (up to ~1000) in real time. All processing
is performed in-memory, which limits Carrot2's applicability to really large
sets of documents.

S.


Re: Faceting on text fields

2009-06-12 Thread Stanislaw Osinski
Hi,

Sorry for being late to the party, let me try to clear some doubts about
Carrot2.

Do you know under what circumstances or application should we cluster the
> whole corpus of documents vs just the search results?


I think it depends on what you're trying to achieve. If you'd like to give
the users some alternative way of exploring the search results by organizing
them into semantically related groups (search results clustering), Carrot2
would be the appropriate tool. Its algorithms are designed to work with
small input (up to ~1000 results) and try to provide meaningful labels for
each cluster. Currently, Carrot2 has two algorithms: an implementation of
Suffix Tree Clustering (STC, a classic in search results clustering
research, designed by O. Zamir, implemented by Dawid Weiss) and Lingo
(designed and implemented by myself). STC is very fast compared to Lingo,
but the latter will usually get you better clusters. Some comparison of the
algorithms is here: http://project.carrot2.org/algorithms.html, but
ultimately, I'd encourage you to experiment (e.g. using Clustering
Workbench). For best results, I'd recommend feeding the algorithms with
contextual snippets generated based on the user's query. If the summary
could consist of complete sentence(s) containing the query (as opposed to
individual words delimited by "..."), you should be getting even nicer
labels.
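To make that concrete, requesting query-contextual snippets could look like
this from SolrJ (a sketch: carrot.title/carrot.snippet usually go into the
handler defaults in solrconfig.xml instead, and the URL, query and field names
below are hypothetical):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ContextualSnippetClustering {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr"); // hypothetical URL
    SolrQuery query = new SolrQuery("laptop battery life");    // example user query
    query.setRows(100);                          // ~50-200 results is a reasonable range
    query.set("clustering", "true");
    query.set("clustering.results", "true");
    query.set("carrot.title", "title");          // hypothetical field names
    query.set("carrot.snippet", "body");
    query.set("carrot.produceSummary", "true");  // cluster query-contextual fragments, not whole fields
    QueryResponse rsp = server.query(query);
    System.out.println(rsp.getResponse().get("clusters"));
  }
}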

One important thing for search results clustering is that it is done
on-line, so it will add extra time to each search query your server handles.
Plus, to get reasonable clusters, you'd need to fetch at least 50 documents
from your index, which may put more load on the disks as well (sometimes
clustering time may only be a fraction of the time required to get the
documents from the index).

Finally, to compare search results clustering with facets: UI-wise they may
look similar, but I'd say they're two different things that complement each
other. While the list of facets and their values is fairly static (brand
names etc.), clusters are less "stable" -- they're generated dynamically for
each search and will vary across queries. Plus, as for any other
unsupervised machine learning technique, your clusters will never be 100%
correct (as opposed to facets). Almost always you'll be getting one or two
clusters that don't make much sense.

When it comes to clustering the whole collection, it might be useful in a
couple of scenarios: a) if you wanted to get some high level overview of
what's in your collection, b) if you'd wanted to e.g. use clusters to
re-rank the search results presented to the user (implicit clustering:
showing a few documents from each cluster), c) if you wanted to distribute
your index based on the semantics of the documents (wild guess, I'm not sure
if anyone tried that in practice). In general, I feel clustering the whole
index is much harder than search results clustering not only because of the
different scale, but also because you'd need to tune the algorithm for your
specific needs and data. For example, in scenario a) and a collection of 1M
documents: how many top-level clusters do you generate? 10? 1000? If it's
10, the clusters may end up too general / meaningless; it might be hard to
describe them concisely. If it's 1000, clusters are likely to be more
focused, but hard to browse... I must admit I haven't followed Mahout too
closely; maybe there is some nice way of resolving these problems.

If you have any other questions about Carrot2, I'll try to answer them here.
Alternatively, feel free to join Carrot2 mailing lists.

Thanks,

Staszek

--
http://www.carrot2.org


Re: questions about Clustering

2009-05-23 Thread Stanislaw Osinski
>
> Hmm, I saw the comment in ClusteringDocumentList.java of Carrot2:
>
> /*
> * If you know what query generated the documents you're about to cluster,
> pass
> * the query to the algorithm, which will usually increase clustering
> quality.
> */
> attributes.put(AttributeNames.QUERY, "data mining");
>
> So I'm worried about clustering quality when Carrot2 got string
> "MatchAllDocsQuery".


The query is just a hint, without the query you should still be able to get
decent clusters (at least for English, we've not tested Carrot2 much with
Japanese).

Cheers,

Staszek


Re: questions about Clustering

2009-05-23 Thread Stanislaw Osinski
>
> 1. if q=*:* is requested, Carrot2 will receive "MatchAllDocsQuery"
>> via attributes. Is it OK?
>>
>
> Yes, it only clusters on the Doc List, not the Doc Set (in other words,
> it's your rows that matter)


Just to add to that: Carrot2 should be able to cluster up to ~1000 search
results, but by design it won't be able to process significantly more
documents than that. The reason is that Carrot2 is a search results
clustering engine and performs all processing in-memory.

 2. I'd like to use it on an environment other than English, e.g. Japanese.
>> I've implemented Carrot2JapaneseAnalyzer (w/ Payload/ITokenType)
>> for this purpose.
>> It worked well with ClusteringDocumentList example, but didn't
>> work with CarrotClusteringEngine.
>>
>> What I did is that I inserted the following lines(+) to
>> CarrotClusteringEngine:
>>
>> attributes.put(AttributeNames.QUERY, query.toString());
>> + attributes.put(AttributeUtils.getKey(Tokenizer.class, "analyzer"),
>> + Carrot2JapaneseAnalyzer.class);
>>
>> There is no runtime errors, but Carrot2 didn't use my analyzer,
>> it just ignored and used ExtendedWhitespaceAnalyzer (confirmed via
>> debugger).
>> Is it classloader problem? I placed my jar in ${solr.solr.home}/lib .
>>
>
>
> Hmmm, I'm not sure if the Carrot guys are on this list (they are on dev).
>  Can you share a simple example on the JIRA issue and we can discuss there?


Yep, we're here too :-)

The catch with the analyzer is that this specific attribute is an
initialization-time attribute, so you need to add it to the initAttributes
map in the init() method of CarrotClusteringEngine.
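For reference, the change would be roughly the following -- just a sketch that
reuses the two lines from your snippet, moved next to the other entries of the
init-time attribute map that the engine passes to the Carrot2 controller:

// inside CarrotClusteringEngine.init(...), where the other init-time
// attributes are populated, rather than in the per-request cluster() call:
initAttributes.put(AttributeUtils.getKey(Tokenizer.class, "analyzer"),
    Carrot2JapaneseAnalyzer.class);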

Please let me know if this solves the problem. If not, I'll investigate
further.

Cheers,

Staszek


Re: clustering SOLR-769

2009-05-22 Thread Stanislaw Osinski
Hi there,


> Is it possible to specify more than one snippet field, or should I use copy
> field to copy two or three fields into a single field and specify it as the
> snippet field?


Currently, you can specify only one snippet field, so you'd need to use
copy.

Cheers,

S.


Re: clustering SOLR-769

2009-05-21 Thread Stanislaw Osinski
Hi.


> I built Solr from SVN this morning. I am using the clustering example, and I
> have added my own schema.xml.
>
> The problem is that even though I change the carrot.snippet field from
> features to filecontent, the clustering results are not changed a bit.
> Please note the features field is also there in my document.
>
>   <str name="carrot.title">name</str>
>   <str name="carrot.snippet">features</str>
>   <str name="carrot.url">id</str>
>
> Why do I get the same clusters even though I have changed
> carrot.snippet? Is there some problem with my understanding?


If you get back to the clustering dir in examples and change

<str name="carrot.snippet">features</str>

to

<str name="carrot.snippet">manu</str>

do you see any change in clusters?

Cheers,

Staszek

--
http://carrot2.org


Re: SOLR-769 clustering

2009-04-24 Thread Stanislaw Osinski
>
> How would we enable people via SOLR-769 to do this?


Good point, Grant! To apply the modified stopwords.* and stoplabels.* files
to Solr, simply make them available in the classpath. For the example Solr
runner scripts that would be something like:

java -cp <directory with the modified stopwords.* and stoplabels.*>
-Dsolr.solr.home=./clustering/solr -jar start.jar

I've documented the whole tuning procedure on the Wiki:

http://wiki.apache.org/solr/ClusteringComponent

Cheers,

S.


Re: SOLR-769 clustering

2009-04-22 Thread Stanislaw Osinski
Hi Antonio,


> To answer your question about the minimum term: I am working with
> "joke text" that is very short in length, so the clusters are not so meaningful. I
> mean a lot of adverbs and nouns. I thought increasing it might give me fewer
> clusters but a bit more meaningful ones (maybe not).


Clustering this type of content (jokes, blogs) is tricky for Carrot2
algorithms, mostly because such input contains relatively few
"informative" words (nouns, noun phrases) which are good for cluster labels,
and more narrative ones (verbs, adjectives), which usually don't lead to
meaningful labels / clusters.

So I think the way to go would be to tune the clustering algorithm's stop
words / stop label dictionaries to exclude the labels you don't like. I
can't guarantee you can get decent clusters with this technique, but it's
worth giving it a try.

Here's how to do that:

1. Download Carrot2 Clustering Workbench from:
http://project.carrot2.org/download.html
2. Attach your Solr instance as a document source:
http://download.carrot2.org/head/manual/#section.getting-started.solr
3. Try tuning the stop words / labels to get more meaningful labels:
http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning

For more advice you may want to post your questions on Carrot2 forum:
http://project.carrot2.org/forum.html.

Hope that helps.

Cheers,

Staszek


Re: SOLR-769 clustering

2009-04-21 Thread Stanislaw Osinski
Hi Antonio,

- is there anyway to have minimum number of labels per cluster?


The current search results clustering algorithms (from Carrot2) by design
generate one label per cluster, so there is no way to force them to create
more. What is the reason you'd like to have more labels per cluster?

I'd leave the other two Solr-related questions to answer by a more competent
person (Grant?).

Cheers,

Staszek


Re: solr + carrot2

2007-08-27 Thread Stanislaw Osinski
Hi Lance and all,

I've just implemented a configuration UI for Solr, similar to the one we
have for Lucene. The new UI is available in the HEAD version of the browser:

http://demo.carrot2.org/head/dist/carrot2-demo-browser-head.zip

or through WebStart:

http://demo.carrot2.org/head/webstart/

Please let us know if the new UI works for you.

Thanks,

Staszek

On 20/08/07, Lance Norskog <[EMAIL PROTECTED]> wrote:
>
> Exactly! The Lucene version requires direct access to the file. Our
> indexes
> are on servers which do not have graphics (VNC) configured.
>
> A generic Solr access UI would be great.
>
> Lance
>
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Stanislaw
> Osinski
> Sent: Saturday, August 18, 2007 2:23 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr + carrot2
>
> Hi Lance,
>
> The Lucene interface is cool, but not many people put their indexes on
> > machines with Swing access.
> >
> > I just did a Solr integration by copying the eTools.ch implementation.
> > This
> > took several edits. As long as we're making requests, please do a
> > general-pupose implementation by cloning the Lucene implementation.
>
>
> I'm not sure if I'm getting you right here... By "implementation" do you
> mean adding to the Swing application an option for pulling data from Solr
> (with a configuration dialog for Solr URL etc.)?
>
> Thanks,
>
> Stanislaw
>
>


Re: solr + carrot2

2007-08-18 Thread Stanislaw Osinski
Hi Lance,

The Lucene interface is cool, but not many people put their indexes on
> machines with Swing access.
>
> I just did a Solr integration by copying the eTools.ch implementation.
> This
> took several edits. As long as we're making requests, please do a
> general-pupose implementation by cloning the Lucene implementation.


I'm not sure if I'm getting you right here... By "implementation" do you
mean adding to the Swing application an option for pulling data from Solr
(with a configuration dialog for Solr URL etc.)?

Thanks,

Stanislaw


Re: solr + carrot2

2007-08-17 Thread Stanislaw Osinski
Hi All,

I've just filed an issue for us related to this:

http://issues.carrot2.org/browse/CARROT-106

I'll try to find some spare cycles to look into it, hopefully in some not
too distant future. Meanwhile, feel free to post your thoughts and concerns
on this either here or on our JIRA.

Thanks,

Stanislaw

-- 
Stanislaw Osinski, [EMAIL PROTECTED]
http://www.carrot-search.com

On 17/08/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
>
> Any updates on this?  It certainly would be quite interesting to see how
> well carrot2 clustering can be integrated with solr, I suppose it's a
> fairly
> similar concept to simple faceting (maybe another candidate for SOLR-281
> component?).
>
> One concern I have is that the additional processing required at query
> time
> would make the whole operation significantly slower (which is something I'd
> like to avoid).  I've been wondering if it might be possible to calculate
> (and store) clustering information at index time;
> however, since carrot2 seems to use the query term & result set to create
> clustering info, this doesn't appear to be a practical approach.
>
> In a similar vein, I'm also looking at methods of term extraction and
> automatic keyword generation from indexed documents.  I've been
> experimenting with MoreLikeThis and values returned by the "
> mlt.interestingTerms" parameter, which has potential but needs a bit of
> refinement before it can be truely useful.  Has anybody else discovered
> clever or useful methods of term extraction using solr?
>
> Piete
>
>
>
> On 02/08/07, Burkamp, Christian <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > In my opinion the results from carrot2 clustering could be used in the
> > same way that facet results are used.
> > That's the way I'm planning to use them.
> > The user of the search application can narrow the search by selecting
> one
> > of the facets presented in the search result presentation. These facets
> > could come from metadata (classic facets) or from dynamically computed
> > categories which are results from carrot2.
> >
> > From this point of view it would be most convenient to have the
> > integration for carrot2 directly in the StandardRequestHandler. This
> leaves
> > questions open like "how should filters for categories from carrot2 be
> > formulated".
> >
> > Is anybody already using carrot2 with solr?
> >
> > -- Christian
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto: [EMAIL PROTECTED] On Behalf Of
> > Stanislaw Osinski
> > Sent: Wednesday, 1 August 2007 14:01
> > To: solr-user@lucene.apache.org
> > Subject: Re: solr + carrot2
> >
> >
> > >
> > > Has anyone looked into using carrot2 clustering with solr?
> > >
> > > I know this is integrated with nutch:
> > >
> > > http://lucene.apache.org/nutch/apidocs/org/apache/nutch/clustering/car
> > > rot2/Clusterer.html
> > >
> > > It looks like carrot has support to read results from a solr index:
> > >
> > > http://demo.carrot2.org/head/api/org/carrot2/input/solr/package-summar
> > > y.html
> > >
> > > But I'm hoping for something that returns clustered results from solr.
> > >
> > > Carrot also has something to read lucene indexes:
> > >
> > > http://demo.carrot2.org/head/api/org/carrot2/input/lucene/package-summ
> > > ary.html
> > >
> > > Any pointers or experience before I (may) delve into this?
> > >
> >
> > First of all, apologies for a delayed response. I'm one of Carrot2
> > developers and indeed we did some Solr integration, but from Carrot2's
> > perspective, which I guess will not be directly useful in this case. If
> you
> > have any ideas for integration, questions or requests for
> changes/patches,
> > feel free to post on Carrot2 mailing list or file an issue for us.
> >
> > Thanks,
> >
> > Staszek
> >
>


[release announcement] Carrot2 version 2.1 released

2007-08-13 Thread Stanislaw Osinski
Hi All,

A bit of self-promotion again :) I hope you don't find it off topic;
after all, some folks are using Carrot2 with Lucene and Solr, and Nutch has
a Carrot2-based clustering plugin.

Staszek
[EMAIL PROTECTED]



Carrot2 Search Results Clustering Engine version 2.1 released

Version 2.1 of the Java-based Open Source Search Results Clustering Engine
called Carrot2 has been released. Carrot2 can fetch search results from a
variety of sources and automatically organize (cluster) them into thematic
categories using one of its specialized search results clustering
algorithms.

The 2.1 release comes with the Document Clustering Server that exposes
Carrot2 clustering as an XML-RPC or REST service with convenient XML or JSON
data formats enabling e.g. quick PHP, .NET or Ruby integration. The new
release also adds new search results sources and many other improvements (
http://project.carrot2.org/release-2.1-notes.html).

At the same time Carrot Search, the Carrot2 spin-off company, released
version 1.2 of Lingo3G -- a high-performance document clustering engine
offering hierarchical clustering, synonyms, label filtering and advanced
tuning capabilities.

For more information, please check

Carrot2 live demo -- http://www.carrot2.org
Carrot2 project website -- http://project.carrot2.org
Release 2.1 notes -- http://project.carrot2.org/release-2.1-notes.html

Carrot Search -- http://www.carrot-search.com


Re: solr + carrot2

2007-08-01 Thread Stanislaw Osinski
>
> Has anyone looked into using carrot2 clustering with solr?
>
> I know this is integrated with nutch:
>
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/clustering/carrot2/Clusterer.html
>
> It looks like carrot has support to read results from a solr index:
>
> http://demo.carrot2.org/head/api/org/carrot2/input/solr/package-summary.html
>
> But I'm hoping for something that returns clustered results from solr.
>
> Carrot also has something to read lucene indexes:
>
> http://demo.carrot2.org/head/api/org/carrot2/input/lucene/package-summary.html
>
> Any pointers or experience before I (may) delve into this?
>

First of all, apologies for a delayed response. I'm one of Carrot2
developers and indeed we did some Solr integration, but from Carrot2's
perspective, which I guess will not be directly useful in this case. If you
have any ideas for integration, questions or requests for changes/patches,
feel free to post on Carrot2 mailing list or file an issue for us.

Thanks,

Staszek