Re: Faceting on text fields

2009-06-04 Thread Yonik Seeley
Are you using Solr 1.3?
You might want to try the latest 1.4 test build - faceting has changed a lot.

-Yonik
http://www.lucidimagination.com

On Thu, Jun 4, 2009 at 12:01 PM, Yao Ge  wrote:
>
> I am indexing a database with over 1 million rows. Two of the fields contain
> unstructured text, but the size of each field is limited (256 characters).
>
> I came up with an idea to visualize the text fields as a text cloud by
> turning the two text fields into facets. The font weight and size of
> each facet value (word) are derived from the facet counts. I used a simpler
> field type so that no stemming is applied to these facet values:
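>   [fieldType definition; the XML element names were lost in the archive,
>   only the following attributes survive]
>     positionIncrementGap="100"
>     ignoreCase="true" expand="false"
>     words="stopwords.txt"
>     generateWordParts="0" generateNumberParts="0" catenateWords="1"
>     catenateNumbers="1" catenateAll="0"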
>
> The facet query is considerably slower than other facets from
> structured database fields (with highly repeated values). What I found
> interesting is that even after I constrained the search results to just a few
> hundred hits using other facets, these text facets are still very slow.
>
> I understand that text fields are not good candidates for faceting, as they
> can contain a very large number of unique values. However, why is it still
> slow after my matching documents are reduced to hundreds? Is it because the
> whole filter is cached (regardless of the matching docs) and I don't have
> enough filter cache size to fit the whole list?
>
> The following is my filterCache setting:
>      autowarmCount="128"/>
>
> Lastly, what I really want is to give users a chance to visualize and
> filter on the top relevant words in the free-text fields. Are there
> alternatives to the facet field approach? Term vectors? I can do client-side
> processing based on the top N (say 100) hits for this, but it is my last option.
> --
> View this message in context: 
> http://www.nabble.com/Faceting-on-text-fields-tp23872891p23872891.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
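
The text cloud idea in the quoted message (font size derived from facet counts) can be sketched as follows. This is only an illustration, not code from the thread: the function name font_sizes, the pixel range, and the log scaling are all assumptions. Log scaling is a common choice because word counts are usually heavily skewed.

```python
import math


def font_sizes(facet_counts, min_px=10, max_px=36):
    """Map facet counts (term -> count) to font sizes in pixels on a log scale.

    Hypothetical helper: the name, the pixel range, and the log scaling are
    illustrative assumptions, not anything prescribed by Solr.
    """
    # Ignore zero or negative counts; log() is undefined for them.
    logs = {term: math.log(count) for term, count in facet_counts.items() if count > 0}
    if not logs:
        return {}
    lo, hi = min(logs.values()), max(logs.values())
    span = hi - lo
    sizes = {}
    for term, value in logs.items():
        # If every count is equal, use the midpoint instead of dividing by zero.
        ratio = 0.5 if span == 0 else (value - lo) / span
        sizes[term] = round(min_px + ratio * (max_px - min_px))
    return sizes
```

The input would be the term/count pairs returned for the facet field; the least frequent term gets min_px and the most frequent gets max_px.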


Re: Faceting on text fields

2009-06-04 Thread Yao Ge

Yes. I am using 1.3. When is 1.4 due for release?






Re: Faceting on text fields

2009-06-09 Thread Michael Ludwig

Yao Ge wrote:

> The facet query is considerably slower than other facets from
> structured database fields (with highly repeated values). What I found
> interesting is that even after I constrained the search results to just
> a few hundred hits using other facets, these text facets are still very
> slow.
>
> I understand that text fields are not good candidates for faceting, as
> they can contain a very large number of unique values. However, why is it
> still slow after my matching documents are reduced to hundreds? Is it
> because the whole filter is cached (regardless of the matching docs) and
> I don't have enough filter cache size to fit the whole list?


Very interesting questions! I think an answer would both require and
further an understanding of how filters work, which might even lead to
a more general guideline on when and how to use filters and facets.

Even though faceting appears to have changed in 1.4 vs 1.3, it would
still be interesting to understand the 1.3 side of things.


> Lastly, what I really want is to give users a chance to visualize
> and filter on the top relevant words in the free-text fields. Are there
> alternatives to the facet field approach? Term vectors? I can do
> client-side processing based on the top N (say 100) hits for this, but it
> is my last option.


Also a very interesting data mining question! I'm sorry I don't have any
answers for you. Maybe someone else does.

Best,

Michael Ludwig


Re: Faceting on text fields

2009-06-09 Thread Michael Ludwig

Yonik Seeley wrote:

> Are you using Solr 1.3?
> You might want to try the latest 1.4 test build -
> faceting has changed a lot.


I found two significant changes (but there may well be more):

[#SOLR-911] multi-select facets - ASF JIRA
https://issues.apache.org/jira/browse/SOLR-911

Yao,

it sounds like the following (which is in 1.4) might have a chance of
helping your faceting performance issue:

[#SOLR-475] multi-valued faceting via un-inverted field - ASF JIRA
https://issues.apache.org/jira/browse/SOLR-475

Yonik,

from your initial comment for SOLR-475:

| * To save space and speed up faceting, any term that matches enough
| * documents will not be un-inverted... it will be skipped while
| * building the un-inverted field structure, and will use a set
| * intersection method during faceting.

Does this mean that frequently occurring terms (which we can use for
faceting in 1.3 without problems) are handled exactly as they were
before, by allocating a slot in the filter cache upon request, while
those zillions of pesky little fringe terms outside the mainstream,
for which allocating a slot in the filter cache would be overkill
(and possibly cause inefficient contention, eviction, and, hence,
a performance penalty) are now handled by the new structure mapping
documents to term numbers?

So doing faceting for a given set of documents would result in (a) doing
set intersection using those filter query results that have been set up
(for the terms occurring in many documents), and (b) collecting all the
pesky little terms from the new structure mapping documents to term
numbers?

So basically, depending on expediency, you (a) know the facets and count
the documents which display them, or you (b) take the documents and see
what facets they have?

Michael Ludwig
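
A minimal sketch of the two counting strategies described in the message above, assuming a toy in-memory index. It is not Solr's implementation: the names build_uninverted, facet_counts and max_df are hypothetical, and plain Python sets stand in for cached filters.

```python
from collections import Counter


def build_uninverted(postings, max_df):
    """Split terms into 'big' terms (kept as doc-id sets) and an
    un-inverted doc -> terms map for everything else.

    postings: dict mapping term -> set of doc ids (toy inverted index).
    max_df:   hypothetical threshold; terms matching more docs are not un-inverted.
    """
    big_terms = {}
    doc_to_terms = {}
    for term, docs in postings.items():
        if len(docs) > max_df:
            big_terms[term] = set(docs)
        else:
            for doc in docs:
                doc_to_terms.setdefault(doc, []).append(term)
    return big_terms, doc_to_terms


def facet_counts(base_docs, big_terms, doc_to_terms):
    """Count facet values for the documents in base_docs."""
    counts = Counter()
    # (a) Frequent terms: one set intersection per term.
    for term, docs in big_terms.items():
        hits = len(base_docs & docs)
        if hits:
            counts[term] = hits
    # (b) Rare terms: walk the matching documents and collect their terms.
    for doc in base_docs:
        counts.update(doc_to_terms.get(doc, ()))
    return counts
```

In this sketch, step (b) costs time proportional to the number of matching documents. A scheme that uses only step (a) performs one intersection per unique term no matter how few documents match, which is consistent with text facets staying slow on a result set of a few hundred hits.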


Re: Faceting on text fields

2009-06-09 Thread Yonik Seeley
Yep, all that sounds right.
An additional optimization counts terms for the documents *not* in the
set when the base set is over half the size of the index.

-Yonik
http://www.lucidimagination.com
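
The complement optimization mentioned above can be sketched like this, again on a toy doc -> terms map and not as Solr's actual code. The names count_terms and facet_counts_with_complement are hypothetical, and the sketch assumes each term is listed at most once per document and that index-wide totals are available.

```python
from collections import Counter


def count_terms(docs, doc_to_terms):
    """Count how many of the given docs contain each term."""
    counts = Counter()
    for doc in docs:
        counts.update(doc_to_terms.get(doc, ()))
    return counts


def facet_counts_with_complement(base_docs, all_docs, doc_to_terms, total_counts):
    """Count terms for base_docs, walking the smaller of the set and its complement.

    total_counts: precomputed term -> count over the whole index (assumed available).
    """
    if len(base_docs) * 2 > len(all_docs):
        # Base set is over half the index: count the docs NOT in the set
        # and subtract from the index-wide totals.
        missing = count_terms(all_docs - base_docs, doc_to_terms)
        result = Counter()
        for term, total in total_counts.items():
            remaining = total - missing[term]
            if remaining > 0:
                result[term] = remaining
        return result
    return count_terms(base_docs, doc_to_terms)
```

Both branches return the same counts; the first one simply visits fewer documents when the base set is large.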




Re: Faceting on text fields

2009-06-09 Thread Yao Ge

Michael,

Thanks for the update! I definitely need to get a 1.4 build to see if it makes
a difference.

BTW, maybe instead of using faceting for text
mining/clustering/visualization purposes, we could build a separate feature in
Solr for this. Many of the commercial search engines I have experience with
(Google Search Appliance, Vivisimo, etc.) provide dynamic term clustering
based on the top N ranked documents (N is a configurable parameter). When
the facet field is highly fragmented (say, a text field), the existing set
intersection based approach might no longer be optimal. Aggregating term
vectors over the top N docs might be more attractive. Another feature I would
really appreciate is search-time n-gram term clustering. Maybe
this would be better suited to the "spell checker", as it is just a different
way to display the alternative search terms.

-Yao
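
The "aggregate term vectors over the top N docs" idea above could look roughly like this on the client side. It is only a sketch under stated assumptions: term_vectors is a hypothetical dict of doc id -> {term: frequency} that the client has already fetched, and top_terms is an illustrative name, not a Solr or SolrJ call.

```python
from collections import Counter


def top_terms(ranked_docs, term_vectors, n_docs=100, n_terms=20, stopwords=frozenset()):
    """Return the terms that occur in the most documents among the top n_docs hits.

    ranked_docs:  doc ids in relevance order.
    term_vectors: hypothetical dict doc id -> {term: term frequency}.
    """
    doc_freq = Counter()
    for doc_id in ranked_docs[:n_docs]:
        # Iterate over the dict keys so each term counts once per document.
        terms = term_vectors.get(doc_id, {})
        doc_freq.update(term for term in terms if term not in stopwords)
    return doc_freq.most_common(n_terms)
```

The cost depends only on N and the length of the documents, not on the number of unique terms in the index, which is why this approach may suit highly fragmented text fields.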






Re: Faceting on text fields

2009-06-09 Thread Otis Gospodnetic

Yao,

Solr can already cluster top N hits using Carrot2:
http://wiki.apache.org/solr/ClusteringComponent

I've also done ugly "manual counting" of terms in top N hits.  For example, 
look at the right side of this:
http://www.simpy.com/user/otis/tag/%22machine+learning%22

Something like http://www.sematext.com/product-key-phrase-extractor.html could 
also be used.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






Re: Faceting on text fields

2009-06-10 Thread Michael Ludwig

Yonik Seeley wrote:

> Yep, all that sounds right.
> An additional optimization counts terms for the documents *not* in the
> set when the base set is over half the size of the index.


Cool :-) Thanks for confirming my assumptions!

Michael Ludwig


Re: Faceting on text fields

2009-06-10 Thread Michael Ludwig

Otis Gospodnetic wrote:

> Solr can already cluster top N hits using Carrot2:
> http://wiki.apache.org/solr/ClusteringComponent


Would it be fair to say that clustering as detailed on the page you're
referring to is a kind of dynamic faceting? The faceting not being done
based on distinct values of certain fields, but on the presence (and
frequency) of terms in one field?

The main difference seems to be that with faceting, grouping criteria
(facets) are known beforehand, while with clustering, grouping criteria
(the significant terms which create clusters - the cluster keys) have
yet to be determined. Is that a correct assessment?

Michael Ludwig


Re: Faceting on text fields

2009-06-10 Thread Yao Ge

Thanks for the insight, Otis. I was not aware of the ClusteringComponent until
now. It is time to move to Solr 1.4.

-Yao





Re: Faceting on text fields

2009-06-10 Thread Otis Gospodnetic

I'd call it related (their application in search encourages exploration), but 
also distinct enough to never mix them up.  I think your assessment below is 
correct, although I'm not familiar with the details of Carrot2 any more (was 
once), so I can't tell you exactly which algo is used under the hood.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Michael Ludwig 
> To: solr-user@lucene.apache.org
> Sent: Wednesday, June 10, 2009 9:41:54 AM
> Subject: Re: Faceting on text fields
> 
> Otis Gospodnetic wrote:
> >
> > Solr can already cluster top N hits using Carrot2:
> > http://wiki.apache.org/solr/ClusteringComponent
> 
> Would it be fair to say that clustering as detailed on the page you're
> referring to is a kind of dynamic faceting? The faceting not being done
> based on distinct values of certain fields, but on the presence (and
> frequency) of terms in one field?
> 
> The main difference seems to be that with faceting, grouping criteria
> (facets) are known beforehand, while with clustering, grouping criteria
> (the significant terms which create clusters - the cluster keys) have
> yet to be determined. Is that a correct assessment?
> 
> Michael Ludwig



Re: Faceting on text fields

2009-06-11 Thread Yao Ge

FYI, I did a direct integration of Carrot2 with SolrJ, using a separate Ajax
call from the UI to cluster the terms in the two text fields for the top 100
hits. Its response time is comparable to that of the other facets.

In terms of algorithms, they list two, "Lingo" and "STC", neither of which I
recognize. But I think at least one of them might use SVD
(http://en.wikipedia.org/wiki/Singular_value_decomposition).

-Yao






Re: Faceting on text fields

2009-06-11 Thread Yao Ge

BTW, Carrot2 has a very impressive Clustering Workbench (based on Eclipse)
that has built-in integration with Solr. If you have a Solr service running,
it is just a matter of pointing the workbench at it. The clustering results
and visualization are amazing (http://project.carrot2.org/download.html).






Re: Faceting on text fields

2009-06-11 Thread Michael Ludwig

Yao Ge wrote:

> BTW, Carrot2 has a very impressive Clustering Workbench (based on
> Eclipse) that has built-in integration with Solr. If you have a Solr
> service running, it is just a matter of pointing the workbench at it.
> The clustering results and visualization are amazing
> (http://project.carrot2.org/download.html).


A new world opens up for me ...

Thanks for pointing out how cool this is!

Hint for other newcomers: Open the View Menu to configure the details of
how you perform your search, e.g. your Solr URL in case it differs from
the default, or your "summary field", which is what gets used to analyze
the data in order to determine clusters, if I understand correctly.

Michael Ludwig


Re: Faceting on text fields

2009-06-11 Thread Jeffrey Tiong
Hi all,

We are thinking of using Carrot2 clustering too. But we saw that Carrot2
may only be able to cluster up to 1000 search snippets. Does anyone know how
we can cluster many more snippets than that (maybe in the million
range)?

And what is the difference between Mahout and Carrot2?

Thanks!

Jeffrey



Re: Faceting on text fields

2009-06-11 Thread Otis Gospodnetic

Jeffrey,

Are you looking to cluster a whole corpus of documents or just the search
results?  If it's the latter, use Carrot2.  If it's the former, look at Mahout.
Clustering the top 1M matching documents doesn't really make sense.  Usually
the top 100-200 is sufficient.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






Re: Faceting on text fields

2009-06-11 Thread Jeffrey Tiong
Thanks Otis!

Do you know under what circumstances, or for which applications, we should
cluster the whole corpus of documents vs. just the search results?

Jeffrey



Re: Faceting on text fields

2009-06-12 Thread Stanislaw Osinski
Hi,

Sorry for being late to the party; let me try to clear up some doubts about
Carrot2.

> Do you know under what circumstances, or for which applications, we should
> cluster the whole corpus of documents vs. just the search results?


I think it depends on what you're trying to achieve. If you'd like to give
the users some alternative way of exploring the search results by organizing
them into semantically related groups (search results clustering), Carrot2
would be the appropriate tool. Its algorithms are designed to work with
small input (up to ~1000 results) and try to provide meaningful labels for
each cluster. Currently, Carrot2 has two algorithms: an implementation of
Suffix Tree Clustering (STC, a classic in search results clustering
research, designed by O. Zamir, implemented by Dawid Weiss) and Lingo
(designed and implemented by myself). STC is very fast compared to Lingo,
but the latter will usually get you better clusters. Some comparison of the
algorithms is here: http://project.carrot2.org/algorithms.html, but
ultimately, I'd encourage you to experiment (e.g. using Clustering
Workbench). For best results, I'd recommend feeding the algorithms with
contextual snippets generated based on the user's query. If the summary
could consist of complete sentence(s) containing the query (as opposed to
individual words delimited by "..."), you should be getting even nicer
labels.
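
As a rough illustration of the snippet advice above, here is one way to build a summary out of complete sentences that contain the query rather than word fragments. It is a naive sketch, not part of Carrot2 or Solr: sentence_snippet is a hypothetical helper, and the regular-expression sentence splitter is a simplifying assumption that will mishandle abbreviations.

```python
import re


def sentence_snippet(text, query, max_sentences=2):
    """Return up to max_sentences complete sentences that contain the query.

    Hypothetical helper: splits on ., ! or ? followed by whitespace and
    matches the query as a case-insensitive substring.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    needle = query.lower()
    hits = [s for s in sentences if needle in s.lower()]
    return " ".join(hits[:max_sentences])
```

The resulting string would then be passed to the clustering algorithm as the document summary in place of a "..."-delimited fragment.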

One important thing for search results clustering is that it is done
on-line, so it will add extra time to each search query your server handles.
Plus, to get reasonable clusters, you'd need to fetch at least 50 documents
from your index, which may put more load on the disks as well (sometimes
clustering time may be only a fraction of the time required to get the
documents from the index).

Finally, to compare search results clustering with facets: UI-wise they may
look similar, but I'd say they're two different things that complement each
other. While the list of facets and their values is fairly static (brand
names etc.), clusters are less "stable" -- they're generated dynamically for
each search and will vary across queries. Plus, as for any other
unsupervised machine learning technique, your clusters will never be 100%
correct (as opposed to facets). Almost always you'll be getting one or two
clusters that don't make much sense.

When it comes to clustering the whole collection, it might be useful in a
couple of scenarios: a) if you wanted to get some high level overview of
what's in your collection, b) if you wanted to e.g. use clusters to
re-rank the search results presented to the user (implicit clustering:
showing a few documents from each cluster), c) if you wanted to distribute
your index based on the semantics of the documents (wild guess, I'm not sure
if anyone tried that in practice). In general, I feel clustering the whole
index is much harder than search results clustering not only because of the
different scale, but also because you'd need to tune the algorithm for your
specific needs and data. For example, in scenario a) and a collection of 1M
documents: how many top level clusters do you generate? 10? 1? If it's
10, the clusters may end up too general / meaningless, it might be hard to
describe them concisely. If it's 1, clusters are likely to be more
focused, but hard to browse... I must admit I haven't followed Mahout too
closely, maybe there is some nice way of resolving these problems.
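
Scenario b) above (showing a few documents from each cluster) can be sketched as a round-robin interleave. This assumes cluster assignments already exist from some offline clustering run; interleave_clusters and doc_cluster are hypothetical names, and the round-robin policy is just one possible way to do the re-ranking.

```python
from itertools import zip_longest


def interleave_clusters(ranked_docs, doc_cluster):
    """Re-rank results so that consecutive hits come from different clusters.

    ranked_docs: doc ids in relevance order.
    doc_cluster: hypothetical dict doc id -> cluster id (precomputed offline).
                 Docs without an assignment share the cluster id None.
    """
    # Group docs by cluster, preserving relevance order inside each cluster.
    # Clusters are ordered by their best-ranked member (dicts keep insertion order).
    buckets = {}
    for doc in ranked_docs:
        buckets.setdefault(doc_cluster.get(doc), []).append(doc)
    # Take the 1st doc of every cluster, then the 2nd of every cluster, and so on.
    # This assumes no doc id is itself None, since None marks exhausted clusters.
    result = []
    for row in zip_longest(*buckets.values()):
        result.extend(doc for doc in row if doc is not None)
    return result
```

For a ranked list a, b, c, d, e with a, b, d in one cluster and c and e in clusters of their own, the interleaved order is a, c, e, b, d.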

If you have any other questions about Carrot2, I'll try to answer them here.
Alternatively, feel free to join Carrot2 mailing lists.

Thanks,

Staszek

--
http://www.carrot2.org