Thanks for the insight, Otis. I was not aware of ClusteringComponent until
now. It is time to move to Solr 1.4.

-Yao

Otis Gospodnetic wrote:
> 
> 
> Yao,
> 
> Solr can already cluster top N hits using Carrot2:
> http://wiki.apache.org/solr/ClusteringComponent
> 
> I've also done ugly "manual counting" of terms in top N hits.  For
> example, look at the right side of this:
> http://www.simpy.com/user/otis/tag/%22machine+learning%22
> 
> Something like http://www.sematext.com/product-key-phrase-extractor.html
> could also be used.
> 
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> ----- Original Message ----
>> From: Yao Ge <yao...@gmail.com>
>> To: solr-user@lucene.apache.org
>> Sent: Tuesday, June 9, 2009 3:46:13 PM
>> Subject: Re: Faceting on text fields
>> 
>> 
>> Michael,
>> 
>> Thanks for the update! I definitely need to get a 1.4 build to see if it
>> makes a difference.
>> 
>> BTW, maybe instead of using faceting for text
>> mining/clustering/visualization purposes, we can build a separate feature
>> in Solr for this. Many of the commercial search engines I have experience
>> with (Google Search Appliance, Vivisimo, etc.) provide dynamic term
>> clustering based on the top N ranked documents (N is a configurable
>> parameter). When the facet field is highly fragmented (say, a text field),
>> the existing set-intersection-based approach might no longer be optimal.
>> Aggregating term vectors over the top N docs might be more attractive.
>> Another feature I would really appreciate is search-time n-gram term
>> clustering. Maybe this would be better suited to the "spell checker", as
>> it is just a different way to display alternative search terms.
>> 
>> -Yao
>> 
>> 
>> Michael Ludwig-4 wrote:
>> > 
>> > Yao Ge schrieb:
>> > 
>> >> The facet query is considerably slower compared to other facets from
>> >> structured database fields (with highly repeated values). What I found
>> >> interesting is that even after I constrained the search results to just a
>> >> few hundred hits using other facets, these text facets are still very
>> >> slow.
>> >>
>> >> I understand that text fields are not good candidates for faceting, as
>> >> they can contain a very large number of unique values. However, why is it
>> >> still slow after my matching documents are reduced to hundreds? Is it
>> >> because the whole filter is cached (regardless of the matching docs) and
>> >> I don't have enough filter cache size to fit the whole list?
>> > 
>> > Very interesting questions! I think an answer would both require and
>> > further an understanding of how filters work, which might even lead to
>> > a more general guideline on when and how to use filters and facets.
>> > 
>> > Even though faceting appears to have changed in 1.4 vs 1.3, it would
>> > still be interesting to understand the 1.3 side of things.
>> > 
>> >> Lastly, what I really want is to give the user a chance to visualize
>> >> and filter on the top relevant words in the free-text fields. Are there
>> >> alternatives to the facet field approach? Term vectors? I can do
>> >> client-side processing based on the top N (say 100) hits for this, but
>> >> it is my last option.
>> > 
>> > Also a very interesting data mining question! I'm sorry I don't have any
>> > answers for you. Maybe someone else does.
>> > 
>> > Best,
>> > 
>> > Michael Ludwig
>> > 
>> > 
>> 
>> -- 
>> View this message in context: 
>> http://www.nabble.com/Faceting-on-text-fields-tp23872891p23950084.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Faceting-on-text-fields-tp23872891p23965401.html
Sent from the Solr - User mailing list archive at Nabble.com.
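
For readers following this thread: the "manual counting" of terms in the top
N hits mentioned above, and the client-side processing Yao describes as a
last option, could look roughly like the sketch below. This is not code from
the thread. It assumes the hits have already been fetched from Solr as a list
of dicts, and the field name "text", the tiny stopword list, and the function
name top_terms are all hypothetical. It counts, for each term, the number of
top-N documents that contain it.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration only; a real setup would
# more likely reuse the stopwords configured for the Solr field.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "is", "on", "with"}


def top_terms(hits, field="text", n_docs=100, k=10):
    """Return the k terms that occur in the most of the first n_docs hits."""
    counts = Counter()
    for doc in hits[:n_docs]:
        tokens = re.findall(r"[a-z0-9]+", str(doc.get(field, "")).lower())
        # Use a set so each document contributes at most 1 per term.
        counts.update(set(tokens) - STOPWORDS)
    # Sort by descending count, then alphabetically, so ties are stable.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Made-up sample documents standing in for the top hits of a query.
hits = [
    {"id": "1", "text": "Machine learning for search relevance"},
    {"id": "2", "text": "Faceting and clustering in search engines"},
    {"id": "3", "text": "Clustering search results with machine learning"},
]

for term, doc_count in top_terms(hits, k=3):
    print(term, doc_count)
```

Because only the top N documents are scanned, the cost depends on N and the
document length rather than on the number of unique terms in the index, which
is the property Yao was after for highly fragmented text fields.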
