Thanks for the insight, Otis. I was not aware of ClusteringComponent until now. It is time to move to Solr 1.4.
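
As an aside, the client-side fallback mentioned in the quoted thread below, counting terms across the top N hits, might look roughly like this sketch. It is Python and purely illustrative: `top_terms`, the tiny stop-word list, and the sample hit texts are assumptions made up for the example, and nothing here calls Solr or Carrot2.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def top_terms(docs, n_docs=100, k=10):
    """Return the k most common terms across the first n_docs texts.

    Each term is counted once per document, so the number is a document
    frequency, similar in spirit to a facet count.
    """
    counts = Counter()
    for text in docs[:n_docs]:
        # Lowercase, split on non-alphanumerics, drop stop words and duplicates.
        terms = set(re.findall(r"[a-z0-9]+", text.lower())) - STOP_WORDS
        counts.update(terms)
    return counts.most_common(k)


# Made-up stand-ins for the text of the top-ranked hits.
hits = [
    "Faceting on text fields is slow",
    "Text fields have many unique values",
    "Clustering the top hits of a text search",
]

print(top_terms(hits, n_docs=3, k=2))  # [('text', 3), ('fields', 2)]
```

The cost of this approach grows with N and the length of the hit texts rather than with the number of unique terms in the whole index, which is why it may be attractive for highly fragmented text fields.
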
-Yao


Otis Gospodnetic wrote:
>
> Yao,
>
> Solr can already cluster top N hits using Carrot2:
> http://wiki.apache.org/solr/ClusteringComponent
>
> I've also done ugly "manual counting" of terms in top N hits. For
> example, look at the right side of this:
> http://www.simpy.com/user/otis/tag/%22machine+learning%22
>
> Something like http://www.sematext.com/product-key-phrase-extractor.html
> could also be used.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
>> From: Yao Ge <yao...@gmail.com>
>> To: solr-user@lucene.apache.org
>> Sent: Tuesday, June 9, 2009 3:46:13 PM
>> Subject: Re: Faceting on text fields
>>
>> Michael,
>>
>> Thanks for the update! I definitely need to get a 1.4 build to see
>> if it makes a difference.
>>
>> BTW, maybe instead of using faceting for text
>> mining/clustering/visualization purposes, we can build a separate
>> feature in Solr for this. Many of the commercial search engines I
>> have experience with (Google Search Appliance, Vivisimo, etc.)
>> provide dynamic term clustering based on the top N ranked documents
>> (N is a configurable parameter). When a facet field is highly
>> fragmented (say, a text field), the existing set-intersection-based
>> approach might no longer be optimal. Aggregating term vectors over
>> the top N docs might be more attractive. Another feature I would
>> really appreciate is search-time n-gram term clustering. Maybe this
>> would be better suited to the "spell checker", as it is just a
>> different way to display the alternative search terms.
>>
>> -Yao
>>
>>
>> Michael Ludwig-4 wrote:
>> >
>> > Yao Ge wrote:
>> >
>> >> The facet query is considerably slower compared to other facets
>> >> from structured database fields (with highly repeated values).
>> >> What I found interesting is that even after I constrained search
>> >> results to just a few hundred hits using other facets, these text
>> >> facets are still very slow.
>> >>
>> >> I understand that text fields are not good candidates for
>> >> faceting as they can contain a very large number of unique
>> >> values. However, why is it still slow after my matching documents
>> >> are reduced to hundreds? Is it because the whole filter is cached
>> >> (regardless of the matching docs) and I don't have enough filter
>> >> cache size to fit the whole list?
>> >
>> > Very interesting questions! I think an answer would both require
>> > and further an understanding of how filters work, which might even
>> > lead to a more general guideline on when and how to use filters
>> > and facets.
>> >
>> > Even though faceting appears to have changed in 1.4 vs 1.3, it
>> > would still be interesting to understand the 1.3 side of things.
>> >
>> >> Lastly, what I really want is to give users a chance to visualize
>> >> and filter on the top relevant words in the free-text fields. Are
>> >> there alternatives to the facet field approach? Term vectors? I
>> >> can do client-side processing based on the top N (say 100) hits
>> >> for this, but it is my last option.
>> >
>> > Also a very interesting data mining question! I'm sorry I don't
>> > have any answers for you. Maybe someone else does.
>> >
>> > Best,
>> >
>> > Michael Ludwig
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Faceting-on-text-fields-tp23872891p23950084.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>

--
View this message in context: http://www.nabble.com/Faceting-on-text-fields-tp23872891p23965401.html
Sent from the Solr - User mailing list archive at Nabble.com.