Re: Faceting on text fields
Hi,

Sorry for being late to the party. Let me try to clear up some doubts about Carrot2.

> Do you know under what circumstances or application should we cluster the
> whole corpus of documents vs just the search results?

I think it depends on what you're trying to achieve. If you'd like to give users an alternative way of exploring the search results by organizing them into semantically related groups (search results clustering), Carrot2 would be the appropriate tool. Its algorithms are designed to work with small input (up to ~1000 results) and try to provide meaningful labels for each cluster.

Currently, Carrot2 has two algorithms: an implementation of Suffix Tree Clustering (STC, a classic in search results clustering research, designed by O. Zamir, implemented by Dawid Weiss) and Lingo (designed and implemented by myself). STC is very fast compared to Lingo, but the latter will usually get you better clusters. Some comparison of the algorithms is here: http://project.carrot2.org/algorithms.html, but ultimately, I'd encourage you to experiment (e.g. using the Clustering Workbench).

For best results, I'd recommend feeding the algorithms with contextual snippets generated based on the user's query. If the summary consists of complete sentence(s) containing the query (as opposed to individual words delimited by "..."), you should get even nicer labels.

One important thing about search results clustering is that it is done on-line, so it will add extra time to each search query your server handles. Plus, to get reasonable clusters, you'd need to fetch at least 50 documents from your index, which may put more load on the disks as well (sometimes clustering time may be only a fraction of the time required to get the documents from the index).

Finally, to compare search results clustering with facets: UI-wise they may look similar, but I'd say they're two different things that complement each other. While the list of facets and their values is fairly static (brand names etc.), clusters are less "stable" -- they're generated dynamically for each search and will vary across queries. Plus, as with any other unsupervised machine learning technique, your clusters will never be 100% correct (as opposed to facets). Almost always you'll be getting one or two clusters that don't make much sense.

When it comes to clustering the whole collection, it might be useful in a couple of scenarios:

a) if you wanted to get some high-level overview of what's in your collection,
b) if you wanted to, e.g., use clusters to re-rank the search results presented to the user (implicit clustering: showing a few documents from each cluster),
c) if you wanted to distribute your index based on the semantics of the documents (wild guess, I'm not sure if anyone has tried that in practice).

In general, I feel clustering the whole index is much harder than search results clustering, not only because of the different scale, but also because you'd need to tune the algorithm for your specific needs and data. For example, in scenario a) and a collection of 1M documents: how many top level clusters do you generate? 10? 1? If it's 10, the clusters may end up too general / meaningless, and it might be hard to describe them concisely. If it's 1, clusters are likely to be more focused, but hard to browse... I must admit I haven't followed Mahout too closely; maybe there is some nice way of resolving these problems.

If you have any other questions about Carrot2, I'll try to answer them here. Alternatively, feel free to join the Carrot2 mailing lists.

Thanks,

Staszek

--
http://www.carrot2.org
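Scenario b) above (implicit clustering: showing a few documents from each cluster) can be illustrated with a small sketch. This is not Carrot2 or Mahout code; the function name `interleave_clusters` and the shape of the input (a list of clusters, each a ranked list of document ids) are assumptions made purely for illustration.

```python
def interleave_clusters(clusters, limit):
    """Pick documents round-robin from each cluster, skipping repeats.

    clusters: list of lists of document ids, each already in rank order
              (hypothetical input shape, not a Carrot2 data structure).
    limit:    maximum number of documents to return.
    """
    seen = set()
    result = []
    active = [iter(cluster) for cluster in clusters]
    while active and len(result) < limit:
        next_round = []
        for it in active:
            if len(result) >= limit:
                break
            # Advance this cluster to its next document not yet shown.
            for doc_id in it:
                if doc_id not in seen:
                    seen.add(doc_id)
                    result.append(doc_id)
                    next_round.append(it)
                    break
        # Clusters that ran out of unseen documents drop out here.
        active = next_round
    return result


clusters = [["d1", "d2", "d3"], ["d4", "d1"], ["d5"]]
print(interleave_clusters(clusters, limit=5))
```

The first few positions of the re-ranked list then cover every cluster once before any cluster contributes a second document, which is one simple reading of "a few documents from each cluster".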
Re: Faceting on text fields
Thanks Otis!

Do you know under what circumstances, or for which applications, we should cluster the whole corpus of documents vs. just the search results?

Jeffrey

On Fri, Jun 12, 2009 at 1:39 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

> Jeffrey,
>
> Are you looking to cluster a whole corpus of documents or just the search
> results? If it's the latter, use Carrot2. If it's the former, look at
> Mahout. Clustering top 1M matching documents doesn't really make sense.
> Usually top 100-200 is sufficient.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Faceting on text fields
Jeffrey,

Are you looking to cluster a whole corpus of documents or just the search results? If it's the latter, use Carrot2. If it's the former, look at Mahout. Clustering top 1M matching documents doesn't really make sense. Usually top 100-200 is sufficient.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
> From: Jeffrey Tiong
> To: solr-user@lucene.apache.org
> Sent: Friday, June 12, 2009 12:44:55 AM
> Subject: Re: Faceting on text fields
>
> Hi all,
>
> We are thinking of using the carrot clustering too. But we saw that carrot
> maybe can only cluster up to 1000 search snippets. Does anyone know how can
> we cluster snippets that is much more than that ? (maybe in the million
> range?)
>
> And what is the difference between mahout and carrot?
>
> Thank!
>
> Jeffrey
Re: Faceting on text fields
Hi all,

We are thinking of using Carrot clustering too, but we saw that Carrot may only be able to cluster up to 1000 search snippets. Does anyone know how we can cluster many more snippets than that (maybe in the million range)?

And what is the difference between Mahout and Carrot?

Thanks!

Jeffrey

On Thu, Jun 11, 2009 at 9:47 PM, Michael Ludwig wrote:

> Yao Ge wrote:
>
>> BTW, Carrot2 has a very impressive Clustering Workbench (based on
>> eclipse) that has built-in integration with Solr. If you have a Solr
>> service running, it is a just a matter of point the workbench to it.
>> The clustering results and visualization are amazing.
>> (http://project.carrot2.org/download.html).
>
> A new world opens up for me ...
>
> Thanks for pointing out how cool this is!
>
> Hint for other newcomers: Open the View Menu to configure the details of
> how you perform your search, e.g. your Solr URL in case it differs from
> the default, or your "summary field", which is what gets used to analyze
> the data in order to determine clusters, if I understand correctly.
>
> Michael Ludwig
Re: Faceting on text fields
Yao Ge wrote:

> BTW, Carrot2 has a very impressive Clustering Workbench (based on
> eclipse) that has built-in integration with Solr. If you have a Solr
> service running, it is a just a matter of point the workbench to it.
> The clustering results and visualization are amazing.
> (http://project.carrot2.org/download.html).

A new world opens up for me ...

Thanks for pointing out how cool this is!

Hint for other newcomers: Open the View Menu to configure the details of how you perform your search, e.g. your Solr URL in case it differs from the default, or your "summary field", which is what gets used to analyze the data in order to determine clusters, if I understand correctly.

Michael Ludwig
Re: Faceting on text fields
BTW, Carrot2 has a very impressive Clustering Workbench (based on Eclipse) that has built-in integration with Solr. If you have a Solr service running, it is just a matter of pointing the workbench at it. The clustering results and visualization are amazing. (http://project.carrot2.org/download.html).

Yao Ge wrote:

> FYI. I did a direct integration with Carrot2 with Solrj with a separate
> Ajax call from UI for top 100 hits to clusters terms in the two text
> fields. It gots comparable performance to other facets in terms of
> response time.
>
> In terms of algorithms, their listed two "Lingo" and "STC" which I don't
> reconize. But I think at least one of them might have used SVD
> (http://en.wikipedia.org/wiki/Singular_value_decomposition).
>
> -Yao

--
View this message in context: http://www.nabble.com/Faceting-on-text-fields-tp23872891p23980959.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Faceting on text fields
FYI. I did a direct integration of Carrot2 with SolrJ, using a separate Ajax call from the UI that clusters the terms in the two text fields for the top 100 hits. Its response time is comparable to that of the other facets.

In terms of algorithms, they list two, "Lingo" and "STC", which I don't recognize. But I think at least one of them might use SVD (http://en.wikipedia.org/wiki/Singular_value_decomposition).

-Yao

Otis Gospodnetic wrote:

> I'd call it related (their application in search encourages exploration),
> but also distinct enough to never mix them up. I think your assessment
> below is correct, although I'm not familiar with the details of Carrot2
> any more (was once), so I can't tell you exactly which algo is used under
> the hood.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Faceting on text fields
I'd call it related (their application in search encourages exploration), but also distinct enough to never mix them up. I think your assessment below is correct, although I'm not familiar with the details of Carrot2 any more (was once), so I can't tell you exactly which algo is used under the hood.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
> From: Michael Ludwig
> To: solr-user@lucene.apache.org
> Sent: Wednesday, June 10, 2009 9:41:54 AM
> Subject: Re: Faceting on text fields
>
> Otis Gospodnetic wrote:
>
> > Solr can already cluster top N hits using Carrot2:
> > http://wiki.apache.org/solr/ClusteringComponent
>
> Would it be fair to say that clustering as detailed on the page you're
> referring to is a kind of dynamic faceting? The faceting not being done
> based on distinct values of certain fields, but on the presence (and
> frequency) of terms in one field?
>
> The main difference seems to be that with faceting, grouping criteria
> (facets) are known beforehand, while with clustering, grouping criteria
> (the significant terms which create clusters - the cluster keys) have
> yet to be determined. Is that a correct assessment?
>
> Michael Ludwig
Re: Faceting on text fields
Thanks for the insight, Otis. I was not aware of ClusteringComponent until now. It is time to move to Solr 1.4.

-Yao

Otis Gospodnetic wrote:

> Yao,
>
> Solr can already cluster top N hits using Carrot2:
> http://wiki.apache.org/solr/ClusteringComponent
>
> I've also done ugly "manual counting" of terms in top N hits. For
> example, look at the right side of this:
> http://www.simpy.com/user/otis/tag/%22machine+learning%22
>
> Something like http://www.sematext.com/product-key-phrase-extractor.html
> could also be used.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Faceting on text fields
Otis Gospodnetic wrote:

> Solr can already cluster top N hits using Carrot2:
> http://wiki.apache.org/solr/ClusteringComponent

Would it be fair to say that clustering as detailed on the page you're referring to is a kind of dynamic faceting? The faceting not being done based on distinct values of certain fields, but on the presence (and frequency) of terms in one field?

The main difference seems to be that with faceting, grouping criteria (facets) are known beforehand, while with clustering, grouping criteria (the significant terms which create clusters - the cluster keys) have yet to be determined. Is that a correct assessment?

Michael Ludwig
Re: Faceting on text fields
Yonik Seeley wrote:

> Yep, all that sounds right.
> An additional optimization counts terms for the documents *not* in the
> set when the base set is over half the size of the index.

Cool :-) Thanks for confirming my assumptions!

Michael Ludwig
Re: Faceting on text fields
Yao,

Solr can already cluster top N hits using Carrot2:
http://wiki.apache.org/solr/ClusteringComponent

I've also done ugly "manual counting" of terms in top N hits. For example, look at the right side of this:
http://www.simpy.com/user/otis/tag/%22machine+learning%22

Something like http://www.sematext.com/product-key-phrase-extractor.html could also be used.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
> From: Yao Ge
> To: solr-user@lucene.apache.org
> Sent: Tuesday, June 9, 2009 3:46:13 PM
> Subject: Re: Faceting on text fields
>
> Michael,
>
> Thanks for the update! I definitely need to get a 1.4 build see if it makes
> a difference.
>
> BTW, maybe instead of using faceting for text
> mining/clustering/visualization purpose, we can build a separate feature in
> SOLR for this. Many of commercial search engines I have experiences with
> (Google Search Appliance, Vivisimo etc) provide dynamic term clustering
> based on top N ranked documents (N is a parameter can be configured). When
> facet field is highly fragmented (say a text field), the existing set
> intersection based approach might no longer be optimum. Aggregating term
> vectors over top N docs might be more attractive. Another features I can
> really appreciate is to provide search time n-gram term clustering. Maybe
> this might be better suited for "spell checker" as it just a different way
> to display the alternative search terms.
>
> -Yao
Re: Faceting on text fields
Michael,

Thanks for the update! I definitely need to get a 1.4 build to see if it makes a difference.

BTW, maybe instead of using faceting for text mining/clustering/visualization purposes, we could build a separate feature in Solr for this. Many of the commercial search engines I have experience with (Google Search Appliance, Vivisimo etc.) provide dynamic term clustering based on the top N ranked documents (N is a configurable parameter). When the facet field is highly fragmented (say, a text field), the existing set-intersection-based approach might no longer be optimal. Aggregating term vectors over the top N docs might be more attractive.

Another feature I would really appreciate is search-time n-gram term clustering. Maybe this would be better suited to the "spell checker", as it is just a different way to display alternative search terms.

-Yao

Michael Ludwig-4 wrote:

> Yao Ge wrote:
>
>> The facet query is considerably slower comparing to other facets from
>> structured database fields (with highly repeated values). What I found
>> interesting is that even after I constrained search results to just a
>> few hundred hits using other facets, these text facets are still very
>> slow.
>>
>> I understand that text fields are not good candidate for faceting as
>> it can contain very large number of unique values. However why it is
>> still slow after my matching documents is reduced to hundreds? Is it
>> because the whole filter is cached (regardless the matching docs) and
>> I don't have enough filter cache size to fit the whole list?
>
> Very interesting questions! I think an answer would both require and
> further an understanding of how filters work, which might even lead to
> a more general guideline on when and how to use filters and facets.
>
> Even though faceting appears to have changed in 1.4 vs 1.3, it would
> still be interesting to understand the 1.3 side of things.
>
>> Lastly, what I really want to is to give user a chance to visualize
>> and filter on top relevant words in the free-text fields. Are there
>> alternative to facet field approach? term vectors? I can do client
>> side process based on top N (say 100) hits for this but it is my last
>> option.
>
> Also a very interesting data mining question! I'm sorry I don't have any
> answers for you. Maybe someone else does.
>
> Best,
>
> Michael Ludwig
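The idea of aggregating term vectors over the top N documents could look roughly like the sketch below. It is only an illustration under stated assumptions: the term vectors are taken to be plain lists of already-analyzed tokens fetched for the top N hits, and `top_terms` is a hypothetical helper, not a Solr or SolrJ API.

```python
from collections import Counter


def top_terms(term_vectors, k=10, stopwords=frozenset()):
    """Return the k terms that occur in the most of the given documents.

    term_vectors: one list of tokens per top-N hit (assumed input shape).
    Counts are document frequencies within the top N, not raw term counts.
    """
    counts = Counter()
    for terms in term_vectors:
        # Count each term once per document, ignoring stopwords.
        counts.update(set(terms) - stopwords)
    # Sort by count (descending), then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


hits = [
    ["solr", "facet", "cache"],
    ["solr", "facet"],
    ["solr", "cluster"],
]
print(top_terms(hits, k=2))
```

The cost of this approach depends on N and the length of the documents rather than on the number of unique terms in the whole index, which is what makes it attractive for highly fragmented text fields.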
Re: Faceting on text fields
Yep, all that sounds right.
An additional optimization counts terms for the documents *not* in the set when the base set is over half the size of the index.

-Yonik
http://www.lucidimagination.com

On Tue, Jun 9, 2009 at 1:01 PM, Michael Ludwig wrote:

> Yonik,
>
> from your initial comment for SOLR-475:
>
> | * To save space and speed up faceting, any term that matches enough
> | * documents will not be un-inverted... it will be skipped while
> | * building the un-inverted field structure, and will use a set
> | * intersection method during faceting.
>
> Does this mean that frequently occurring terms (which we can use for
> faceting in 1.3 without problems) are handled exactly as they were
> before, by allocating a slot in the filter cache upon request, while
> those zillions of pesky little fringe terms outside the mainstream,
> for which allocating a slot in the filter cache would be overkill
> (and possibly cause inefficient contention, eviction, and, hence,
> a performance penalty) are now handled by the new structure mapping
> documents to term numbers?
>
> So doing faceting for a given set of documents would result in (a) doing
> set intersection using those filter query results that have been set up
> (for the terms occurring in many documents), and (b) collecting all the
> pesky little terms from the new structure mapping documents to term
> numbers?
>
> So basically, depending on expediency, you (a) know the facets and count
> the documents which display them, or you (b) take the documents and see
> what facets they have?
>
> Michael Ludwig
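The complement optimization mentioned above rests on a simple identity: the count of a term within the base set equals its document frequency in the whole index minus its count among the documents outside the base set. The sketch below illustrates that identity only; it is not the Solr implementation, and the names `build_doc_freq` and `facet_counts` as well as the dict-based "un-inverted" structure are assumptions for illustration.

```python
from collections import Counter


def build_doc_freq(doc_terms):
    """Precompute, once per index, how many documents contain each term."""
    doc_freq = Counter()
    for terms in doc_terms.values():
        doc_freq.update(set(terms))
    return doc_freq


def facet_counts(doc_terms, doc_freq, base_set):
    """Count terms over base_set, walking whichever side is smaller.

    doc_terms: dict doc_id -> terms of that document (document-to-term map).
    doc_freq:  Counter from build_doc_freq(doc_terms).
    base_set:  set of doc ids matching the query and filters.
    """
    if 2 * len(base_set) <= len(doc_terms):
        # Small base set: count its documents directly.
        counts = Counter()
        for doc_id in base_set:
            counts.update(set(doc_terms[doc_id]))
        return counts
    # Large base set: count the documents NOT in it and subtract.
    counts = Counter(doc_freq)
    for doc_id in doc_terms.keys() - base_set:
        counts.subtract(set(doc_terms[doc_id]))
    return +counts  # unary plus drops terms whose count fell to zero


index = {1: ["a", "b"], 2: ["a"], 3: ["b", "c"], 4: ["a", "c"]}
df = build_doc_freq(index)
print(dict(facet_counts(index, df, {1, 2, 3})))  # takes the complement path
```

Either branch touches at most half of the documents, which is the point of switching sides once the base set exceeds half the index.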
Re: Faceting on text fields
Yonik Seeley wrote:

> Are you using Solr 1.3?
> You might want to try the latest 1.4 test build - faceting has changed a
> lot.

I found two significant changes (but there may well be more):

[#SOLR-911] multi-select facets - ASF JIRA
https://issues.apache.org/jira/browse/SOLR-911

Yao, it sounds like the following (which is in 1.4) might have a chance of helping your faceting performance issue:

[#SOLR-475] multi-valued faceting via un-inverted field - ASF JIRA
https://issues.apache.org/jira/browse/SOLR-475

Yonik, from your initial comment for SOLR-475:

| * To save space and speed up faceting, any term that matches enough
| * documents will not be un-inverted... it will be skipped while
| * building the un-inverted field structure, and will use a set
| * intersection method during faceting.

Does this mean that frequently occurring terms (which we can use for faceting in 1.3 without problems) are handled exactly as they were before, by allocating a slot in the filter cache upon request, while those zillions of pesky little fringe terms outside the mainstream, for which allocating a slot in the filter cache would be overkill (and possibly cause inefficient contention, eviction, and, hence, a performance penalty) are now handled by the new structure mapping documents to term numbers?

So doing faceting for a given set of documents would result in (a) doing set intersection using those filter query results that have been set up (for the terms occurring in many documents), and (b) collecting all the pesky little terms from the new structure mapping documents to term numbers?

So basically, depending on expediency, you (a) know the facets and count the documents which display them, or you (b) take the documents and see what facets they have?

Michael Ludwig
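The two-path scheme described in this question can be sketched as follows. This is a toy model, not the SOLR-475 code: the threshold `max_df`, the function names, and the use of Python sets and lists in place of cached filters and term numbers are all assumptions made for illustration.

```python
from collections import Counter


def build_structures(docs, max_df):
    """Split terms into frequent ones (kept as doc-id sets, path a)
    and rare ones (kept in a document-to-terms map, path b).

    docs:   dict doc_id -> set of terms in the faceted field.
    max_df: terms matching more than this many documents are treated as
            frequent and are not un-inverted (illustrative threshold).
    """
    postings = {}
    for doc_id, terms in docs.items():
        for term in terms:
            postings.setdefault(term, set()).add(doc_id)
    big_terms = {t: ids for t, ids in postings.items() if len(ids) > max_df}
    uninverted = {
        doc_id: [t for t in terms if t not in big_terms]
        for doc_id, terms in docs.items()
    }
    return big_terms, uninverted


def facet(big_terms, uninverted, base_set):
    counts = Counter()
    # (a) Frequent terms: intersect each term's doc set with the base set.
    for term, doc_ids in big_terms.items():
        hits = len(doc_ids & base_set)
        if hits:
            counts[term] = hits
    # (b) Rare terms: walk the base set and collect each document's terms.
    for doc_id in base_set:
        counts.update(uninverted.get(doc_id, ()))
    return counts


docs = {1: {"the", "solr"}, 2: {"the", "facet"}, 3: {"the", "solr", "cache"}}
big, uninv = build_structures(docs, max_df=2)
print(dict(facet(big, uninv, {1, 3})))
```

Path (a) costs one set intersection per frequent term regardless of how many documents match, while path (b) costs work proportional to the size of the base set, so a query narrowed to a few hundred hits only pays for those hits on the rare-term side.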
Re: Faceting on text fields
Yao Ge wrote:

> The facet query is considerably slower comparing to other facets from
> structured database fields (with highly repeated values). What I found
> interesting is that even after I constrained search results to just a
> few hundred hits using other facets, these text facets are still very
> slow.
>
> I understand that text fields are not good candidate for faceting as
> it can contain very large number of unique values. However why it is
> still slow after my matching documents is reduced to hundreds? Is it
> because the whole filter is cached (regardless the matching docs) and
> I don't have enough filter cache size to fit the whole list?

Very interesting questions! I think an answer would both require and further an understanding of how filters work, which might even lead to a more general guideline on when and how to use filters and facets.

Even though faceting appears to have changed in 1.4 vs 1.3, it would still be interesting to understand the 1.3 side of things.

> Lastly, what I really want to is to give user a chance to visualize
> and filter on top relevant words in the free-text fields. Are there
> alternative to facet field approach? term vectors? I can do client
> side process based on top N (say 100) hits for this but it is my last
> option.

Also a very interesting data mining question! I'm sorry I don't have any answers for you. Maybe someone else does.

Best,

Michael Ludwig
Re: Faceting on text fields
Yes. I am using 1.3. When is 1.4 due for release?

Yonik Seeley-2 wrote:

> Are you using Solr 1.3?
> You might want to try the latest 1.4 test build - faceting has changed a
> lot.
>
> -Yonik
> http://www.lucidimagination.com
Re: Faceting on text fields
Are you using Solr 1.3?
You might want to try the latest 1.4 test build - faceting has changed a lot.

-Yonik
http://www.lucidimagination.com

On Thu, Jun 4, 2009 at 12:01 PM, Yao Ge wrote:

> I am index a database with over 1 millions rows. Two of fields contain
> unstructured text but size of each fields is limited (256 characters).
>
> I come up with an idea to use visualize the text fields using text cloud by
> turning the two text fields in facets. The weight of font and size is of
> each facet value (words) derived from the facet counts. I used simpler field
> type so that the there is no stemming to these facet values:
>
> positionIncrementGap="100"
> ignoreCase="true" expand="false"/>
> words="stopwords.txt"/>
> generateWordParts="0" generateNumberParts="0" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
>
> The facet query is considerably slower comparing to other facets from
> structured database fields (with highly repeated values). What I found
> interesting is that even after I constrained search results to just a few
> hundred hits using other facets, these text facets are still very slow.
>
> I understand that text fields are not good candidate for faceting as it can
> contain very large number of unique values. However why it is still slow
> after my matching documents is reduced to hundreds? Is it because the whole
> filter is cached (regardless the matching docs) and I don't have enough
> filter cache size to fit the whole list?
>
> The following is my filterCahce setting:
>
> autowarmCount="128"/>
>
> Lastly, what I really want to is to give user a chance to visualize and
> filter on top relevant words in the free-text fields. Are there alternative
> to facet field approach? term vectors? I can do client side process based on
> top N (say 100) hits for this but it is my last option.
Faceting on text fields
I am indexing a database with over 1 million rows. Two of the fields contain unstructured text, but the size of each field is limited (256 characters).

I came up with the idea of visualizing the text fields as a text cloud by turning the two text fields into facets. The font weight and size of each facet value (word) are derived from the facet counts. I used a simpler field type so that there is no stemming of these facet values:

The facet query is considerably slower compared to other facets from structured database fields (with highly repeated values). What I found interesting is that even after I constrained the search results to just a few hundred hits using other facets, these text facets are still very slow.

I understand that text fields are not good candidates for faceting, as they can contain a very large number of unique values. However, why is it still slow after my matching documents are reduced to hundreds? Is it because the whole filter is cached (regardless of the matching docs) and my filter cache is not large enough to fit the whole list?

The following is my filterCache setting:

Lastly, what I really want is to give users a chance to visualize and filter on the top relevant words in the free-text fields. Are there alternatives to the facet field approach? Term vectors? I can do client-side processing based on the top N (say 100) hits for this, but it is my last option.
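Deriving font sizes from facet counts, as described for the text cloud above, might be done along these lines. The message does not say how the mapping was actually computed, so the logarithmic scaling, the pixel range, and the function name `font_sizes` are all assumptions chosen for illustration.

```python
import math


def font_sizes(facet_counts, min_px=10, max_px=32):
    """Map facet counts to font sizes on a logarithmic scale.

    facet_counts: dict term -> facet count (terms with count <= 0 are skipped).
    Log scaling keeps a few very frequent words from dwarfing the rest.
    """
    logs = {t: math.log(c) for t, c in facet_counts.items() if c > 0}
    if not logs:
        return {}
    low, high = min(logs.values()), max(logs.values())
    span = high - low
    sizes = {}
    for term, value in logs.items():
        # With a single distinct count, every term gets the maximum size.
        fraction = (value - low) / span if span else 1.0
        sizes[term] = round(min_px + fraction * (max_px - min_px))
    return sizes


print(font_sizes({"solr": 1000, "facet": 100, "cache": 10}))
```

The input dict would be built from the facet counts returned for the two text fields; only the terms actually shown in the cloud need to be passed in.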