Re: Facet Performance
Hoss,

This is still an extremely interesting area for possible improvements; I simply don't want the topic to die:
http://www.nabble.com/Facet-Performance-td7746964.html
http://issues.apache.org/jira/browse/SOLR-665
http://issues.apache.org/jira/browse/SOLR-667
http://issues.apache.org/jira/browse/SOLR-669

I am currently using faceting on a single-valued _tokenized_ field with a huge number of documents, and an _unsynchronized_ version of FIFOCache; 1.5 seconds average response time (for faceted queries only!).

I think we can use an additional cache for facet results (to store calculated values!); Lucene's FieldCache can be used only for non-tokenized, single-valued, non-boolean fields.

-Fuad

hossman_lucene wrote:
> : Unfortunately which strategy will be chosen is currently undocumented
> : and control is a bit oblique: If the field is tokenized or multivalued
> : or Boolean, the FilterQuery method will be used; otherwise the
> : FieldCache method. I expect I or others will improve that shortly.
>
> Bear in mind, what's provided out of the box is "SimpleFacets" ... it's
> designed to meet simple faceting needs ... when you start talking about
> hundreds or thousands of constraints per facet, you are getting outside the
> scope of what it was intended to serve efficiently.
>
> At a certain point the only practical thing to do is write a custom
> request handler that makes the best choices for your data.
>
> For the record: a really simple patch someone could submit would be to
> add an optional field-based param indicating which type of faceting
> (termenum/fieldcache) should be used to generate the list of terms, and
> then make SimpleFacets.getFacetFieldCounts use that and call the
> appropriate method instead of calling getTermCounts -- that way you could
> force one or the other if you know it's better for your data/query.
> -Hoss

--
View this message in context: http://www.nabble.com/Facet-Performance-tp7746964p18756500.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Facet Performance
On 12/7/06, Andrew Nagy <[EMAIL PROTECTED]> wrote:
> In September there was a thread [1] on this list about heterogeneous
> facets and their performance. I am having a similar issue and am unclear
> as to the resolution of this thread. I performed a search against my
> dataset (492,000 records) and got the results I am looking for in .3
> seconds. I then set facet to true and got results in 16 seconds, and the
> facets include data that is not in my result set; it is from the entire
> set. How do I limit the faceting to my result set and speed up the
> results?

1) facet on single-valued strings if you can
2) if you can't do (1), then enlarge the fieldcache so that the number of filters (one per possible term in the field you are filtering on) can fit.
3) facet counts are limited to the results of the query, filtered by any filters. Is there a reason you think they are not?

-Yonik
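For reference, Yonik's points map onto ordinary request parameters. A request like the following (host, core, and field names are illustrative, not taken from Andrew's schema) asks Solr to facet on a single-valued string field and reports counts restricted to the query's result set:

```
http://localhost:8983/solr/select?q=title:wind&rows=20
    &facet=true
    &facet.field=author_exact
    &facet.limit=10
```

`facet.field` must name an indexed field; `facet.limit` caps how many constraints come back per facet.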
Re: Facet Performance
Yonik Seeley wrote:
> 1) facet on single-valued strings if you can
> 2) if you can't do (1), then enlarge the fieldcache so that the number of
> filters (one per possible term in the field you are filtering on) can fit.

I will try this out.

> 3) facet counts are limited to the results of the query, filtered by any
> filters. Is there a reason you think they are not?

No, you are right. I was thrown off at 1st.

One complaint about the faceting though: why is the element that is returned called "1st"? This seems like a poor choice for an element name. Why not just name the element what is in the "name" attribute? It would make parsing much easier!

Thanks!
Andrew
Re: Facet Performance
On 12/7/06, Andrew Nagy <[EMAIL PROTECTED]> wrote:
> One complaint about the faceting though: why is the element that is
> returned called "1st"?

I think maybe you are seeing "lst" (it starts with an L, not a one). It is short for NamedList, an ordered list whose elements are named.

> This seems like a poor choice for an element name. Why not just name the
> element what is in the "name" attribute? It would make parsing much
> easier!

When the XML was first conceived, there was a preference for limiting the number of tags. The structure could have been inverted so that ...

-Yonik
Re: Facet Performance
: > This seems like a poor choice for an element
: > name. Why not just name the element what is in the "name" attribute?
: > It would make parsing much easier!
:
: When the XML was first conceived, there was a preference for limiting
: the number of tags.
: The structure could have been inverted so that ...

...but then we couldn't support arbitrary field names, and it would be impossible to validate the XML docs independent of the schema. See this previous explanation:

http://www.nabble.com/Default-XML-Output-Schema-tf2312439.html#a643

-Hoss
Re: Facet Performance
Yonik Seeley wrote:
> 1) facet on single-valued strings if you can
> 2) if you can't do (1), then enlarge the fieldcache so that the number of
> filters (one per possible term in the field you are filtering on) can fit.

I changed the filterCache to the following:

However, a search that normally takes .04s is taking 74 seconds once I use the facets, since I am faceting on 4 fields. Can you suggest a better configuration that would solve this performance issue, or should I not use faceting?

I figure I could run the query twice, once limited to 20 records and then again with the limit set to the total number of records, and develop my own facets. I have in fact done this before with a different back-end and my code is processed in under .01 seconds. Why is faceting so slow?

Andrew
Re: Facet Performance
On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote:
> I changed the filterCache to the following:
> However, a search that normally takes .04s is taking 74 seconds once I
> use the facets, since I am faceting on 4 fields.

The first time or subsequent times? Is your filterCache big enough yet? What do you see for evictions and hit ratio?

> Can you suggest a better configuration that would solve this performance
> issue, or should I not use faceting?

Faceting isn't something that will always be fast... one often needs to design things in a way that it can be fast.

Can you give some examples of your faceted queries? Can you show the field and fieldtype definitions for the fields you are faceting on? For each field that you are faceting on, how many different terms are in it?

> I figure I could run the query twice, once limited to 20 records and then
> again with the limit set to the total number of records, and develop my
> own facets. I have in fact done this before with a different back-end and
> my code is processed in under .01 seconds. Why is faceting so slow?

It's computationally expensive to get exact facet counts for a large number of hits, and that is what the current faceting code is designed to do. No single method will be appropriate *and* fast for all scenarios. Another method that hasn't been implemented is some statistical faceting based on the top hits, using stored fields or stored term vectors.

-Yonik
Re: Facet Performance
Yonik Seeley wrote:
> On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote:
>> I changed the filterCache to the following:
>> However, a search that normally takes .04s is taking 74 seconds once I
>> use the facets, since I am faceting on 4 fields.
>
> The first time or subsequent times? Is your filterCache big enough yet?
> What do you see for evictions and hit ratio?

Here are the stats. I'm still a newbie to SOLR, so I'm not totally sure what this all means:

lookups : 1530036
hits : 2
hitratio : 0.00
inserts : 1530035
evictions : 1504435
size : 25600
cumulative_lookups : 1530036
cumulative_hits : 2
cumulative_hitratio : 0.00
cumulative_inserts : 1530035
cumulative_evictions : 1504435

Could you suggest a better configuration based on this?

>> Can you suggest a better configuration that would solve this performance
>> issue, or should I not use faceting?
>
> Faceting isn't something that will always be fast... one often needs to
> design things in a way that it can be fast.
>
> Can you give some examples of your faceted queries? Can you show the
> field and fieldtype definitions for the fields you are faceting on? For
> each field that you are faceting on, how many different terms are in it?

My data is 492,000 records of book data. I am faceting on 4 fields: author, subject, language, format. Format and language are fairly simple, as there are only a few unique terms. Author and subject, however, are much different in that there are thousands of unique terms.

Thanks for your help!
Andrew
Re: Facet Performance
: Here are the stats. I'm still a newbie to SOLR, so I'm not totally sure
: what this all means:
: lookups : 1530036
: hits : 2
: hitratio : 0.00
: inserts : 1530035
: evictions : 1504435
: size : 25600

Those numbers are telling you that your cache is capable of holding 25,600 items. You have attempted to look something up in the cache 1,530,036 times, and only 2 of those times did you get a hit. You have added 1,530,035 items to the cache, and 1,504,435 items have been removed from your cache to make room for newer items.

In short: your cache isn't really helping you at all.

: Could you suggest a better configuration based on this?

If that's what your stats look like after a single request, then I would guess you would need to make your cache size at least 1.6 million in order for it to be of any use in improving your facet speed.

: My data is 492,000 records of book data. I am faceting on 4 fields:
: author, subject, language, format.
: Format and language are fairly simple, as there are only a few unique
: terms. Author and subject, however, are much different in that there are
: thousands of unique terms.

By the looks of it, you have a lot more than a few thousand unique terms in those two fields ... are you tokenizing these fields? That's probably not what you want for fields you're going to facet on.

-Hoss
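Hoss's sizing advice corresponds to the filterCache entry in solrconfig.xml. A minimal sketch, with the caveat that the numbers are illustrative; the size should be derived from your own "lookups" statistic, and a cache this large needs matching JVM heap:

```xml
<!-- solrconfig.xml: filterCache sized so one faceting request's filters
     all fit; 1600000 follows Hoss's estimate for this index, not a
     general-purpose default -->
<filterCache
    class="solr.LRUCache"
    size="1600000"
    initialSize="512"
    autowarmCount="256"/>
```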
Re: Facet Performance
On 12/8/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> : My data is 492,000 records of book data. I am faceting on 4 fields:
> : author, subject, language, format.
> : Format and language are fairly simple, as there are only a few unique
> : terms. Author and subject, however, are much different in that there are
> : thousands of unique terms.
>
> By the looks of it, you have a lot more than a few thousand unique terms
> in those two fields ... are you tokenizing these fields? That's
> probably not what you want for fields you're going to facet on.

Right, if any of these are tokenized, then you could make them non-tokenized (use the "string" type). If they really need to be tokenized (author, for example), then you could use copyField to make another copy to a non-tokenized field that you can use for faceting.

After that, as Hoss suggests, run a single faceting query with all 4 fields and look at the filterCache statistics. Take the "lookups" number and multiply it by, say, 1.5 to leave some room for future growth, and use that as your cache size. You probably want to bump up both initialSize and autowarmCount as well.

The first query will still be slow. The second should be relatively fast. You may hit an OOM error; increase the JVM heap size if this happens.

-Yonik
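Yonik's copyField suggestion might look like this in schema.xml (the field names here are hypothetical, chosen to match the thread's example):

```xml
<!-- schema.xml: tokenized field, used for searching -->
<field name="author" type="text" indexed="true" stored="true"
       multiValued="true"/>

<!-- untokenized "string" copy, used only for faceting -->
<field name="author_exact" type="string" indexed="true" stored="false"
       multiValued="true"/>

<!-- index every author value into both fields -->
<copyField source="author" dest="author_exact"/>
```

Queries then search on `author` but facet with `facet.field=author_exact`.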
Re: Facet Performance
Chris Hostetter wrote:
> : Could you suggest a better configuration based on this?
>
> If that's what your stats look like after a single request, then I would
> guess you would need to make your cache size at least 1.6 million in
> order for it to be of any use in improving your facet speed.

Would this have any strong impacts on my system? Should I just set it to an even 2 million to allow for growth?

> : My data is 492,000 records of book data. I am faceting on 4 fields:
> : author, subject, language, format.
> : Format and language are fairly simple, as there are only a few unique
> : terms. Author and subject, however, are much different in that there
> : are thousands of unique terms.
>
> By the looks of it, you have a lot more than a few thousand unique terms
> in those two fields ... are you tokenizing these fields? That's
> probably not what you want for fields you're going to facet on.

All of these fields are set as "string" in my schema, so if I understand the fields correctly, they are not being tokenized. I also have an author field that is set as "text" for searching.

Thanks
Andrew
Re: Facet Performance
On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote:
> Chris Hostetter wrote:
> > : Could you suggest a better configuration based on this?
> >
> > If that's what your stats look like after a single request, then I
> > would guess you would need to make your cache size at least 1.6 million
> > in order for it to be of any use in improving your facet speed.
>
> Would this have any strong impacts on my system? Should I just set it to
> an even 2 million to allow for growth?

Change the following setting in solrconfig.xml from true to false, and you should be fine with a higher setting:

That will prevent the filterCache from being used for anything but filters and faceting, so if you set it too high, it won't be utilized anyway.

> > : My data is 492,000 records of book data. I am faceting on 4 fields:
> > : author, subject, language, format.
> > : Format and language are fairly simple, as there are only a few unique
> > : terms. Author and subject, however, are much different in that there
> > : are thousands of unique terms.
> >
> > By the looks of it, you have a lot more than a few thousand unique
> > terms in those two fields ... are you tokenizing these fields? That's
> > probably not what you want for fields you're going to facet on.
>
> All of these fields are set as "string" in my schema

Are they multivalued, and do they need to be? Anything that is of type "string" and not multivalued will use the Lucene FieldCache rather than the filterCache.

-Yonik
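The XML element itself was lost when the archive stripped markup. Based on Yonik's description (stop the filterCache being used for sorted queries), the setting he means is most likely useFilterForSortedQuery from the example solrconfig.xml of that era; treat the exact name as an assumption:

```xml
<!-- solrconfig.xml: assumed to be the setting Yonik refers to;
     false keeps sorted queries from consuming filterCache entries -->
<useFilterForSortedQuery>false</useFilterForSortedQuery>
```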
Re: Facet Performance
Yonik Seeley wrote:
> Are they multivalued, and do they need to be? Anything that is of type
> "string" and not multivalued will use the Lucene FieldCache rather than
> the filterCache.

The author field is multivalued. Will this be a strong performance issue? I could make multiple author fields so as not to have the multivalued field, and then only facet on the first author.

Thanks
Andrew
Re: Facet Performance
Andrew Nagy, ditto on what Yonik said. Here is some further elaboration:

I am doing much the same thing (faceting on Author etc.). When my Author field was defined as a solr.TextField, even using solr.KeywordTokenizerFactory so it wasn't actually tokenized, the faceting code chose the QueryFilter approach, and faceting on Author for 100k+ documents took about 4 seconds.

When I changed the field to "string", e.g. solr.StrField, the faceting code recognized it as untokenized and used the FieldCache approach. Times have dropped to about 120ms for the first query (when the FieldCache is generated) and < 10ms for subsequent queries returning a few thousand results. Quite a difference.

The strategy must be chosen on a field-by-field basis. While QueryFilter is excellent for fields with a small set of enumerated values such as Language or Format, it is inappropriate for large value sets such as Author.

Unfortunately, which strategy will be chosen is currently undocumented, and control is a bit oblique: if the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method. I expect I or others will improve that shortly.

- J.J.

At 2:58 PM -0500 12/8/06, Yonik Seeley wrote:
>Right, if any of these are tokenized, then you could make them
>non-tokenized (use "string" type). If they really need to be
>tokenized (author for example), then you could use copyField to make
>another copy to a non-tokenized field that you can use for faceting.
>
>After that, as Hoss suggests, run a single faceting query with all 4
>fields and look at the filterCache statistics. Take the "lookups"
>number and multiply it by, say, 1.5 to leave some room for future
>growth, and use that as your cache size. You probably want to bump up
>both initialSize and autowarmCount as well.
>
>The first query will still be slow. The second should be relatively fast.
>You may hit an OOM error. Increase the JVM heap size if this happens.
>
>-Yonik
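J.J.'s observation can be made concrete in schema terms: SimpleFacets picks its strategy from the field's declared properties, not from what the analyzer actually emits. A sketch of the two definitions being contrasted (type and field names are illustrative):

```xml
<!-- solr.TextField with KeywordTokenizerFactory: emits one token per
     value, but the field is still *typed* as tokenized, so the slower
     QueryFilter path is chosen -->
<fieldType name="keywordText" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- solr.StrField ("string"): recognized as untokenized, so the fast
     FieldCache path is eligible (if also single-valued, non-Boolean) -->
<field name="author_facet" type="string" indexed="true" stored="true"/>
```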
Re: Facet Performance
On 12/8/06, J.J. Larrea <[EMAIL PROTECTED]> wrote: Unfortunately which strategy will be chosen is currently undocumented and control is a bit oblique: If the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method. If anyone had time some of this could be documented here: http://wiki.apache.org/solr/SimpleFacetParameters The wiki is open to all. Or perhaps a new top level FacetedSearching page that references SimpleFacetParameters -Yonik
Re: Facet Performance
J.J. Larrea wrote:
> Unfortunately, which strategy will be chosen is currently undocumented,
> and control is a bit oblique: if the field is tokenized or multivalued
> or Boolean, the FilterQuery method will be used; otherwise the
> FieldCache method. I expect I or others will improve that shortly.

Good to hear, because I can't really get away with not having a multi-valued field for author. I'm really excited by Solr and really impressed so far.

Thanks!
Andrew
Re: Facet Performance
On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote:
> My data is 492,000 records of book data. I am faceting on 4 fields:
> author, subject, language, format. Format and language are fairly simple,
> as there are only a few unique terms. Author and subject, however, are
> much different in that there are thousands of unique terms.

When encountering difficult issues, I like to think in terms of the user interface. Surely you're not presenting 400k+ authors to the users in one shot. In Collex, we have put an AJAX drop-down that shows the author facet (we call it "name" in the UI, with various roles like author, painter, etc). You can see this in action here:

http://www.nines.org/collex

Type "da" into the name field, for example.

I developed a custom request handler in Solr for returning these types of suggest interfaces, complete with facet counts. My code is very specific to our fields, so it's not usable in a general sense, but maybe this gives you some ideas on where to go with these large sets of facet values.

Erik
Re: Facet Performance
: Unfortunately which strategy will be chosen is currently undocumented
: and control is a bit oblique: If the field is tokenized or multivalued
: or Boolean, the FilterQuery method will be used; otherwise the
: FieldCache method. I expect I or others will improve that shortly.

Bear in mind, what's provided out of the box is "SimpleFacets" ... it's designed to meet simple faceting needs ... when you start talking about hundreds or thousands of constraints per facet, you are getting outside the scope of what it was intended to serve efficiently.

At a certain point the only practical thing to do is write a custom request handler that makes the best choices for your data.

For the record: a really simple patch someone could submit would be to add an optional field-based param indicating which type of faceting (termenum/fieldcache) should be used to generate the list of terms, and then make SimpleFacets.getFacetFieldCounts use that and call the appropriate method instead of calling getTermCounts -- that way you could force one or the other if you know it's better for your data/query.

-Hoss
Re: Facet Performance
Erik Hatcher wrote:
> On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote:
>> My data is 492,000 records of book data. I am faceting on 4 fields:
>> author, subject, language, format. Format and language are fairly
>> simple, as there are only a few unique terms. Author and subject,
>> however, are much different in that there are thousands of unique terms.
>
> When encountering difficult issues, I like to think in terms of the user
> interface. Surely you're not presenting 400k+ authors to the users in
> one shot. In Collex, we have put an AJAX drop-down that shows the author
> facet (we call it "name" in the UI, with various roles like author,
> painter, etc). You can see this in action here:

In our data, we don't have unique authors for each record ... so let's say out of the 500,000 records we have 200,000 authors. What I am trying to display is the top 10 authors from the results of a search. So I do a search for title:"Gone with the wind" and I would like to see the top 10 matching authors from these results.

But no worries, I have written my own facet handler and I am now back to under a second with faceting!

Thanks for everyone's help, and keep up the good work!
Andrew
RE: facet performance tips
Jerome,

Yes, you need to increase the filterCache size to something close to the number of unique facet elements. But also consider the RAM required to accommodate the increase. I did see a significant performance gain by increasing the filterCache size.

Thanks,
Kalyan Manepalli

-----Original Message-----
From: Jérôme Etévé [mailto:jerome.et...@gmail.com]
Sent: Wednesday, August 12, 2009 12:31 PM
To: solr-user@lucene.apache.org
Subject: facet performance tips

Hi everyone,

I'm using some faceting on a Solr index containing ~ 160K documents. I perform facets on multivalued string fields. The number of possible different values is quite large.

Enabling facets degrades the performance by a factor of 3.

Because I'm using Solr 1.3, I guess the faceting makes use of the filter cache to work. My filterCache is set to a size of 2048. I also noticed in my Solr stats a very small ratio of cache hits (~ 0.01%).

Can it be the reason why the faceting is slow? Does it make sense to increase the filterCache size so it matches more or less the number of different possible values for the faceted fields? Would that not make the memory usage explode?

Thanks for your help!

--
Jerome Eteve.
Chat with me live at http://www.eteve.net
jer...@eteve.net
RE: facet performance tips
I am currently faceting on a tokenized multi-valued field at http://www.tokenizer.org (25 million simple docs).

It uses some home-made quick fixes similar to SOLR-475 (SOLR-711) and a non-synchronized cache (similar to LingPipe's FastCache; SOLR-665, SOLR-667).

Average "faceting" on query results: 0.2 - 0.3 seconds; without those patches: 20-50 seconds.

I am going to upgrade to SOLR 1.4 from trunk (with SOLR-475 & SOLR-667) and compare results...

P.S.
Avoid faceting on a field with a heavy distribution of terms (such as the few million terms in my case); it won't work in SOLR 1.3.

TIP: use a non-tokenized single-valued field for faceting, such as a non-tokenized "country" field.

P.P.S.
It would be nice to load/stress-test
http://alias-i.com/lingpipe/docs/api/com/aliasi/util/FastCache.html
against ConcurrentHashMap, which can put the CPU in a spin loop.

-----Original Message-----
From: Erik Hatcher [mailto:ehatc...@apache.org]
Sent: August-12-09 2:12 PM
To: solr-user@lucene.apache.org
Subject: Re: facet performance tips

Yes, increasing the filterCache size will help with Solr 1.3 performance.

Do note that trunk (soon Solr 1.4) has dramatically improved faceting performance.

Erik

On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:
> Hi everyone,
>
> I'm using some faceting on a solr index containing ~ 160K documents.
> I perform facets on multivalued string fields. The number of possible
> different values is quite large.
>
> Enabling facets degrades the performance by a factor 3.
>
> Because I'm using solr 1.3, I guess the faceting makes use of the
> filter cache to work. My filterCache is set
> to a size of 2048. I also noticed in my solr stats a very small ratio
> of cache hits (~ 0.01%).
>
> Can it be the reason why the faceting is slow? Does it make sense to
> increase the filterCache size so it matches more or less the number
> of different possible values for the faceted fields? Would that not
> make the memory usage explode?
>
> Thanks for your help !
>
> --
> Jerome Eteve.
> Chat with me live at http://www.eteve.net
> jer...@eteve.net
Re: facet performance tips
Yes, increasing the filterCache size will help with Solr 1.3 performance.

Do note that trunk (soon Solr 1.4) has dramatically improved faceting performance.

Erik

On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:
> Hi everyone,
>
> I'm using some faceting on a solr index containing ~ 160K documents.
> I perform facets on multivalued string fields. The number of possible
> different values is quite large.
>
> Enabling facets degrades the performance by a factor 3.
>
> Because I'm using solr 1.3, I guess the faceting makes use of the
> filter cache to work. My filterCache is set to a size of 2048. I also
> noticed in my solr stats a very small ratio of cache hits (~ 0.01%).
>
> Can it be the reason why the faceting is slow? Does it make sense to
> increase the filterCache size so it matches more or less the number
> of different possible values for the faceted fields? Would that not
> make the memory usage explode?
>
> Thanks for your help !
>
> --
> Jerome Eteve.
> Chat with me live at http://www.eteve.net
> jer...@eteve.net
Re: facet performance tips
For your fields with many terms you may want to try Bobo, http://code.google.com/p/bobo-browse/, which could work well in your case.

On Wed, Aug 12, 2009 at 12:02 PM, Fuad Efendi wrote:
> I am currently faceting on a tokenized multi-valued field at
> http://www.tokenizer.org (25 million simple docs).
>
> It uses some home-made quick fixes similar to SOLR-475 (SOLR-711) and a
> non-synchronized cache (similar to LingPipe's FastCache; SOLR-665,
> SOLR-667).
>
> Average "faceting" on query results: 0.2 - 0.3 seconds; without those
> patches: 20-50 seconds.
>
> I am going to upgrade to SOLR 1.4 from trunk (with SOLR-475 & SOLR-667)
> and compare results...
>
> P.S.
> Avoid faceting on a field with a heavy distribution of terms (such as the
> few million terms in my case); it won't work in SOLR 1.3.
>
> TIP: use a non-tokenized single-valued field for faceting, such as a
> non-tokenized "country" field.
>
> P.P.S.
> It would be nice to load/stress-test
> http://alias-i.com/lingpipe/docs/api/com/aliasi/util/FastCache.html
> against ConcurrentHashMap, which can put the CPU in a spin loop.
>
> -----Original Message-----
> From: Erik Hatcher [mailto:ehatc...@apache.org]
> Sent: August-12-09 2:12 PM
> To: solr-user@lucene.apache.org
> Subject: Re: facet performance tips
>
> Yes, increasing the filterCache size will help with Solr 1.3 performance.
>
> Do note that trunk (soon Solr 1.4) has dramatically improved faceting
> performance.
>
> Erik
>
> On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:
>> Hi everyone,
>>
>> I'm using some faceting on a solr index containing ~ 160K documents.
>> I perform facets on multivalued string fields. The number of possible
>> different values is quite large.
>>
>> Enabling facets degrades the performance by a factor 3.
>>
>> Because I'm using solr 1.3, I guess the faceting makes use of the
>> filter cache to work. My filterCache is set
>> to a size of 2048. I also noticed in my solr stats a very small ratio
>> of cache hits (~ 0.01%).
>>
>> Can it be the reason why the faceting is slow? Does it make sense to
>> increase the filterCache size so it matches more or less the number
>> of different possible values for the faceted fields? Would that not
>> make the memory usage explode?
>>
>> Thanks for your help !
>>
>> --
>> Jerome Eteve.
>> Chat with me live at http://www.eteve.net
>> jer...@eteve.net
Re: facet performance tips
Note that depending on the profile of your field (full text, and how many unique terms on average per document), the improvements from 1.4 may not apply, as you may exceed the limits of the new faceting technique in Solr 1.4.

-Stephen

On Wed, Aug 12, 2009 at 2:12 PM, Erik Hatcher wrote:
> Yes, increasing the filterCache size will help with Solr 1.3 performance.
>
> Do note that trunk (soon Solr 1.4) has dramatically improved faceting
> performance.
>
> Erik
>
> On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:
>> Hi everyone,
>>
>> I'm using some faceting on a solr index containing ~ 160K documents.
>> I perform facets on multivalued string fields. The number of possible
>> different values is quite large.
>>
>> Enabling facets degrades the performance by a factor 3.
>>
>> Because I'm using solr 1.3, I guess the faceting makes use of the
>> filter cache to work. My filterCache is set
>> to a size of 2048. I also noticed in my solr stats a very small ratio
>> of cache hits (~ 0.01%).
>>
>> Can it be the reason why the faceting is slow? Does it make sense to
>> increase the filterCache size so it matches more or less the number
>> of different possible values for the faceted fields? Would that not
>> make the memory usage explode?
>>
>> Thanks for your help !
>>
>> --
>> Jerome Eteve.
>> Chat with me live at http://www.eteve.net
>> jer...@eteve.net

--
Stephen Duncan Jr
www.stephenduncanjr.com
Re: facet performance tips
Thanks everyone for your advice. I increased my filterCache, and the faceting performance improved greatly.

My faceted field can have at the moment ~4 different terms, so I did set a filterCache size of 5 and it works very well. However, I'm planning to increase the number of terms to maybe around 500 000, so I guess this approach won't work anymore, as I doubt a 500 000 sized fieldCache would work.

So I guess my best move would be to upgrade to the soon-to-be 1.4 version of Solr to benefit from its new faceting method.

I know this is a bit off-topic, but do you have a rough idea about when 1.4 will be an official release? As well, is the current trunk OK for production? Is it compatible with 1.3 configuration files?

Thanks!
Jerome.

2009/8/13 Stephen Duncan Jr:
> Note that depending on the profile of your field (full text, and how many
> unique terms on average per document), the improvements from 1.4 may not
> apply, as you may exceed the limits of the new faceting technique in Solr
> 1.4.
> -Stephen
>
> On Wed, Aug 12, 2009 at 2:12 PM, Erik Hatcher wrote:
>> Yes, increasing the filterCache size will help with Solr 1.3 performance.
>>
>> Do note that trunk (soon Solr 1.4) has dramatically improved faceting
>> performance.
>>
>> Erik
>>
>> On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:
>>> Hi everyone,
>>>
>>> I'm using some faceting on a solr index containing ~ 160K documents.
>>> I perform facets on multivalued string fields. The number of possible
>>> different values is quite large.
>>>
>>> Enabling facets degrades the performance by a factor 3.
>>>
>>> Because I'm using solr 1.3, I guess the faceting makes use of the
>>> filter cache to work. My filterCache is set
>>> to a size of 2048. I also noticed in my solr stats a very small ratio
>>> of cache hits (~ 0.01%).
>>>
>>> Can it be the reason why the faceting is slow? Does it make sense to
>>> increase the filterCache size so it matches more or less the number
>>> of different possible values for the faceted fields? Would that not
>>> make the memory usage explode?
>>>
>>> Thanks for your help !
>>>
>>> --
>>> Jerome Eteve.
>>> Chat with me live at http://www.eteve.net
>>> jer...@eteve.net
>
> --
> Stephen Duncan Jr
> www.stephenduncanjr.com

--
Jerome Eteve.
Chat with me live at http://www.eteve.net
jer...@eteve.net
RE: facet performance tips
I took 1.4 from trunk three days ago; it seems OK for production (at least for my Master instance, which is doing writes only). I use the same config files.

500 000 terms are OK too; I am using several million with pre-1.3 SOLR taken from trunk.

However, do not try to "facet" (probably an outdated term after SOLR-475) on generic queries such as [* TO *] (with a huge resultset). For smaller query results (100,000 instead of 100,000,000), "counting terms" is fast enough (a few milliseconds at http://www.tokenizer.org).

-----Original Message-----
From: Jérôme Etévé [mailto:jerome.et...@gmail.com]
Sent: August-13-09 5:38 AM
To: solr-user@lucene.apache.org
Subject: Re: facet performance tips

Thanks everyone for your advice. I increased my filterCache, and the faceting performance improved greatly.

My faceted field can have at the moment ~4 different terms, so I did set a filterCache size of 5 and it works very well. However, I'm planning to increase the number of terms to maybe around 500 000, so I guess this approach won't work anymore, as I doubt a 500 000 sized fieldCache would work.

So I guess my best move would be to upgrade to the soon-to-be 1.4 version of Solr to benefit from its new faceting method.

I know this is a bit off-topic, but do you have a rough idea about when 1.4 will be an official release? As well, is the current trunk OK for production? Is it compatible with 1.3 configuration files?

Thanks!
Jerome.

2009/8/13 Stephen Duncan Jr:
> Note that depending on the profile of your field (full text, and how many
> unique terms on average per document), the improvements from 1.4 may not
> apply, as you may exceed the limits of the new faceting technique in Solr
> 1.4.
> -Stephen
>
> On Wed, Aug 12, 2009 at 2:12 PM, Erik Hatcher wrote:
>> Yes, increasing the filterCache size will help with Solr 1.3 performance.
>>
>> Do note that trunk (soon Solr 1.4) has dramatically improved faceting
>> performance.
>>
>> Erik
>>
>> On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:
>>> Hi everyone,
>>>
>>> I'm using some faceting on a solr index containing ~ 160K documents.
>>> I perform facets on multivalued string fields. The number of possible
>>> different values is quite large.
>>>
>>> Enabling facets degrades the performance by a factor 3.
>>>
>>> Because I'm using solr 1.3, I guess the faceting makes use of the
>>> filter cache to work. My filterCache is set
>>> to a size of 2048. I also noticed in my solr stats a very small ratio
>>> of cache hits (~ 0.01%).
>>>
>>> Can it be the reason why the faceting is slow? Does it make sense to
>>> increase the filterCache size so it matches more or less the number
>>> of different possible values for the faceted fields? Would that not
>>> make the memory usage explode?
>>>
>>> Thanks for your help !
>>>
>>> --
>>> Jerome Eteve.
>>> Chat with me live at http://www.eteve.net
>>> jer...@eteve.net
>
> --
> Stephen Duncan Jr
> www.stephenduncanjr.com
RE: facet performance tips
It seems BOBO-Browse is an alternate faceting engine; it would be interesting to compare performance with SOLR... Distributed? -Original Message- From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] Sent: August-12-09 6:12 PM To: solr-user@lucene.apache.org Subject: Re: facet performance tips For your fields with many terms you may want to try Bobo http://code.google.com/p/bobo-browse/ which could work well with your case.
RE: facet performance tips
Interesting - it has "BoboRequestHandler implements SolrRequestHandler", so it's easy to try; and it supports shards [Fuad Efendi] It seems BOBO-Browse is alternate faceting engine; would be interesting to compare performance with SOLR... Distributed? [Jason Rutherglen] For your fields with many terms you may want to try Bobo http://code.google.com/p/bobo-browse/ which could work well with your case.
Re: facet performance tips
Yeah we need a performance comparison, I haven't had time to put one together. If/when I do I'll compare Bobo performance against Solr bitset intersection based facets, and compare memory consumption. For near realtime, Solr needs to cache and merge bitsets at the SegmentReader level, and Bobo needs to be upgraded to work with Lucene 2.9's searching at the segment level (currently it uses a MultiSearcher). Distributed search on either should be fairly straightforward? On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendi wrote: > It seems BOBO-Browse is alternate faceting engine; would be interesting to > compare performance with SOLR... Distributed? > > > -Original Message- > From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] > Sent: August-12-09 6:12 PM > To: solr-user@lucene.apache.org > Subject: Re: facet performance tips > > For your fields with many terms you may want to try Bobo > http://code.google.com/p/bobo-browse/ which could work well with your > case. > > > > >
RE: facet performance tips
SOLR-1.4-trunk seems to use term counting instead of bitset intersections; check this http://issues.apache.org/jira/browse/SOLR-475 (and probably http://issues.apache.org/jira/browse/SOLR-711) -Original Message- From: Jason Rutherglen Yeah we need a performance comparison, I haven't had time to put one together. If/when I do I'll compare Bobo performance against Solr bitset intersection based facets, compare memory consumption. For near realtime Solr needs to cache and merge bitsets at the SegmentReader level, and Bobo needs to be upgraded to work with Lucene 2.9's searching at the segment level (currently it uses a MultiSearcher). Distributed search on either should be fairly straightforward? On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendi wrote: > It seems BOBO-Browse is alternate faceting engine; would be interesting to > compare performance with SOLR... Distributed? > > > -Original Message- > From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] > Sent: August-12-09 6:12 PM > To: solr-user@lucene.apache.org > Subject: Re: facet performance tips > > For your fields with many terms you may want to try Bobo > http://code.google.com/p/bobo-browse/ which could work well with your > case. > > > > >
Re: facet performance tips
Right, I haven't used SOLR-475 yet and am more familiar with Bobo. I believe there are differences but I haven't gone into them yet. As I'm using Solr 1.4 now, maybe I'll test the UnInvertedField modality. Feel free to report back results as I don't think I've seen much yet? On Thu, Aug 13, 2009 at 10:51 AM, Fuad Efendi wrote: > SOLR-1.4-trunk uses terms counting instead of bitset intersects (seems to > be); check this > http://issues.apache.org/jira/browse/SOLR-475 > (and probably http://issues.apache.org/jira/browse/SOLR-711) > > -Original Message- > From: Jason Rutherglen > > Yeah we need a performance comparison, I haven't had time to put > one together. If/when I do I'll compare Bobo performance against > Solr bitset intersection based facets, compare memory > consumption. > > For near realtime Solr needs to cache and merge bitsets at the > SegmentReader level, and Bobo needs to be upgraded to work with > Lucene 2.9's searching at the segment level (currently it uses a > MultiSearcher). > > Distributed search on either should be fairly straightforward? > > On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendi wrote: >> It seems BOBO-Browse is alternate faceting engine; would be interesting to >> compare performance with SOLR... Distributed? >> >> >> -Original Message- >> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] >> Sent: August-12-09 6:12 PM >> To: solr-user@lucene.apache.org >> Subject: Re: facet performance tips >> >> For your fields with many terms you may want to try Bobo >> http://code.google.com/p/bobo-browse/ which could work well with your >> case. >> >> >> >> >> > > >
Re: Facet performance with heterogeneous 'facets'?
Just a little follow-up - I did a little more testing, and the query takes 20 seconds no matter what - if there's one document in the result set, or if I do a query that returns all 13 documents. It seems something isn't right... it looks like solr is doing faceted search on the whole index, no matter what the result set is, when doing facets on a string field. I must be doing something wrong? Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Michael Imbeault wrote: Been playing around with the new 'facets search' and it works very well, but it's really slow for some particular applications. I've been trying to use it to display the most frequent authors of articles; this is from a huge (15 million articles) database and names of authors are rare and heterogeneous. On a query that takes (without facets) 0.1 seconds, it jumps to ~20 seconds with just 1% of the documents indexed (I've been getting java.lang.OutOfMemoryError with the full index). ~40 seconds for a faceted search on 2 (string) fields. Range queries on a slong field are more acceptable (even with a dozen of them, query time is still in the subsecond range). Am I trying to do something which isn't what faceted search was made for? It would be understandable; after all, I guess the facets engine has to check every doc in the index and sort... which shouldn't yield good performance no matter what, sadly. Is there any other way I could achieve what I'm trying to do? Just a list of the most frequent (top 5) authors present in the results of a query. Thanks,
Re: Facet performance with heterogeneous 'facets'?
On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Been playing around with the new 'facets search' and it works very well, but it's really slow for some particular applications. I've been trying to use it to display the most frequent authors of articles I noticed this too, and have been thinking about ways to fix it. The root of the problem is that lucene, like all full-text search engines, uses inverted indices. It's fast and easy to get all documents for a particular term, but getting all terms for a document is either not possible, or not fast (assuming many documents match a query). For cases like "author", if there is only one value per document, then a possible fix is to use the field cache. If there can be multiple occurrences, there doesn't seem to be a good way that preserves exact counts, except maybe if the number of documents matching a query is low. -Yonik
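Yonik's point about inverted indices can be illustrated with a toy sketch (plain Java, no Lucene; the names and data here are illustrative only): looking up the documents for a term is one map access, but recovering the terms of a single document means walking every posting list.

```java
import java.util.*;

public class InvertedIndex {
    // term -> sorted list of doc ids containing it (a toy posting list)
    static final Map<String, List<Integer>> postings = Map.of(
        "hiv",   List.of(0, 2, 5),
        "blood", List.of(2, 3));

    // Fast: one map lookup per term.
    static List<Integer> docsFor(String term) {
        return postings.getOrDefault(term, List.of());
    }

    // Slow: recovering a document's terms means scanning every posting list.
    static List<String> termsFor(int doc) {
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : postings.entrySet())
            if (e.getValue().contains(doc)) terms.add(e.getKey());
        return terms;
    }
}
```

With a real index holding hundreds of thousands of terms, that second loop is exactly the work faceting must somehow avoid or amortize.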
Re: Facet performance with heterogeneous 'facets'?
On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Just a little follow-up - I did a little more testing, and the query takes 20 seconds no matter what - If there's one document in the results set, or if I do a query that returns all 13 documents. Yes, currently the same strategy is always used. intersection_count(docs_matching_query, docs_matching_author1) intersection_count(docs_matching_query, docs_matching_author2) intersection_count(docs_matching_query, docs_matching_author3) etc... Normally, the docsets will be cached, but since the number of authors is greater than the size of the filtercache, the effective cache hit rate will be 0% -Yonik
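The strategy Yonik describes amounts to one bitset intersection per facet term; a minimal stand-alone sketch, using java.util.BitSet in place of Solr's DocSet (class and method names are mine, not Solr's API):

```java
import java.util.*;

public class IntersectionFacets {
    // For each facet term, count how many query-matching docs contain it:
    // intersection_count(docs_matching_query, docs_matching_term)
    static Map<String, Integer> facetCounts(BitSet queryDocs,
                                            Map<String, BitSet> termDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, BitSet> e : termDocs.entrySet()) {
            BitSet x = (BitSet) queryDocs.clone();
            x.and(e.getValue());                     // one intersection per term...
            counts.put(e.getKey(), x.cardinality()); // ...even for terms absent from the results
        }
        return counts;
    }
}
```

Note that the loop runs over every term in the field, which is why the cost is roughly the same whether the query matches 1 document or 130 000 - exactly what Michael observed.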
Re: Facet performance with heterogeneous 'facets'?
Yonik Seeley wrote: I noticed this too, and have been thinking about ways to fix it. The root of the problem is that lucene, like all full-text search engines, uses inverted indicies. It's fast and easy to get all documents for a particular term, but getting all terms for a document documents is either not possible, or not fast (assuming many documents match a query). Yeah that's what I've been thinking; the index isn't built to handle such searches, sadly :( It would be very nice to be able to rapidly search by most frequent author, journal, etc. For cases like "author", if there is only one value per document, then a possible fix is to use the field cache. If there can be multiple occurrences, there doesn't seem to be a good way that preserves exact counts, except maybe if the number of documents matching a query is low. I have one value per document (I have fields for authors, last_author and first_author, and I'm doing faceted search on first and last authors fields). How would I use the field cache to fix my problem? Also, would it be better to store a unique number (for each possible author) in an int field along with the string, and do the faceted searching on the int field? Would this be faster / require less memory? I guess that yes, and I'll test that when I have the time. Just a little follow-up - I did a little more testing, and the query takes 20 seconds no matter what - If there's one document in the results set, or if I do a query that returns all 13 documents. Yes, currently the same strategy is always used. intersection_count(docs_matching_query, docs_matching_author1) intersection_count(docs_matching_query, docs_matching_author2) intersection_count(docs_matching_query, docs_matching_author3) etc... Normally, the docsets will be cached, but since the number of authors is greater than the size of the filtercache, the effective cache hit rate will be 0% -Yonik So more memory would fix the problem? 
Also, I was under the impression that it was only searching / sorting for authors that it knows are in the result set... in the case of only one document (1 result), it seems strange that it takes the same time as for 130 000 results. It should just check the results, see that there's only one author, and return that? And in the case of 2 documents, just sort 2 authors (or 1 if they're the same)? I understand your answer (it does intersections), but I wonder why it's intersecting from the whole document set at first, and not docs_matching_query like you said. Thanks for the support, Michael
Re: Facet performance with heterogeneous 'facets'?
Another followup: I bumped all the caches in solrconfig.xml to size="1600384" initialSize="400096" autowarmCount="400096" It seemed to fix the problem on a very small index (facets on last and first author fields, + 12 range date facets, sub 0.3 seconds for queries). I'll check on the full index tomorrow (it's indexing right now, 400docs/sec!). However, I still don't have an idea of what these values represent, and how I should estimate what values to set them to. Originally I thought it was the size of the cache in kb, and someone on the list told me it was the number of items, but I don't quite get it. Better documentation on that would be welcomed :) Also, are there any plans to add an option not to run a facet search if the result set is too big? To avoid 40-second queries if the docset is too large... Thanks, Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Yonik Seeley wrote: On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Just a little follow-up - I did a little more testing, and the query takes 20 seconds no matter what - If there's one document in the results set, or if I do a query that returns all 13 documents. Yes, currently the same strategy is always used. intersection_count(docs_matching_query, docs_matching_author1) intersection_count(docs_matching_query, docs_matching_author2) intersection_count(docs_matching_query, docs_matching_author3) etc... Normally, the docsets will be cached, but since the number of authors is greater than the size of the filtercache, the effective cache hit rate will be 0% -Yonik
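For reference, the cache attributes in solrconfig.xml count cached entries (one entry per cached filter DocSet), not kilobytes; a typical declaration looks like this (the numbers below are illustrative, not recommendations):

```xml
<!-- size / initialSize / autowarmCount are numbers of cache entries,
     not kilobytes; autowarmCount entries are copied into the new cache
     when a new searcher is opened -->
<filterCache
  class="solr.LRUCache"
  size="16384"
  initialSize="4096"
  autowarmCount="4096"/>
```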
Re: Facet performance with heterogeneous 'facets'?
On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Yonik Seeley wrote: > For cases like "author", if there is only one value per document, then > a possible fix is to use the field cache. If there can be multiple > occurrences, there doesn't seem to be a good way that preserves exact > counts, except maybe if the number of documents matching a query is > low. > I have one value per document (I have fields for authors, last_author and first_author, and I'm doing faceted search on first and last authors fields). How would I use the field cache to fix my problem? Unless you want to dive into Solr development, you don't :-) It requires extensive changes to the faceting code and doing things a different way in some cases. The FieldCache is the fastest way to "uninvert" single valued fields... it's currently only used for Sorting, where one needs to quickly know the field value given the document id. The downside is high memory use, and that it's not a general solution... it can't handle fields with multiple tokens (tokenized fields or multi-valued fields). So the strategy would be to step through the documents, get the value for the field from the FieldCache, increment a counter for that value, then find the top counters when we are done. Also, would it be better to store a unique number (for each possible author) in an int field along with the string, and do the faceted searching on the int field? It won't really help. It wouldn't be faster, and it would require only slightly less memory. >> Just a little follow-up - I did a little more testing, and the query >> takes 20 seconds no matter what - If there's one document in the results >> set, or if I do a query that returns all 13 documents. > > Yes, currently the same strategy is always used. > intersection_count(docs_matching_query, docs_matching_author1) > intersection_count(docs_matching_query, docs_matching_author2) > intersection_count(docs_matching_query, docs_matching_author3) > etc... 
> > Normally, the docsets will be cached, but since the number of authors > is greater than the size of the filtercache, the effective cache hit > rate will be 0% > > -Yonik So more memory would fix the problem? Yes, if your collection size isn't that large... it's not a practical solution for many cases though. Also, I was under the impression that it was only searching / sorting for authors that it knows are in the result set... That's the problem... it's not necessarily easy to know *what* authors are in the result set. If we could quickly determine that, we could just count them and not do any intersections or anything at all. in the case of only one document (1 result), it seems strange that it takes the same time as for 130 000 results. It should just check the results, see that there's only one author, and return that? And in the case of 2 documents, just sort 2 authors (or 1 if they're the same)? I understand your answer (it does intersections), but I wonder why its intersecting from the whole document set at first, and not docs_matching_query like you said. It is just intersecting docs_matching_query. The problem is that it's intersecting that set with all possible author sets since it doesn't know ahead of time what authors are in the docs that match the query. There could be optimizations when docs_matching_query.size() is small, so we start somehow with terms in the documents rather than terms in the index. That requires termvectors to be stored (medium speed), or requires that the field be stored and that we re-analyze it (very slow). More optimization of special cases hasn't been done simply because no one has done it yet... (as you note, faceting is a new feature). -Yonik
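The FieldCache approach Yonik sketches above - step through the matching docs, look up each doc's single value, bump a counter, then take the top counters - might look like this in outline (plain Java; the fieldValues array stands in for Lucene's FieldCache, and all names are illustrative):

```java
import java.util.*;

public class FieldCacheFacets {
    // fieldValues[docId] plays the role of the FieldCache for a single-valued
    // field: one stored value per document, addressable by doc id.
    static List<Map.Entry<String, Integer>> topCounts(int[] matchingDocs,
                                                      String[] fieldValues,
                                                      int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc : matchingDocs) {
            counts.merge(fieldValues[doc], 1, Integer::sum); // one counter per value
        }
        List<Map.Entry<String, Integer>> top = new ArrayList<>(counts.entrySet());
        top.sort((a, b) -> b.getValue() - a.getValue());     // highest count first
        return top.subList(0, Math.min(limit, top.size()));
    }
}
```

The cost here is proportional to the number of matching documents, not to the number of unique terms in the index - which is why it suits the many-unique-authors case.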
Re: Facet performance with heterogeneous 'facets'?
Michael Imbeault wrote: Also, is there any plans to add an option not to run a facet search if the result set is too big? To avoid 40 seconds queries if the docset is too large... You could run one query with facet=false, check the result size and then run it again (should be fast because it is cached) with facet=true&rows=0 to get facet results only. I would think that the decision to run/not run facets would be highly custom to your collection and not easily developed as a configurable feature. --Joachim
Re: Facet performance with heterogeneous 'facets'?
I just updated the comments in solrconfig.xml: On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Another followup: I bumped all the caches in solrconfig.xml to size="1600384" initialSize="400096" autowarmCount="400096" It seemed to fix the problem on a very small index (facets on last and first author fields, + 12 range date facets, sub 0.3 seconds for queries). I'll check on the full index tomorrow (it's indexing right now, 400docs/sec!). However, I still don't have an idea what are these values representing, and how I should estimate what values I should set them to. Originally I thought it was the size of the cache in kb, and someone on the list told me it was number of items, but I don't quite get it. Better documentation on that would be welcomed :) Also, is there any plans to add an option not to run a facet search if the result set is too big? To avoid 40 seconds queries if the docset is too large... I'd like to speed up certain corner cases, but you can always set timeouts in whatever frontend is making the request to Solr too. -Yonik
Re: Facet performance with heterogeneous 'facets'?
Quick Question: did you say you are faceting on the first name field separately from the last name field? ... why? You'll probably see a sharp increase in performance if you have a single untokenized author field containing the full name and you facet on that -- there will be a lot fewer unique terms to use when computing DocSets and intersections. Second: you mentioned increasing the size of your filterCache significantly, but we don't really know how heterogeneous your index is ... once you made that change did your filterCache hitrate increase? .. do you have any evictions (you can check on the "Statistics" page) : > Also, I was under the impression : > that it was only searching / sorting for authors that it knows are in : > the result set... : : That's the problem... it's not necessarily easy to know *what* authors : are in the result set. If we could quickly determine that, we could : just count them and not do any intersections or anything at all. another way to look at it is that by looking at all the authors, the work done for generating the facet counts for query A can be completely reused for the next query B -- presuming your filterCache is large enough to hold all of the author filters. : There could be optimizations when docs_matching_query.size() is small, : so we start somehow with terms in the documents rather than terms in : the index. That requires termvectors to be stored (medium speed), or : requires that the field be stored and that we re-analyze it (very : slow). : : More optimization of special cases hasn't been done simply because no : one has done it yet... (as you note, faceting is a new feature). the optimization i anticipated from the beginning would probably be useful in the situation Michael is describing ... 
if there is a "long tail" of authors (and in my experience, there typically is) we can cache an ordered list of the top N most prolific authors, along with the count of how many documents they have in the index (this info is easy to get from TermEnum.docFreq). when we facet on the authors, we start with that list and go in order, generating their facet constraint count using the DocSet intersection just like we currently do ... if we reach our facet.limit before we reach the end of the list and the lowest constraint count is higher than the total doc count of the last author in the list, then we know we don't need to bother testing any other author, because no other author can possibly have a higher facet constraint count than the ones on our list (since they haven't even written that many documents) -Hoss
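Hoss's early-termination idea can be sketched as follows (plain Java, with BitSet standing in for DocSets; this is my reading of the proposal, not Solr code). The key invariant: a term's intersection count can never exceed its docFreq, so once the current facet.limit-th best count is at least the next term's docFreq, no remaining term can displace it:

```java
import java.util.*;

public class DocFreqPruning {
    static final class Term {
        final String name; final BitSet docs;
        Term(String name, BitSet docs) { this.name = name; this.docs = docs; }
    }

    // byDocFreq must be pre-sorted by docs.cardinality(), highest first.
    static List<String> topFacets(BitSet queryDocs, List<Term> byDocFreq, int limit) {
        // min-heap over intersection counts: the head is the weakest of the
        // current best `limit` candidates
        PriorityQueue<Map.Entry<String, Integer>> best =
            new PriorityQueue<>((a, b) -> a.getValue() - b.getValue());
        for (Term t : byDocFreq) {
            int docFreq = t.docs.cardinality();
            // prune: nothing later can beat the weakest current candidate
            if (best.size() == limit && best.peek().getValue() >= docFreq) break;
            BitSet x = (BitSet) queryDocs.clone();
            x.and(t.docs);                            // docs_matching_query ∩ docs_matching_term
            best.offer(Map.entry(t.name, x.cardinality()));
            if (best.size() > limit) best.poll();
        }
        List<String> out = new ArrayList<>();
        while (!best.isEmpty()) out.add(0, best.poll().getKey()); // strongest first
        return out;
    }
}
```

As Hoss notes, the pruning pays off when the query matches many documents, so the top intersection counts quickly exceed the docFreq of the terms further down the list.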
Re: Facet performance with heterogeneous 'facets'?
On 9/19/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: Quick Question: did you say you are faceting on the first name field seperately from the last name field? ... why? You'll probably see a sharp increase in performacne if you have a single untokenized author field containing hte full name and you facet on that -- there will be a lot less unique terms to use when computing DocSets and intersections. Second: you mentioned increasing hte size of your filterCache significantly, but we don't really know how heterogenous your index is ... once you made that cahnge did your filterCache hitrate increase? .. do you have any evictions (you can check on the "Statistics" patge) : > Also, I was under the impression : > that it was only searching / sorting for authors that it knows are in : > the result set... : : That's the problem... it's not necessarily easy to know *what* authors : are in the result set. If we could quickly determine that, we could : just count them and not do any intersections or anything at all. another way to look at it is that by looking at all the authors, the work done for generating the facet counts for query A can be completely reused for the next query B -- presuming your filterCache is large enough to hold all of the author filters. : There could be optimizations when docs_matching_query.size() is small, : so we start somehow with terms in the documents rather than terms in : the index. That requires termvectors to be stored (medium speed), or : requires that the field be stored and that we re-analyze it (very : slow). : : More optimization of special cases hasn't been done simply because no : one has done it yet... (as you note, faceting is a new feature). the optimization optimization i anticipated from teh begining, would probably be usefull in the situation Michael is describing ... 
if there is a "long tail" oif authors (and in my experience, there typically is) we can cache an ordered list of the top N most prolific authors, along with the count of how many documents they have in the index (this info is easy to getfrom TermEnum.docFreq). Yeah, I've thought about a fieldInfoCache too. It could also cache the total number of terms in order to make decisions about what faceting strategy to follow. when we facet on the authors, we start with that list and go in order, generating their facet constraint count using the DocSet intersection just like we currently do ... if we reach our facet.limit before we reach the end of hte list and the lowest constraint count is higher then the total doc count of the last author in the list, then we know we don't need to bother testing any other Author, because no other author an possibly have a higher facet constraint count then the ones on our list This works OK if the intersection counts are high (as a percentage of the facet sets). I'm not sure how often this will be the case though. Another tradeoff is to allow getting inexact counts with multi-token fields by: - simply faceting on the most popular values OR - do some sort of statistical sampling by reading term vectors for a fraction of the matching docs. -Yonik
Re: Facet performance with heterogeneous 'facets'?
: > when we facet on the authors, we start with : > that list and go in order, generating their facet constraint count using : > the DocSet intersection just like we currently do ... if we reach our : > facet.limit before we reach the end of hte list and the lowest constraint : > count is higher then the total doc count of the last author in the list, : > then we know we don't need to bother testing any other Author, because no : > other author an possibly have a higher facet constraint count then the : > ones on our list : : This works OK if the intersection counts are high (as a percentage of : the facet sets). I'm not sure how often this will be the case though. well, keep in mind "N" could be very big, big enough to store the full list of Terms sorted in docFreq order (it shouldn't take up much space since it's just the Term and an int) ... for any query that returns a "large" number of results, you probably won't need to reach the end of the list before you can tell that all the remaining Terms have a lower docFreq than the current last constraint count in your facet.limit list. For queries that return a "small" number of results, it wouldn't be as useful, but that's where a switch could be flipped to start with the values mapped to the docs (using FieldCache -- assuming single-value fields) : Another tradeoff is to allow getting inexact counts with multi-token fields by: : - simply faceting on the most popular values :OR : - do some sort of statistical sampling by reading term vectors for a : fraction of the matching docs. i loathe inexact counts ... i think of them as "Astrology" to the Astronomy of true Faceted Searching ... but i'm sure they would be "good enough" for some people's use cases. -Hoss
Re: Facet performance with heterogeneous 'facets'?
: I just updated the comments in solrconfig.xml: I've tweaked the SolrCaching wiki page to include some of this info as well, feel free to add any additional info you think would be helpful to other people (or ask any questions about it if any of it still doesn't seem clear to you)... http://wiki.apache.org/solr/SolrCaching : > now, 400docs/sec!). However, I still don't have an idea what are these : > values representing, and how I should estimate what values I should set : > them to. Originally I thought it was the size of the cache in kb, and : > someone on the list told me it was number of items, but I don't quite : > get it. Better documentation on that would be welcomed :) -Hoss
Re: Facet performance with heterogeneous 'facets'?
Thanks for all the great answers. Quick Question: did you say you are faceting on the first name field separately from the last name field? ... why? You misunderstood. I'm doing faceting on the first author, and the last author of the list. Life science papers have author lists, and the first one is usually the guy who did most of the work, and the last one is usually the boss of the lab. I already have untokenized author fields for that using copyField. Second: you mentioned increasing the size of your filterCache significantly, but we don't really know how heterogeneous your index is ... once you made that change did your filterCache hitrate increase? .. do you have any evictions (you can check on the "Statistics" page) It was at the default (16000) and it hit the ceiling so to speak. I did maxSize=1600 (for testing purposes) and now size : 17038 and 0 evictions. For a single facet field (journal name) with a limit of 5 and 12 faceted query fields (range on publication date), I now have 0.5-second searches, which is not too bad. The filtercache size is pretty much constant no matter how many queries I do. However, if I try to add another facet field (such as first_author), something strange happens. 99% CPU, the filter cache is filling up really fast, hitratio goes to hell, no disk activity, and it can stay that way for at least 30 minutes (didn't test longer, no point really). It turns out that journal_name has 17038 different tokens, which is manageable, but first_author has > 400 000. I don't think this will ever yield good performance, so I might only do journal_name facets. Any reason why faceting tries to preload every term in the field? I have noticed that facets are not cached. Facets off, cached queries take 0.01 seconds. Facets on, uncached and cached queries take 0.7 seconds. Any plans for a facets cache? I know that facets are still a very early feature, but they're already awesome; maybe my application is unrealistic. Thanks, Michael
Re: Facet performance with heterogeneous 'facets'?
On 9/21/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: It turns out that journal_name has 17038 different tokens, which is manageable, but first_author has > 400 000. I don't think this will ever yield good performance, so i might only do journal_name facets. Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): http://www.nabble.com/big-faceting-speedup-for-single-valued-fields-tf2308153.html -Yonik
Re: Facet performance with heterogeneous 'facets'?
Dude, stop being so awesome (and the whole Solr team). Seriously! Every problem / request (MoreLikeThis class, change AND/OR preference programmatically, etc) I've submitted to this mailing list has received a quick, more-than-I-ever-expected answer. I'll subscribe to the dev list (been reading it off and on), but I'm afraid I couldn't code my way out of a paper bag in Java. I'll contribute to the Solr wiki (the SolrPHP part in particular) as soon as I can. That's the least I can do! Btw, any plans for a facets cache? Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Yonik Seeley wrote: On 9/21/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: It turns out that journal_name has 17038 different tokens, which is manageable, but first_author has > 400 000. I don't think this will ever yield good performance, so i might only do journal_name facets. Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): http://www.nabble.com/big-faceting-speedup-for-single-valued-fields-tf2308153.html -Yonik
Re: Facet performance with heterogeneous 'facets'?
On 9/21/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Btw, Any plans for a facets cache? Maybe a partial one (like caching top terms to implement some other optimizations). My general philosophy on caching in Solr has been to cache things the client can't: elemental things, or *parts* of requests to make many different requests faster (most bang-for-the-buck). Caching complete requests/responses is generally less useful since it requires even more memory, has a worse hit ratio, and can be done anyway by the client or a separate process like squid. -Yonik
Re: Facet performance with heterogeneous 'facets'?
On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): OK, the optimization has been checked in. You can checkout from svn and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT). I'd be interested in hearing your results with it. The first facet request on a field will take longer than subsequent ones because the FieldCache entry is loaded on demand. You can use a firstSearcher/newSearcher hook in solrconfig.xml to send a facet request so that a real user would never see this slower query. -Yonik
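The warming hook Yonik refers to is configured in solrconfig.xml via a QuerySenderListener; a sketch (the field name and query here are placeholders for your own):

```xml
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
      <str name="facet.field">first_author</str>
    </lst>
  </arr>
</listener>
```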
Re: Facet performance with heterogeneous 'facets'?
I upgraded to the most recent Solr build (9-22) and sadly it's still really slow: an 800-second query with a single facet on first_author, 15 million documents total, and the query returns 180 results. Maybe I'm doing something wrong? Also, this is on my personal desktop, not on a server. Still, I'm getting 0.1-second queries without facets, so I don't think that's the cause. In the admin panel I can still see the filtercache doing millions of lookups (and tons of evictions once it hits the maxsize).

Here's the field i'm using in schema.xml :

This is the query : q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false

I'll do more testing on the weekend,

Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212

Yonik Seeley wrote: On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): OK, the optimization has been checked in. You can checkout from svn and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT). I'd be interested in hearing your results with it. The first facet request on a field will take longer than subsequent ones because the FieldCache entry is loaded on demand. You can use a firstSearcher/newSearcher hook in solrconfig.xml to send a facet request so that a real user would never see this slower query. -Yonik
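(The field definition after "Here's the field i'm using in schema.xml" was stripped by the list archive. Based on the surrounding discussion — type "string", single-valued — it would have looked something like this sketch; attribute values here are assumptions, not the actual posted definition:)

```xml
<!-- Assumed reconstruction of the stripped field definition -->
<field name="first_author" type="string" indexed="true" stored="true"/>
```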
Re: Facet performance with heterogeneous 'facets'?
On 9/22/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: I upgraded to the most recent Solr build (9-22) and sadly it's still really slow. 800 seconds query with a single facet on first_author, 15 millions documents total, the query return 180. Maybe i'm doing something wrong? Also, this is on my personal desktop; not on a server. Still, I'm getting 0.1 seconds queries without facets, so I don't think thats the cause. In the admin panel i can still see the filtercache doing millions of lookups (and tons of evictions once it hits the maxsize). The fact that you see all the filtercache usage means that the optimization didn't kick in for some reason. Here's the field i'm using in schema.xml : That looks fine... This is the query : q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false That looks OK too. I assume that you didn't change the fieldtype definition for "string", and that the schema has version="1.1"? Before 1.1, all fields were assumed to be multiValued (there was no checking or enforcement). -Yonik
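(The version Yonik is asking about is the attribute on the root element of schema.xml — Solr's schema-format version, not the user's own version number. A sketch of what to check for:)

```xml
<!-- schema.xml root element: version="1.1" enables per-field
     multiValued checking; 1.0 treated every field as multiValued -->
<schema name="example" version="1.1">
  <!-- types and fields go here -->
</schema>
```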
Re: Facet performance with heterogeneous 'facets'?
Excellent news; as you guessed, my schema was (for some reason) set to version 1.0. This also caused some of the problems I had with the original SolrPHP (parsing the wrong response). But better yet, the 800-second query is now running in 0.5-2 seconds! Amazing optimization! I can now do faceting on journal title (17 000 different titles) and last author (>400 000 authors), + 12 date range queries, in a very reasonable time (considering I'm on a test Windows desktop box and not a server). The only problem is that if I add first author, I get a java.lang.OutOfMemoryError: Java heap space. I'm sure this problem will go away on a server with more than the current 500 megs I can allocate to Tomcat.

Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212

Yonik Seeley wrote: On 9/22/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: I upgraded to the most recent Solr build (9-22) and sadly it's still really slow. 800 seconds query with a single facet on first_author, 15 millions documents total, the query return 180. Maybe i'm doing something wrong? Also, this is on my personal desktop; not on a server. Still, I'm getting 0.1 seconds queries without facets, so I don't think thats the cause. In the admin panel i can still see the filtercache doing millions of lookups (and tons of evictions once it hits the maxsize). The fact that you see all the filtercache usage means that the optimization didn't kick in for some reason. Here's the field i'm using in schema.xml : That looks fine... This is the query : q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false That looks OK too. I assume that you didn't change the fieldtype definition for "string", and that the schema has version="1.1"?
Before 1.1, all fields were assumed to be multiValued (there was no checking or enforcement). -Yonik
Re: Facet performance with heterogeneous 'facets'?
On 9/22/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Excellent news; as you guessed, my schema was (for some reason) set to version 1.0.

Yeah, I just realized that having "version" right next to "name" would lead people to think it's "their" version number, when it's really Solr's version number. I've added a comment to the example schema to clarify that.

But better yet, the 800 seconds query is now running in 0.5-2 seconds! Amazing optimization! I can now do faceting on journal title (17 000 different titles) and last author (>400 000 authors), + 12 date range queries, in a very reasonable time (considering im on a test windows desktop box and not a server). The only problem is if I add first author, I get a java.lang.OutOfMemoryError: Java heap space. I'm sure this problem will get away on a server with more than the current 500 megs I can allocate to Tomcat.

Yes, the Lucene FieldCache takes up a lot of memory. It basically holds the entire field in a non-inverted form: http://lucene.apache.org/java/docs/api/org/apache/lucene/search/FieldCache.StringIndex.html It's currently also used for sorting, which also needs fast document->fieldvalue lookups, rather than the inverted term->documents_containing_that_term mapping. -Yonik
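(To make the non-inverted layout concrete, here is a minimal, self-contained Java sketch of the idea behind FieldCache.StringIndex: one ordinal per document plus a term lookup table, so facet counting for a single-valued field is a single pass over an int array. The class, data, and variable names are illustrative — this is not Lucene's actual code.)

```java
public class StringIndexSketch {
    public static void main(String[] args) {
        // lookup[ord] is the term for ordinal ord; index 0 is reserved
        // for documents with no value in the field
        String[] lookup = {null, "Imbeault M", "Seeley Y"};

        // order[doc] is the ordinal of the field value in document doc.
        // This is the "non-inverted" form: doc -> value, not value -> docs.
        int[] order = {1, 2, 1, 1, 0, 2};

        // Facet counting: one array increment per document
        int[] counts = new int[lookup.length];
        for (int ord : order) {
            counts[ord]++;
        }

        // prints: Imbeault M: 3  then  Seeley Y: 2
        for (int ord = 1; ord < lookup.length; ord++) {
            System.out.println(lookup[ord] + ": " + counts[ord]);
        }
    }
}
```

This also shows why the memory cost scales with the whole field (every document gets an entry in order[], every distinct term an entry in lookup[]), which is the source of the OutOfMemoryError above.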