Re: Filter cache pollution during sharded edismax queries
On 01/10/2014 09:55, jim ferenczi wrote:
> I think you should test with facet.shard.limit=-1. This will disable the
> limit for facets on the shards and remove the need for facet refinements.
> I bet that returning every facet with a count greater than 0 on internal
> queries is cheaper than using the filter cache to handle a lot of
> refinements.

I'm happy to report that in our case setting facet.limit=-1 had a
significant positive impact on performance and cache hit ratios, and
reduced CPU load. Thanks to all who replied!

Cheers

Charlie
Flax

--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk
Re: Filter cache pollution during sharded edismax queries
On 30/09/2014 22:25, Erick Erickson wrote:
> Just from a 20,000 ft. view, using the filterCache this way seems... odd.
> +1 for using a different cache, but that's being quite unfamiliar with
> the code.

Here's a quick update:

1. LFUCache performs worse, so we returned to LRUCache.
2. Making the cache smaller than the default 512 reduced performance.
3. Raising the cache size to 2048 didn't seem to have a significant effect
on performance, but did reduce CPU load significantly. This may help our
client, as they can reduce their system spec considerably.

We're continuing to test with our client, but the upshot is: even if you
think you don't need the filter cache, if you're doing distributed
faceting you probably do, and you should size it based on experimentation.
In our case there is a single filter, but the cache needs to be
considerably larger than that!

Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk
Re: Filter cache pollution during sharded edismax queries
I think you should test with facet.shard.limit=-1. This will disable the
limit for facets on the shards and remove the need for facet refinements.
I bet that returning every facet with a count greater than 0 on internal
queries is cheaper than using the filter cache to handle a lot of
refinements.

Jim
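For concreteness, a request carrying Jim's suggested parameter might be
assembled like this. The collection name and facet field here are made-up
examples; facet.shard.limit and facet.limit are the real Solr parameters
under discussion:

```python
from urllib.parse import urlencode

# Hypothetical request assembly: collection "collection1" and field
# "category" are invented for illustration. facet.shard.limit caps how
# many facet values each shard returns in phase 1 of a distributed facet
# request; -1 removes the cap, so every shard returns full counts and the
# coordinator never needs a refinement round-trip.
params = {
    "q": "*:*",
    "facet": "true",
    "facet.field": "category",
    "facet.limit": 10,          # top-N the client actually wants back
    "facet.shard.limit": -1,    # Jim's suggestion: no per-shard cap
}
query_string = urlencode(params)
url = "http://localhost:8983/solr/collection1/select?" + query_string
print(url)
```

The trade-off, as discussed later in the thread, is bandwidth: on a
high-cardinality field every shard streams back every non-zero count.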
Re: Filter cache pollution during sharded edismax queries
: +1 for using a different cache, but that's being quite unfamiliar with
: the code.

In the common case, people tend to drill down and filter on facet
constraints -- so using a special-purpose cache for the refinements would
result in redundant caching of the same info in multiple places.

: What's the point to refine these counts? I've thought that it make sense
: only for facet.limit ed requests. Is it correct statement? can those who

Refinement only happens if facet.limit is used and there are eligible top
constraints that were not returned by some shards.

: suffer from the low performance, just unlimit facet.limit to avoid that
: distributed hop?

As noted, setting facet.limit=-1 might help for low-cardinality fields, to
ensure that every shard returns a count for every value and no refinement
is needed -- but that doesn't really help you for fields with
unknown/unbounded cardinality.

As part of the distributed pivot faceting work, the amount of overrequest
done in phase 1 (for both facet.pivot & facet.field) was made configurable
via 2 new parameters...

https://lucene.apache.org/solr/4_10_0/solr-solrj/org/apache/solr/common/params/FacetParams.html#FACET_OVERREQUEST_RATIO
https://lucene.apache.org/solr/4_10_0/solr-solrj/org/apache/solr/common/params/FacetParams.html#FACET_OVERREQUEST_COUNT

...so depending on the distribution of your data, you might find that by
adjusting those values to increase the amount of overrequesting done, you
can decrease the amount of refinement needed -- but there are obviously
tradeoffs.

-Hoss
http://www.lucidworks.com/
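The overrequest arithmetic can be sketched roughly as follows. This is an
approximation of the idea, not Solr's implementation: the real formula in
FacetComponent also folds in facet.offset, and the defaults shown
(ratio=1.5, count=10) are taken from the 4.10 javadoc linked above.

```python
# Approximate phase-1 per-shard request size for distributed faceting.
# With the assumed defaults, a shard asked for the top 10 facet values
# actually returns about int(10 * 1.5) + 10 = 25, so a term that is in
# the true top 10 but just outside one shard's local top 10 is still
# likely to be returned, avoiding a refinement round-trip for it.
def phase1_shard_limit(limit, ratio=1.5, count=10):
    if limit < 0:
        return -1  # unlimited: full counts from every shard, no refinement
    return int(limit * ratio) + count

print(phase1_shard_limit(10))             # 25 with the assumed defaults
print(phase1_shard_limit(10, ratio=3.0))  # 40: more overrequest up front,
                                          # fewer refinements afterwards
```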
RE: Filter cache pollution during sharded edismax queries
From: Charlie Hull [char...@flax.co.uk]:
> We've just found a very similar issue at a client installation. They have
> around 27 million documents and are faceting on fields with high
> cardinality, and are unhappy with query performance and the server
> hardware necessary to make this performance acceptable.

I have done some testing on distributed non-pivot faceting and found that
the fine-counting of the top-X terms can be very expensive for some
queries. It seems that for fc-faceting with Strings it is markedly faster
(and non-filter-cache-blowing) to do a standard faceting call and extract
the relevant term counts for fine-counting, instead of processing the
requested terms one at a time. It seems that the same principle might
apply to pivot faceting.

There's a write-up with graphs at
http://sbdevel.wordpress.com/2014/08/26/ten-times-faster/

- Toke Eskildsen
Re: Filter cache pollution during sharded edismax queries
Hoss,

Nice to hear from you! I wonder if there is a sequence chart, or maybe a
deck, which explains the whole picture of distributed search, especially
these parts? If it hasn't been presented to the community so far, I'm
aware of one conference which could accept such a talk. WDYT?

On Wed, Oct 1, 2014 at 9:17 PM, Chris Hostetter
<hossman_luc...@fucit.org> wrote:
> As part of the distributed pivot faceting work, the amount of
> overrequest done in phase 1 (for both facet.pivot & facet.field) was
> made configurable via 2 new parameters... so depending on the
> distribution of your data, you might find that by adjusting those values
> to increase the amount of overrequesting done, you can decrease the
> amount of refinement needed -- but there are obviously tradeoffs.

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Filter cache pollution during sharded edismax queries
Hi,

We've just found a very similar issue at a client installation. They have
around 27 million documents and are faceting on fields with high
cardinality, and are unhappy with query performance and the server
hardware necessary to make this performance acceptable.

Last night we noticed the filter cache had a pretty low hit rate and
seemed to be filling up with many unexpected items (we were testing with
only a *single* actual filter query). Diagnosing this with the showItems
flag set on the Solr admin statistics, we could see entries relating to
facets, even though we were sure we were using the default facet.method=fc
setting that should prevent filters being constructed. We're thus seeing
similar cache pollution to Ken and Anca.

We're trying a different type of cache (LFUCache) now, and may also try
tweaking cache sizes to try and help, as the filter creation seems to be
something we can't easily get round.

cheers

Charlie
Flax
www.flax.co.uk

On 18 October 2013 14:32, Anca Kopetz <anca.kop...@kelkoo.com> wrote:
> Hi Ken,
> Have you managed to find out why these entries were stored in the
> filterCache, and if they have an impact on the hit ratio? We noticed the
> same problem; there are entries of this type:
> item_+(+(title:western^10.0 | ...
> in our filterCache.
> Thanks, Anca
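For reference, the diagnosis above corresponds to a solrconfig.xml entry
along these lines. The sizes are purely illustrative examples, not
recommendations; showItems, which makes the admin statistics list the
cached entries, is supported by solr.FastLRUCache and solr.LFUCache:

```xml
<!-- Sketch of a filterCache entry with showItems enabled for diagnosis.
     size/initialSize/autowarmCount values here are examples only;
     the thread's conclusion is to size by experimentation. -->
<filterCache class="solr.FastLRUCache"
             size="2048"
             initialSize="512"
             autowarmCount="128"
             showItems="100"/>
```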
Re: Filter cache pollution during sharded edismax queries
A bit of digging shows that the extra entries in the filter cache are
added when getting facets from a distributed search. Once all the facets
have been gathered, the co-ordinating node then asks the subnodes for an
exact count for the final top-N facets, and the path for executing this
goes through:

SimpleFacets.getListedTermCounts()
  -> SolrIndexSearcher.numDocs()
  -> SolrIndexSearcher.getPositiveDocSet()

and this last method caches its results in the filter cache. Maybe these
should be using a separate cache?

Alan Woodward
www.flax.co.uk
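A toy model may make the pollution mechanism concrete: one filter-cache
entry appears per term the coordinator asks a shard to fine-count. The
function names below only mirror the Solr methods named above; the bodies
are a simplified stand-in, not Solr code:

```python
# Toy model of the refinement path: each term the coordinator wants an
# exact count for triggers a one-term set lookup that is cached -- one
# filter-cache entry per refined term.
filter_cache = {}

def get_positive_doc_set(term, index):
    # stand-in for SolrIndexSearcher.getPositiveDocSet(): cache-through
    if term not in filter_cache:
        filter_cache[term] = {doc for doc, terms in index.items()
                              if term in terms}
    return filter_cache[term]

def listed_term_counts(terms, base_docs, index):
    # stand-in for SimpleFacets.getListedTermCounts() -> numDocs()
    return {t: len(get_positive_doc_set(t, index) & base_docs)
            for t in terms}

# three docs with a single-valued facet field (invented data)
index = {1: {"red"}, 2: {"red", "blue"}, 3: {"green"}}
counts = listed_term_counts(["red", "blue", "green"], {1, 2, 3}, index)
print(counts)             # {'red': 2, 'blue': 1, 'green': 1}
print(len(filter_cache))  # 3 -- one cached entry per refined term
```

With many shards and high-cardinality fields, this is exactly how a cache
sized for a handful of real filter queries fills up with facet terms.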
Re: Filter cache pollution during sharded edismax queries
On 9/30/2014 4:38 AM, Charlie Hull wrote:
> We're trying a different type of cache (LFUCache) now, and may also try
> tweaking cache sizes to try and help, as the filter creation seems to be
> something we can't easily get round.

Since I was the one who wrote the current LFUCache implementation you'll
find in Solr, I can tell you the implementation is very naive. It
correctly implements LFU, but it does so in a beginning programming
student way. To decide which entry to evict, it must basically sort the
list by the number of times each entry has been used. Because that number
can continually change on each entry, that sort must be done every time an
eviction must happen.

Unless the cache size is very small, I would not expect the performance to
be very good when the cache gets full and it must decide which entries to
evict. I don't know what number qualifies as "very small" ... I'm not sure
I'd go above 32 or 64. As the size goes up, the performance of adding a
new entry to a full cache will go down.

I've got a very efficient new cache implementation in Jira, but haven't
had the time to devote to getting it polished and committed.

Thanks,
Shawn
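Shawn's point about eviction cost can be sketched as follows. This is a
deliberately naive illustration of sort-per-eviction LFU, written for this
thread, not the actual Solr LFUCache code:

```python
# Deliberately naive LFU: evicting from a full cache sorts every entry by
# its hit count, which is the per-eviction cost being described.
class NaiveLFUCache:
    def __init__(self, max_size):
        self.max_size = max_size
        self.data = {}   # key -> value
        self.hits = {}   # key -> access count

    def get(self, key):
        if key in self.data:
            self.hits[key] += 1
            return self.data[key]
        return None

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.max_size:
            # the expensive step: order the whole cache by use count on
            # every insert into a full cache, then drop the coldest entry
            coldest = sorted(self.hits.items(), key=lambda kv: kv[1])[0][0]
            del self.data[coldest]
            del self.hits[coldest]
        self.data[key] = value
        self.hits.setdefault(key, 0)

cache = NaiveLFUCache(max_size=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" now has one hit, "b" has none
cache.put("c", 3)    # full cache: "b" is least-frequently used, evicted
print(sorted(cache.data))  # ['a', 'c']
```

The sort makes each eviction O(n log n) in the cache size, which is why a
large, constantly full cache, exactly the situation this thread describes,
performs poorly with this strategy.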
Re: Filter cache pollution during sharded edismax queries
Hello,

I already saw such a discussion, but want to confirm.

On Tue, Sep 30, 2014 at 2:59 PM, Alan Woodward <a...@flax.co.uk> wrote:
> Once all the facets have been gathered, the co-ordinating node then asks
> the subnodes for an exact count for the final top-N facets,

What's the point of refining these counts? I've thought that it makes
sense only for facet.limit-ed requests. Is that a correct statement? Can
those who suffer from the low performance just unlimit facet.limit to
avoid that distributed hop?

Thanks

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Filter cache pollution during sharded edismax queries
>> Once all the facets have been gathered, the co-ordinating node then
>> asks the subnodes for an exact count for the final top-N facets,
>
> What's the point to refine these counts? I've thought that it make sense
> only for facet.limit ed requests. Is it correct statement? can those who
> suffer from the low performance, just unlimit facet.limit to avoid that
> distributed hop?

Presumably yes, but if you've got a sufficiently high-cardinality field,
then any gains made by missing out the hop will probably be offset by
having to stream all the return values back again.

Alan
Re: Filter cache pollution during sharded edismax queries
Just from a 20,000 ft. view, using the filterCache this way seems... odd.

+1 for using a different cache, but that's being quite unfamiliar with the
code.

On Tue, Sep 30, 2014 at 1:53 PM, Alan Woodward <a...@flax.co.uk> wrote:
> Presumably yes, but if you've got a sufficiently high-cardinality field,
> then any gains made by missing out the hop will probably be offset by
> having to stream all the return values back again.
Re: Filter cache pollution during sharded edismax queries
Hi Ken,

Have you managed to find out why these entries were stored in the
filterCache, and whether they have an impact on the hit ratio?

We noticed the same problem; there are entries of this type:
item_+(+(title:western^10.0 | ...
in our filterCache.

Thanks,
Anca

On 07/02/2013 09:01 PM, Ken Krugler wrote:
> After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit
> ratio had dropped significantly. Previously it was at 95+%, but now it's
> 50%.

Kelkoo SAS
A simplified joint-stock company (Société par Actions Simplifiée) with
capital of €4,168,964.30
Registered office: 8, rue du Sentier, 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended solely for
their addressees. If you are not the intended recipient, please delete it
and notify the sender.
Re: Filter cache pollution during sharded edismax queries
Ken ... i'm not really sure i'm understanding what you're trying to
describe. can you give the full details of a concrete example of what you
are seeing?

* full requestHandler config
* example of query issued by client
* every request logged on each shard
* contents of filterCache and queryResultCache after client's query
finishes

-Hoss
Re: Filter cache pollution during sharded edismax queries
Hi Ken,

JIRA is kind of stuffed. I'd imagine showing more proof on the ML may be
more effective.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm

On Tue, Aug 27, 2013 at 4:32 AM, Ken Krugler
<kkrugler_li...@transpac.com> wrote:
> Sorry I missed your reply, and thanks for trying to find a similar
> report. Wondering if I should file a Jira issue? That might get more
> attention :)
Re: Filter cache pollution during sharded edismax queries
Hi Otis,

Sorry I missed your reply, and thanks for trying to find a similar report.
Wondering if I should file a Jira issue? That might get more attention :)

-- Ken

On Jul 5, 2013, at 1:05pm, Otis Gospodnetic wrote:
> Uh, I left this email until now hoping I could find you a reference to
> similar reports, but I can't find them now. I am quite sure I saw
> somebody with a similar report within the last month. Plus, several
> people have reported issues with performance dropping when they went
> from 3.x to 4.x, and maybe this is why.
Re: Filter cache pollution during sharded edismax queries
Hi Ken,

Uh, I left this email until now hoping I could find you a reference to
similar reports, but I can't find them now. I am quite sure I saw somebody
with a similar report within the last month. Plus, several people have
reported issues with performance dropping when they went from 3.x to 4.x,
and maybe this is why.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm
Filter cache pollution during sharded edismax queries
Hi all,

After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit
ratio had dropped significantly. Previously it was at 95+%, but now it's
50%.

I enabled recording 100 entries for debugging, and in looking at them it
seems that edismax (and faceting) is creating entries for me. This is in a
sharded setup, so it's a distributed search.

If I do a search for the string "bogus text" using edismax on two fields,
I get an entry in each of the shard's filter caches that looks like:

item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2):

Is this expected?

I have a similar situation happening during faceted search, even though my
fields are single-value/untokenized strings, and I'm not using the enum
facet method. But I'll get many, many entries in the filterCache for facet
values, and they all look like item_<facet field>:<facet value>:

The net result of the above is that even with a very big filterCache size
of 2K, the hit ratio is still only 60%.

Thanks for any insights,

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr