Re: Filter cache pollution during sharded edismax queries

2014-10-08 Thread Charlie Hull

On 01/10/2014 09:55, jim ferenczi wrote:

I think you should test with facet.shard.limit=-1; this will disable the
limit for facets on the shards and remove the need for facet
refinements. I bet that returning every facet value with a count greater
than 0 on internal queries is cheaper than using the filter cache to
handle a lot of refinements.


I'm happy to report that in our case setting facet.limit=-1 significantly
improved performance and cache hit ratios, and reduced CPU load. Thanks to
all who replied!


Cheers

Charlie
Flax


Jim

2014-10-01 10:24 GMT+02:00 Charlie Hull char...@flax.co.uk:


On 30/09/2014 22:25, Erick Erickson wrote:


Just from a 20,000 ft. view, using the filterCache this way seems...odd.

+1 for using a different cache, but that's being quite unfamiliar with the
code.



Here's a quick update:

1. LFUCache performed worse, so we returned to LRUCache.
2. Making the cache smaller than the default 512 reduced performance.
3. Raising the cache size to 2048 didn't seem to have a significant effect
on performance but did reduce CPU load significantly. This may help our
client, as they can reduce their system spec considerably.

We're continuing to test with our client, but the upshot is that even if
you think you don't need the filter cache, if you're doing distributed
faceting you probably do, and you should size it based on experimentation.
In our case there is a single filter but the cache needs to be considerably
larger than that!

Cheers

Charlie




On Tue, Sep 30, 2014 at 1:53 PM, Alan Woodward a...@flax.co.uk wrote:





  Once all the facets have been gathered, the co-ordinating node then asks
the subnodes for an exact count for the final top-N facets,




What's the point of refining these counts? I thought it only makes sense
for facet.limit-ed requests. Is that correct? Can those who suffer from
low performance simply remove facet.limit to avoid that distributed hop?



Presumably yes, but if you've got a sufficiently high cardinality field
then any gains made by missing out the hop will probably be offset by
having to stream all the return values back again.

Alan


  --

Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Filter cache pollution during sharded edismax queries

2014-10-01 Thread Charlie Hull

On 30/09/2014 22:25, Erick Erickson wrote:

Just from a 20,000 ft. view, using the filterCache this way seems...odd.

+1 for using a different cache, but that's being quite unfamiliar with the
code.


Here's a quick update:

1. LFUCache performed worse, so we returned to LRUCache.
2. Making the cache smaller than the default 512 reduced performance.
3. Raising the cache size to 2048 didn't seem to have a significant
effect on performance but did reduce CPU load significantly. This may
help our client, as they can reduce their system spec considerably.


We're continuing to test with our client, but the upshot is that even if 
you think you don't need the filter cache, if you're doing distributed 
faceting you probably do, and you should size it based on 
experimentation. In our case there is a single filter but the cache 
needs to be considerably larger than that!


Cheers

Charlie



On Tue, Sep 30, 2014 at 1:53 PM, Alan Woodward a...@flax.co.uk wrote:






Once all the facets have been gathered, the co-ordinating node then asks
the subnodes for an exact count for the final top-N facets,



What's the point of refining these counts? I thought it only makes sense
for facet.limit-ed requests. Is that correct? Can those who suffer from
low performance simply remove facet.limit to avoid that distributed hop?


Presumably yes, but if you've got a sufficiently high cardinality field
then any gains made by missing out the hop will probably be offset by
having to stream all the return values back again.

Alan



--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Filter cache pollution during sharded edismax queries

2014-10-01 Thread jim ferenczi
I think you should test with facet.shard.limit=-1; this will disable the
limit for facets on the shards and remove the need for facet
refinements. I bet that returning every facet value with a count greater
than 0 on internal queries is cheaper than using the filter cache to
handle a lot of refinements.
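
For clarity, jim's suggestion is just an extra parameter on the search
request. Note facet.shard.limit is normally set internally by the
co-ordinating node on its shard requests; the collection and field names
below are placeholders:

```
/solr/collection1/select?q=*:*
    &facet=true&facet.field=brand
    &facet.limit=100        (top-N the client actually wants)
    &facet.shard.limit=-1   (each shard returns every term with count > 0)
```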

Jim

2014-10-01 10:24 GMT+02:00 Charlie Hull char...@flax.co.uk:

 On 30/09/2014 22:25, Erick Erickson wrote:

 Just from a 20,000 ft. view, using the filterCache this way seems...odd.

 +1 for using a different cache, but that's being quite unfamiliar with the
 code.


 Here's a quick update:

 1. LFUCache performed worse, so we returned to LRUCache.
 2. Making the cache smaller than the default 512 reduced performance.
 3. Raising the cache size to 2048 didn't seem to have a significant effect
 on performance but did reduce CPU load significantly. This may help our
 client, as they can reduce their system spec considerably.

 We're continuing to test with our client, but the upshot is that even if
 you think you don't need the filter cache, if you're doing distributed
 faceting you probably do, and you should size it based on experimentation.
 In our case there is a single filter but the cache needs to be considerably
 larger than that!

 Cheers

 Charlie



 On Tue, Sep 30, 2014 at 1:53 PM, Alan Woodward a...@flax.co.uk wrote:



  Once all the facets have been gathered, the co-ordinating node then asks
 the subnodes for an exact count for the final top-N facets,



 What's the point of refining these counts? I thought it only makes sense
 for facet.limit-ed requests. Is that correct? Can those who suffer from
 low performance simply remove facet.limit to avoid that distributed hop?


 Presumably yes, but if you've got a sufficiently high cardinality field
 then any gains made by missing out the hop will probably be offset by
 having to stream all the return values back again.

 Alan


  --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com

 --
 Charlie Hull
 Flax - Open Source Enterprise Search

 tel/fax: +44 (0)8700 118334
 mobile:  +44 (0)7767 825828
 web: www.flax.co.uk



Re: Filter cache pollution during sharded edismax queries

2014-10-01 Thread Chris Hostetter

: +1 for using a different cache, but that's being quite unfamiliar with the
: code.

in the common case, people tend to drill down and filter on facet
constraints -- so using a special-purpose cache for the refinements would
result in redundant caching of the same info in multiple places.

:   What's the point of refining these counts? I thought it only makes sense
:   for facet.limit-ed requests. Is that correct? Can those who

refinement only happens if facet.limit is used and there are eligible
top constraints that were not returned by some shards.

:   suffer from low performance simply remove facet.limit to avoid that
:   distributed hop?

As noted, setting facet.limit=-1 might help for low cardinality fields to
ensure that every shard returns a count for every value and no refinement
is needed, but that doesn't really help you for fields with
unknown/unbounded cardinality.

As part of the distributed pivot faceting work, the amount of
overrequest done in phase 1 (for both facet.pivot & facet.field) was
made configurable via 2 new parameters...

https://lucene.apache.org/solr/4_10_0/solr-solrj/org/apache/solr/common/params/FacetParams.html#FACET_OVERREQUEST_RATIO
https://lucene.apache.org/solr/4_10_0/solr-solrj/org/apache/solr/common/params/FacetParams.html#FACET_OVERREQUEST_COUNT

...so depending on the distribution of your data, you might find that by 
adjusting those values to increase the amount of overrequesting done, you 
can decrease the amount of refinement needed -- but there are obviously 
tradeoffs.
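
As a rough sketch of what those two parameters control: the 1.5 ratio and
+10 count used below are the defaults documented for Solr 4.10's distributed
faceting, but treat the exact formula as an approximation of the idea, not a
copy of the Solr source:

```python
def shard_facet_limit(requested_limit, ratio=1.5, count=10):
    """Approximate phase-1 per-shard facet limit with overrequesting.

    ratio/count stand in for facet.overrequest.ratio and
    facet.overrequest.count; raising them makes each shard return more
    candidate terms, which lowers the odds that refinement is needed.
    """
    if requested_limit < 0:
        return -1  # facet.limit=-1: unlimited, so nothing to overrequest
    return int(requested_limit * ratio) + count

print(shard_facet_limit(10))        # default overrequest: 25 terms per shard
print(shard_facet_limit(10, 3.0))   # more overrequest, fewer refinements: 40
```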



-Hoss
http://www.lucidworks.com/


RE: Filter cache pollution during sharded edismax queries

2014-10-01 Thread Toke Eskildsen
From: Charlie Hull [char...@flax.co.uk]:
 We've just found a very similar issue at a client installation. They have
 around 27 million documents and are faceting on fields with high
 cardinality, and are unhappy with query performance and the server hardware
 necessary to make this performance acceptable.

I have done some testing on distributed non-pivot faceting and found that the 
fine-counting of the top-X terms can be very expensive for some queries. It 
seems that for fc-faceting with Strings it is markedly faster (and 
non-filter-cache-blowing) to do a standard faceting call and extract the 
relevant term counts for fine-counting instead of processing the requested 
terms one at a time. It seems that the same principle might apply to pivot 
faceting.
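
Toke's approach can be sketched as follows: instead of issuing one
filter-intersection query per candidate term (which is what fills the
filterCache), run a single ordinary facet pass over the hits and pick out the
counts for the requested terms. The field and term names are made up:

```python
from collections import Counter

def facet_counts(docs, field):
    """One pass over the matching docs: term -> count (facet.limit=-1 style)."""
    return Counter(d[field] for d in docs if field in d)

def fine_counts(docs, field, candidate_terms):
    """Fine-count only the candidates, from the single facet pass above."""
    counts = facet_counts(docs, field)
    return {t: counts.get(t, 0) for t in candidate_terms}

hits = [{"brand": "acme"}, {"brand": "acme"}, {"brand": "zenith"}]
print(fine_counts(hits, "brand", ["acme", "nadir"]))  # {'acme': 2, 'nadir': 0}
```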

There's a write-up with graphs at
http://sbdevel.wordpress.com/2014/08/26/ten-times-faster/

- Toke Eskildsen


Re: Filter cache pollution during sharded edismax queries

2014-10-01 Thread Mikhail Khludnev
Hoss,

Nice to hear from you! I wonder if there is a sequence chart, or maybe a
deck, that explains the whole picture of distributed search, especially
these phases?
If it hasn't been presented to the community so far, I know of one
conference that would accept such a talk. WDYT?

On Wed, Oct 1, 2014 at 9:17 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : +1 for using a different cache, but that's being quite unfamiliar with
 : the code.

 in the common case, people tend to drill down and filter on facet
 constraints -- so using a special-purpose cache for the refinements would
 result in redundant caching of the same info in multiple places.

 :   What's the point of refining these counts? I thought it only makes
 :   sense for facet.limit-ed requests. Is that correct? Can those who

 refinement only happens if facet.limit is used and there are eligible
 top constraints that were not returned by some shards.

 :   suffer from low performance simply remove facet.limit to avoid that
 :   distributed hop?

 As noted, setting facet.limit=-1 might help for low cardinality fields to
 ensure that every shard returns a count for every value and no refinement
 is needed, but that doesn't really help you for fields with
 unknown/unbounded cardinality.

 As part of the distributed pivot faceting work, the amount of
 overrequest done in phase 1 (for both facet.pivot & facet.field) was
 made configurable via 2 new parameters...


 https://lucene.apache.org/solr/4_10_0/solr-solrj/org/apache/solr/common/params/FacetParams.html#FACET_OVERREQUEST_RATIO

 https://lucene.apache.org/solr/4_10_0/solr-solrj/org/apache/solr/common/params/FacetParams.html#FACET_OVERREQUEST_COUNT

 ...so depending on the distribution of your data, you might find that by
 adjusting those values to increase the amount of overrequesting done, you
 can decrease the amount of refinement needed -- but there are obviously
 tradeoffs.



 -Hoss
 http://www.lucidworks.com/




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: Filter cache pollution during sharded edismax queries

2014-09-30 Thread Charlie Hull
Hi,

We've just found a very similar issue at a client installation. They have
around 27 million documents and are faceting on fields with high
cardinality, and are unhappy with query performance and the server hardware
necessary to make this performance acceptable. Last night we noticed the
filter cache had a pretty low hit rate and seemed to be filling up with
many unexpected items (we were testing with only a *single* actual filter
query). Diagnosing this with the showItems flag set on the Solr admin
statistics we could see entries relating to facets, even though we were
sure we were using the default facet.method=fc setting that should prevent
filters being constructed. We're thus seeing similar cache pollution to Ken
and Anca.

We're trying a different type of cache (LFUCache) now and also may try
tweaking cache sizes to try and help, as the filter creation seems to be
something we can't easily get round.

cheers

Charlie
Flax
www.flax.co.uk

On 18 October 2013 14:32, Anca Kopetz anca.kop...@kelkoo.com wrote:

 Hi Ken,

 Have you managed to find out why these entries were stored in the
 filterCache and whether they have an impact on the hit ratio?
 We noticed the same problem; there are entries of this type:
 item_+(+(title:western^10.0 | ... in our filterCache.

 Thanks,
 Anca


 On 07/02/2013 09:01 PM, Ken Krugler wrote:

 Hi all,

 After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit
 ratio had dropped significantly.

 Previously it was at 95+%, but now it's < 50%.

 I enabled recording 100 entries for debugging, and in looking at them it
 seems that edismax (and faceting) is creating entries for me.

 This is in a sharded setup, so it's a distributed search.

 If I do a search for the string "bogus text" using edismax on two fields,
 I get an entry in each of the shard's filter caches that looks like:

 item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2):

 Is this expected?

 I have a similar situation happening during faceted search, even though my
 fields are single-value/untokenized strings, and I'm not using the enum
 facet method.

 But I'll get many, many entries in the filterCache for facet values, and
 they all look like item_<facet field>:<facet value>.

 The net result of the above is that even with a very big filterCache size
 of 2K, the hit ratio is still only 60%.

 Thanks for any insights,

 -- Ken

 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr

 Kelkoo SAS
 A simplified joint-stock company (Société par Actions Simplifiée)
 with share capital of €4,168,964.30
 Registered office: 8, rue du Sentier, 75002 Paris
 425 093 069 RCS Paris

 This message and its attachments are confidential and intended solely for
 their addressees. If you are not the intended recipient, please delete it
 and notify the sender.



Re: Filter cache pollution during sharded edismax queries

2014-09-30 Thread Alan Woodward
A bit of digging shows that the extra entries in the filter cache are added
when getting facets from a distributed search.  Once all the facets have been
gathered, the co-ordinating node then asks the subnodes for an exact count for
the final top-N facets, and the path for executing this goes through:
SimpleFacets.getListedTermCounts()
-> SolrIndexSearcher.numDocs()
-> SolrIndexSearcher.getPositiveDocSet()
and this last method caches its results in the filter cache.

Maybe these should be using a separate cache?
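
The two-phase flow Alan traces can be sketched as follows: each shard returns
its top terms, the co-ordinator merges them, and any final top-N term that some
shard never reported needs an exact recount there (the refinement step that,
via SolrIndexSearcher.getPositiveDocSet(), lands in the filterCache). Field
values here are invented for illustration:

```python
from collections import Counter

def coordinate_facets(shard_results, top_n):
    """Merge per-shard facet counts and find terms needing refinement."""
    merged = Counter()
    for counts in shard_results:
        merged.update(counts)
    top = [t for t, _ in merged.most_common(top_n)]
    # refinement: a top term missing from a shard's response must be
    # recounted exactly on that shard
    needs_refinement = {
        t: [i for i, counts in enumerate(shard_results) if t not in counts]
        for t in top
    }
    return top, {t: s for t, s in needs_refinement.items() if s}

shard1 = {"red": 5, "blue": 3}
shard2 = {"red": 2, "green": 4}
top, refine = coordinate_facets([shard1, shard2], top_n=2)
print(top)     # ['red', 'green'] -- red=7, green=4, blue=3
print(refine)  # {'green': [0]}: shard 0 never reported a count for 'green'
```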

Alan Woodward
www.flax.co.uk


On 30 Sep 2014, at 11:38, Charlie Hull wrote:

 Hi,
 
 We've just found a very similar issue at a client installation. They have
 around 27 million documents and are faceting on fields with high
 cardinality, and are unhappy with query performance and the server hardware
 necessary to make this performance acceptable. Last night we noticed the
 filter cache had a pretty low hit rate and seemed to be filling up with
 many unexpected items (we were testing with only a *single* actual filter
 query). Diagnosing this with the showItems flag set on the Solr admin
 statistics we could see entries relating to facets, even though we were
 sure we were using the default facet.method=fc setting that should prevent
 filters being constructed. We're thus seeing similar cache pollution to Ken
 and Anca.
 
 We're trying a different type of cache (LFUCache) now and also may try
 tweaking cache sizes to try and help, as the filter creation seems to be
 something we can't easily get round.
 
 cheers
 
 Charlie
 Flax
 www.flax.co.uk
 
 On 18 October 2013 14:32, Anca Kopetz anca.kop...@kelkoo.com wrote:
 
 Hi Ken,
 
 Have you managed to find out why these entries were stored in the
 filterCache and whether they have an impact on the hit ratio?
 We noticed the same problem; there are entries of this type:
 item_+(+(title:western^10.0 | ... in our filterCache.
 
 Thanks,
 Anca
 
 
 On 07/02/2013 09:01 PM, Ken Krugler wrote:
 
 Hi all,
 
 After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit
 ratio had dropped significantly.
 
 Previously it was at 95+%, but now it's < 50%.
 
 I enabled recording 100 entries for debugging, and in looking at them it
 seems that edismax (and faceting) is creating entries for me.
 
 This is in a sharded setup, so it's a distributed search.
 
 If I do a search for the string "bogus text" using edismax on two fields,
 I get an entry in each of the shard's filter caches that looks like:
 
 item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2):
 
 Is this expected?
 
 I have a similar situation happening during faceted search, even though my
 fields are single-value/untokenized strings, and I'm not using the enum
 facet method.
 
 But I'll get many, many entries in the filterCache for facet values, and
 they all look like item_<facet field>:<facet value>.
 
 The net result of the above is that even with a very big filterCache size
 of 2K, the hit ratio is still only 60%.
 
 Thanks for any insights,
 
 -- Ken
 
 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr
 
 



Re: Filter cache pollution during sharded edismax queries

2014-09-30 Thread Shawn Heisey
On 9/30/2014 4:38 AM, Charlie Hull wrote:
 We've just found a very similar issue at a client installation. They have
 around 27 million documents and are faceting on fields with high
 cardinality, and are unhappy with query performance and the server hardware
 necessary to make this performance acceptable. Last night we noticed the
 filter cache had a pretty low hit rate and seemed to be filling up with
 many unexpected items (we were testing with only a *single* actual filter
 query). Diagnosing this with the showItems flag set on the Solr admin
 statistics we could see entries relating to facets, even though we were
 sure we were using the default facet.method=fc setting that should prevent
 filters being constructed. We're thus seeing similar cache pollution to Ken
 and Anca.
 
 We're trying a different type of cache (LFUCache) now and also may try
 tweaking cache sizes to try and help, as the filter creation seems to be
 something we can't easily get round.

Since I was the one who wrote the current LFUCache implementation you'll
find in Solr, I can tell you the implementation is very naive.  It
correctly implements LFU, but it does so in a "beginning programming
student" way.  To decide which entry to evict, it must basically sort
the list by the number of times each entry has been used.  Because that
number can change continually on each entry, that sort must be done
every time an eviction happens.

Unless the cache size is very small, I would not expect the performance
to be very good when the cache gets full and it must decide which
entries to evict.  I don't know what number qualifies as "very small"
... I'm not sure I'd go above 32 or 64.  As the size goes up, the
performance of adding a new entry to a full cache will go down.
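
The eviction cost Shawn describes can be sketched like this: usage counts
change on every lookup, so picking a victim means scanning (or sorting) the
whole cache on each overflowing insert. This is a toy illustration of the
idea, not Solr's actual code:

```python
class NaiveLFUCache:
    """LFU with a full scan per eviction -- O(n) (or O(n log n) if sorted)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.data = {}   # key -> value
        self.hits = {}   # key -> times used

    def get(self, key):
        if key in self.data:
            self.hits[key] += 1
            return self.data[key]
        return None

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.max_size:
            # the expensive part: order all entries by usage to pick a victim
            victim = min(self.hits, key=self.hits.get)
            del self.data[victim], self.hits[victim]
        self.data[key] = value
        self.hits.setdefault(key, 0)

cache = NaiveLFUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")                # "a" now used once more than "b"
cache.put("c", 3)             # evicts "b", the least-frequently-used entry
print(sorted(cache.data))     # ['a', 'c']
```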

I've got a very efficient new cache implementation in Jira, but haven't
had the time to devote to getting it polished and committed.

Thanks,
Shawn



Re: Filter cache pollution during sharded edismax queries

2014-09-30 Thread Mikhail Khludnev
Hello,

I have seen such a discussion before, but want to confirm.

On Tue, Sep 30, 2014 at 2:59 PM, Alan Woodward a...@flax.co.uk wrote:

 Once all the facets have been gathered, the co-ordinating node then asks
 the subnodes for an exact count for the final top-N facets,


What's the point of refining these counts? I thought it only makes sense
for facet.limit-ed requests. Is that correct? Can those who suffer from
low performance simply remove facet.limit to avoid that distributed hop?
Thanks


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: Filter cache pollution during sharded edismax queries

2014-09-30 Thread Alan Woodward

 
 Once all the facets have been gathered, the co-ordinating node then asks
 the subnodes for an exact count for the final top-N facets,
 
 
 What's the point of refining these counts? I thought it only makes sense
 for facet.limit-ed requests. Is that correct? Can those who suffer from
 low performance simply remove facet.limit to avoid that distributed hop?

Presumably yes, but if you've got a sufficiently high cardinality field then 
any gains made by missing out the hop will probably be offset by having to 
stream all the return values back again.

Alan


 -- 
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics
 
 http://www.griddynamics.com
 mkhlud...@griddynamics.com



Re: Filter cache pollution during sharded edismax queries

2014-09-30 Thread Erick Erickson
Just from a 20,000 ft. view, using the filterCache this way seems...odd.

+1 for using a different cache, but that's being quite unfamiliar with the
code.

On Tue, Sep 30, 2014 at 1:53 PM, Alan Woodward a...@flax.co.uk wrote:


 
  Once all the facets have been gathered, the co-ordinating node then asks
  the subnodes for an exact count for the final top-N facets,
 
 
  What's the point of refining these counts? I thought it only makes sense
  for facet.limit-ed requests. Is that correct? Can those who suffer from
  low performance simply remove facet.limit to avoid that distributed hop?

 Presumably yes, but if you've got a sufficiently high cardinality field
 then any gains made by missing out the hop will probably be offset by
 having to stream all the return values back again.

 Alan


  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
  mkhlud...@griddynamics.com




Re: Filter cache pollution during sharded edismax queries

2013-10-18 Thread Anca Kopetz

Hi Ken,

Have you managed to find out why these entries were stored in the filterCache
and whether they have an impact on the hit ratio?
We noticed the same problem; there are entries of this type:
item_+(+(title:western^10.0 | ... in our filterCache.

Thanks,
Anca

On 07/02/2013 09:01 PM, Ken Krugler wrote:

Hi all,

After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit ratio had 
dropped significantly.

Previously it was at 95+%, but now it's < 50%.

I enabled recording 100 entries for debugging, and in looking at them it seems 
that edismax (and faceting) is creating entries for me.

This is in a sharded setup, so it's a distributed search.

If I do a search for the string "bogus text" using edismax on two fields, I get
an entry in each of the shard's filter caches that looks like:

item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2):

Is this expected?

I have a similar situation happening during faceted search, even though my 
fields are single-value/untokenized strings, and I'm not using the enum facet 
method.

But I'll get many, many entries in the filterCache for facet values, and they
all look like item_<facet field>:<facet value>.

The net result of the above is that even with a very big filterCache size of 
2K, the hit ratio is still only 60%.

Thanks for any insights,

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr



Re: Filter cache pollution during sharded edismax queries

2013-08-28 Thread Chris Hostetter

Ken ... i'm not really sure i'm understanding what you're trying to 
describe.  can you give the full details of a concrete example of what you 
are seeing?

* full requestHandler config
* example of query issued by client
* every request logged on each shard
* contents of filterCache and queryResultCache after client's query finishes


-Hoss


Re: Filter cache pollution during sharded edismax queries

2013-08-27 Thread Otis Gospodnetic
Hi Ken,

JIRA is kind of stuffed.  I'd imagine showing more proof on the ML may
be more effective.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Tue, Aug 27, 2013 at 4:32 AM, Ken Krugler
kkrugler_li...@transpac.com wrote:
 Hi Otis,

 Sorry I missed your reply, and thanks for trying to find a similar report.

 Wondering if I should file a Jira issue? That might get more attention :)

 -- Ken

 On Jul 5, 2013, at 1:05pm, Otis Gospodnetic wrote:

 Hi Ken,

 Uh, I left this email until now hoping I could find you a reference to
 similar reports, but I can't find them now.  I am quite sure I saw
 somebody with a similar report within the last month.  Plus, several
 people have reported issues with performance dropping when they went
 from 3.x to 4.x and maybe this is why.

 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm



 On Tue, Jul 2, 2013 at 3:01 PM, Ken Krugler kkrugler_li...@transpac.com 
 wrote:
 Hi all,

 After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit ratio 
 had dropped significantly.

 Previously it was at 95+%, but now it's < 50%.

 I enabled recording 100 entries for debugging, and in looking at them it 
 seems that edismax (and faceting) is creating entries for me.

 This is in a sharded setup, so it's a distributed search.

 If I do a search for the string "bogus text" using edismax on two fields, I
 get an entry in each of the shard's filter caches that looks like:

 item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2):

 Is this expected?

 I have a similar situation happening during faceted search, even though my 
 fields are single-value/untokenized strings, and I'm not using the enum 
 facet method.

 But I'll get many, many entries in the filterCache for facet values, and
 they all look like item_<facet field>:<facet value>.

 The net result of the above is that even with a very big filterCache size 
 of 2K, the hit ratio is still only 60%.

 Thanks for any insights,

 -- Ken

 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr

Re: Filter cache pollution during sharded edismax queries

2013-08-26 Thread Ken Krugler
Hi Otis,

Sorry I missed your reply, and thanks for trying to find a similar report.

Wondering if I should file a Jira issue? That might get more attention :)

-- Ken

On Jul 5, 2013, at 1:05pm, Otis Gospodnetic wrote:

 Hi Ken,
 
 Uh, I left this email until now hoping I could find you a reference to
 similar reports, but I can't find them now.  I am quite sure I saw
 somebody with a similar report within the last month.  Plus, several
 people have reported issues with performance dropping when they went
 from 3.x to 4.x and maybe this is why.
 
 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm
 
 
 
 On Tue, Jul 2, 2013 at 3:01 PM, Ken Krugler kkrugler_li...@transpac.com 
 wrote:
 Hi all,
 
 After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit ratio 
 had dropped significantly.
 
 Previously it was at 95+%, but now it's < 50%.
 
 I enabled recording 100 entries for debugging, and in looking at them it 
 seems that edismax (and faceting) is creating entries for me.
 
 This is in a sharded setup, so it's a distributed search.
 
 If I do a search for the string "bogus text" using edismax on two fields, I
 get an entry in each of the shard's filter caches that looks like:
 
 item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2):
 
 Is this expected?
 
 I have a similar situation happening during faceted search, even though my 
 fields are single-value/untokenized strings, and I'm not using the enum 
 facet method.
 
 But I'll get many, many entries in the filterCache for facet values,
 and they all look like item_<facet field>:<facet value>.
 
 The net result of the above is that even with a very big filterCache size of 
 2K, the hit ratio is still only 60%.
 
 Thanks for any insights,
 
 -- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

Re: Filter cache pollution during sharded edismax queries

2013-07-05 Thread Otis Gospodnetic
Hi Ken,

Uh, I left this email until now hoping I could find you a reference to
similar reports, but I can't find them now.  I am quite sure I saw
somebody with a similar report within the last month.  Plus, several
people have reported issues with performance dropping when they went
from 3.x to 4.x and maybe this is why.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Tue, Jul 2, 2013 at 3:01 PM, Ken Krugler kkrugler_li...@transpac.com wrote:
 Hi all,

 After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit ratio 
 had dropped significantly.

 Previously it was at 95+%, but now it's < 50%.

 I enabled recording 100 entries for debugging, and in looking at them it 
 seems that edismax (and faceting) is creating entries for me.

 This is in a sharded setup, so it's a distributed search.

 If I do a search for the string "bogus text" using edismax on two fields, I
 get an entry in each of the shard's filter caches that looks like:

 item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2):

 Is this expected?

 I have a similar situation happening during faceted search, even though my 
 fields are single-value/untokenized strings, and I'm not using the enum facet 
 method.

 But I'll get many, many entries in the filterCache for facet values, and they
 all look like item_<facet field>:<facet value>.

 The net result of the above is that even with a very big filterCache size of 
 2K, the hit ratio is still only 60%.

 Thanks for any insights,

 -- Ken

 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr

Filter cache pollution during sharded edismax queries

2013-07-02 Thread Ken Krugler
Hi all,

After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit ratio had 
dropped significantly.

Previously it was at 95+%, but now it's < 50%.

I enabled recording 100 entries for debugging, and in looking at them it seems 
that edismax (and faceting) is creating entries for me.

This is in a sharded setup, so it's a distributed search.

If I do a search for the string "bogus text" using edismax on two fields, I get
an entry in each of the shard's filter caches that looks like:

item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2):

Is this expected?

I have a similar situation happening during faceted search, even though my 
fields are single-value/untokenized strings, and I'm not using the enum facet 
method.

But I'll get many, many entries in the filterCache for facet values, and they
all look like item_<facet field>:<facet value>.

The net result of the above is that even with a very big filterCache size of 
2K, the hit ratio is still only 60%.

Thanks for any insights,

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr