Re: enable disable filter query caching based on statistics

Alessandro Benedetti Fri, 08 Jan 2016 15:39:07 -0800

I read that the client was happy, so I am only curious to know more :)
Readability aside, shouldn't it be more efficient to put the filters
directly in the main query if you don't cache them?
(Checking the code, a non-cached filter is added as a Lucene boolean query
clause with a score of exactly 0, so maybe this indicates that the claim
no longer holds at the current stage.
In the past it was a better approach than keeping them in separate filters.)
How do you specify that a filter should be a postFilter and run only over the
query result cache?
Of course, I don't know whether you are excluding filters via tags or have
some other requirements.
I saw you reported the gain in rpm; what about the query time?
The rest of the issue is also covered by a comment in the Solr source
code:


org/apache/solr/search/SolrIndexSearcher.java:1597
...

// now actually use the filter cache.
// for large filters that match few documents, this may be
// slower than simply re-executing the query.
if (out.docSet == null) {
  out.docSet = getDocSet(cmd.getQuery(), cmd.getFilter());
  DocSet bigFilt = getDocSet(cmd.getFilterList());
  if (bigFilt != null) out.docSet = out.docSet.intersection(bigFilt);
}

...

Cheers


Binoy:

bq: In such a case won't applying fqs normally be the same as applying
them as post filters

Certainly not, at least AFAIK...

By definition, regular FQs are calculated over the entire corpus
(not, NOT just the docs that satisfy the query). Then that entire
bitset is stored in the filterCache where it can be reused. Which
is why filterCache entries can be used for different queries.

Also by definition, post filters are _not_ calculated over the
entire corpus, they are only calculated for docs that
1> pass the query criteria
and
2> pass all lower-cost filters
so they will not apply at all to the next query, are not stored in
the filterCache etc.
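
A toy model of the difference described above, assuming integer doc IDs
and filters written as plain Python predicates. None of these names are
Solr APIs; the sketch only contrasts the two evaluation orders.

```python
# Toy model only: doc IDs are integers, queries and filters are predicates.
CORPUS = range(1_000)  # hypothetical corpus of 1000 documents

filter_cache = {}  # stands in for the filterCache: key -> set of matching doc IDs


def cached_filter(key, predicate):
    """Regular fq: evaluated over the ENTIRE corpus once, then reused."""
    if key not in filter_cache:
        filter_cache[key] = {doc for doc in CORPUS if predicate(doc)}
    return filter_cache[key]


def search_with_cached_fq(query, key, predicate):
    """Intersect the query matches with the (possibly cached) filter set."""
    matches = {doc for doc in CORPUS if query(doc)}
    return matches & cached_filter(key, predicate)


def search_with_post_filter(query, predicate, calls):
    """Post filter: tested only on docs that already passed the query."""
    result = set()
    for doc in CORPUS:
        if query(doc):
            calls.append(doc)  # record every doc the filter is asked about
            if predicate(doc):
                result.add(doc)
    return result


def is_even(doc):
    return doc % 2 == 0


def q1(doc):  # restrictive query: matches 10 docs
    return doc < 10


def q2(doc):  # a different query that can reuse the cached filter
    return doc >= 990


calls = []
a = search_with_cached_fq(q1, "even", is_even)
b = search_with_post_filter(q1, is_even, calls)
```

Both searches return the same documents, but the cached filter holds an
entry covering the whole corpus (reusable by q2), while the post filter
was consulted only for the docs that matched q1 and leaves nothing behind.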

So I think what Matteo is seeing is that with a restrictive FQ clause,
very few docs have to be tested against most of the FQs.

Matteo:

My guess (and I'm not intimately familiar with the code) is that, indeed
the restrictive clause is helping you a lot here. Frankly I doubt if
adding a cost will make a measurable difference if the most restrictive
FQ clause is quite sparse....

I'm still puzzled in your test scenario why there is such a difference when
making all the filter queries cache=false. _Assuming_ that provincia and type
are relatively low-cardinality fields, they should all be in the filterCache
pretty quickly. But perhaps ANDing the bitsets together costs more than the
cache saves in this case. I'd be curious as to the hit ratio you were seeing.
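
To make the trade-off concrete, here is a simplified sketch of a
leapfrog-style intersection driven by the sparsest list. It assumes
sorted doc-ID lists and made-up field contents, and it is not how Lucene
implements its iterators. The point is only that the work scales with the
number of docs in the most selective clause, whereas ANDing two bitsets
touches every word of both bitsets regardless.

```python
from bisect import bisect_left


def leapfrog(sorted_lists):
    """Intersect sorted doc-ID lists, skipping ahead in the larger lists.

    The first list is assumed to be the sparsest one. Returns the
    intersection and the number of skip-ahead probes performed.
    """
    lead, others = sorted_lists[0], sorted_lists[1:]
    positions = [0] * len(others)
    result, probes = [], 0
    for doc in lead:
        found = True
        for i, lst in enumerate(others):
            # Binary search forward from the last position in this list.
            positions[i] = bisect_left(lst, doc, positions[i])
            probes += 1
            if positions[i] == len(lst) or lst[positions[i]] != doc:
                found = False
                break
        if found:
            result.append(doc)
    return result, probes


# Hypothetical data: a highly selective clause and two broad ones.
n_rea = [12_342, 50_001]               # two matching docs
provincia = list(range(0, 100_000, 3))  # one third of the corpus
type_ = list(range(0, 100_000, 2))      # half of the corpus

result, probes = leapfrog([n_rea, provincia, type_])
```

With two docs in the selective clause, only a handful of probes are needed,
no matter how many docs the other two clauses match.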

But as you say, if the client is satisfied I'm not sure it's worth
pursuing...

Best,
Erick

On Tue, Jan 5, 2016 at 11:09 AM, Matteo Grolla <matteo.gro...@gmail.com>
wrote:
> Hi Erick,
>      the test was done on thousands of queries of that kind and millions of
> docs.
> I went from <1500 qpm to ~6000 qpm on modest virtualized hardware (CPU
> bound, and CPU was scarce).
> After that the customer was happy, time ran out, and I didn't go further, but
> cost is definitely something I'd try.
> When I saw the CloudSearch presentation where they explained that they
> were enabling/disabling caching based on fq statistics, I thought this kind
> of problem was general enough that I could find a plugin already built.
>
> 2016-01-05 19:17 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:
>
>>
>> &fq={!cache=false}n_rea:xxx&fq={!cache=false}provincia:yyyy,fq={!cache=false}type:zzzz
>>
>> You have a comma in front of the last fq clause, typo?
>>
>> Well, the whole point of caching filter queries is so that the
>> _second_ time you use it,
>> very little work has to be done. That comes at a cost of course for
>> first-time execution.
>> Basically any fq clause that you can guarantee won't be re-used should
>> have cache=false
>> set.
>>
>> I'd be surprised if the second time you use the provincia and type fq
>> clauses not caching
>> would be faster, but I've been surprised before. I guess anding two
>> bitsets together could
>> take more time than, say, testing a small number of individual
>> documents....
>>
>> And I'm assuming that you're testing multiple queries rather than just
>> one-offs.
>>
>> If you _do_ know that some of your clauses are very restrictive, I
>> wonder what happens if
>> you add a cost in. fq's are evaluated in cost order (when
>> cache=false), so what happens
>> in this case?
>> &fq={!cache=false cost=101}n_rea:xxx&fq={!cache=false cost=102}provincia:yyyy&fq={!cache=false cost=103}type:zzzz
>>
>> Best,
>> Erick
>>
>> On Tue, Jan 5, 2016 at 9:41 AM, Matteo Grolla <matteo.gro...@gmail.com>
>> wrote:
>> > Thanks Erick and Binoy,
>> >      This is a case I stumbled upon: with queries like
>> >
>> > q=*:*&fq={!cache=false}n_rea:xxx&fq={!cache=false}provincia:yyyy,fq={!cache=false}type:zzzz
>> >
>> > where the n_rea filter is highly selective,
>> > I was able to get a >3x performance improvement by disabling the cache.
>> >
>> > I think it's because the last two filters are not so selective: with
>> > caching they are resolved to two bitsets, which are then ANDed together,
>> > and this is less efficient than leapfrogging, since the first filter has
>> > just one or two results.
>> > Does it make sense to you?
>> >
>> >
>> >
>> >
>> >
>> > 2016-01-05 16:59 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:
>> >
>> >> Matteo:
>> >>
>> >> Let's see if I understand your problem. Essentially you want
>> >> Solr to analyze the filter queries and decide through some
>> >> algorithm which ones to cache. I have a hard time thinking of
>> >> any general way to do this; certainly there's nothing in Solr
>> >> that does this automatically. As Binoy mentions there are some
>> >> ways to influence what goes in the cache, but the algorithm is
>> >> simple....
>> >>
>> >> If you build such a thing, I suspect you'll be implicitly building
>> >> in knowledge of how your particular application uses Solr. For
>> >> sure, the functionality around "no cache filters" is there explicitly
>> >> because some fq clauses (think ACL calculations) can be
>> >> very expensive to calculate for the entire corpus (which is what
>> >> fqs do by default).
>> >>
>> >> But you really haven't given us any examples of what sorts
>> >> of fq clauses you consider "bad". Perhaps there are other ways
>> >> of approaching your problem.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >>
>> >> On Tue, Jan 5, 2016 at 7:50 AM, Binoy Dalal <binoydala...@gmail.com>
>> >> wrote:
>> >> > What is your exact requirement then?
>> >> > I ask, because these settings can solve the problems you've mentioned
>> >> > without the need to add any additional functionality.
>> >> >
>> >> > On Tue, Jan 5, 2016 at 9:04 PM Matteo Grolla <matteo.gro...@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> Hi Binoy,
>> >> >>      I know these settings but the problem I'm trying to solve is when
>> >> >> these settings aren't enough.
>> >> >>
>> >> >>
>> >> >> 2016-01-05 16:30 GMT+01:00 Binoy Dalal <binoydala...@gmail.com>:
>> >> >>
>> >> >> > If I understand your problem correctly, then you don't want the most
>> >> >> > frequently used fqs removed and you do not want your filter cache to grow
>> >> >> > to very large sizes.
>> >> >> > Well there is already a solution for both of these.
>> >> >> > In the solrconfig.xml file, you can configure the <filterCache> element
>> >> >> > to suit your needs.
>> >> >> > a) Use the LeastFrequentlyUsed or LFU eviction policy.
>> >> >> > b) Set the size to whatever number of fqs you find suitable.
>> >> >> > You can do this like so:
>> >> >> > <filterCache class="solr.LFUCache" size="100" initialSize="10"
>> >> >> > autowarmCount="10"/>
>> >> >> > You should play around with these parameters to find the best combination
>> >> >> > for your implementation.
>> >> >> > For more details take a look here:
>> >> >> > https://wiki.apache.org/solr/SolrCaching
>> >> >> > http://yonik.com/advanced-filter-caching-in-solr/
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Jan 5, 2016 at 7:28 PM Matteo Grolla <matteo.gro...@gmail.com>
>> >> >> > wrote:
>> >> >> >
>> >> >> > > Hi,
>> >> >> > >     after looking at the CloudSearch presentation from Lucene
>> >> >> > > Revolution 2014
>> >> >> > >
>> >> >> > > https://www.youtube.com/watch?v=RI1x0d-yO8A&list=PLU6n9Voqu_1FM8nmVwiWWDRtsEjlPqhgP&index=49
>> >> >> > > min 17:08
>> >> >> > >
>> >> >> > > I realized I'd love to be able to remove the burden of disabling filter
>> >> >> > > query caching from developers
>> >> >> > >
>> >> >> > > the problem:
>> >> >> > > Solr by default caches filter queries
>> >> >> > > a) When there are many filter queries that are not reused, the few that
>> >> >> > > are reused (the good ones) get evicted unnecessarily
>> >> >> > > b) if the same query has multiple filter queries that are very selective, I
>> >> >> > > noticed a big performance improvement when disabling the cache
>> >> >> > > c) I'd like to spare developers from deciding what has to be cached or
>> >> >> > > not
>> >> >> > >
>> >> >> > > the question:
>> >> >> > > -Is there anything already working to solve those problems?
>> >> >> > >
>> >> >> > > what do you think about this?
>> >> >> > > -I was thinking of writing a plugin to recognize query types with regular
>> >> >> > > expressions and let Solr admins associate a caching behaviour with each
>> >> >> > > query type
>> >> >> > > -another idea was to
>> >> >> > >    -by default set fq caching off
>> >> >> > >    -keep statistics about fq
>> >> >> > >    -enable caching only for the N fq with highest hit ratio
>> >> >> > >
>> >> >> > --
>> >> >> > Regards,
>> >> >> > Binoy Dalal
>> >> >> >
>> >> >>
>> >> > --
>> >> > Regards,
>> >> > Binoy Dalal
>> >>
>>
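
The second idea in the original post (caching off by default, keep
statistics about each fq, enable caching only for the N best ones) could
be sketched roughly as follows. This is a hypothetical helper, not an
existing Solr plugin, and it uses a simple reuse count as a stand-in for
the hit ratio; only the `{!cache=false}` local parameter comes from the
thread.

```python
from collections import Counter


class FqCacheAdvisor:
    """Hypothetical helper: counts how often each fq string is seen and
    treats only the top-N most reused ones as worth caching."""

    def __init__(self, top_n):
        self.top_n = top_n
        self.seen = Counter()

    def record(self, fq):
        """Register one occurrence of an fq string."""
        self.seen[fq] += 1

    def cacheable(self):
        """Return the top-N fq strings that were seen more than once."""
        # An fq seen only once was never reused, so caching it cannot pay off.
        reused = [fq for fq, n in self.seen.most_common() if n > 1]
        return set(reused[: self.top_n])

    def rewrite(self, fq):
        """Prefix the fq with {!cache=false} unless it is worth caching."""
        return fq if fq in self.cacheable() else "{!cache=false}" + fq


advisor = FqCacheAdvisor(top_n=2)
for fq in ["type:a", "type:a", "provincia:mi", "provincia:mi",
           "provincia:mi", "n_rea:1", "n_rea:2"]:
    advisor.record(fq)

print(advisor.rewrite("type:a"))
print(advisor.rewrite("n_rea:1"))
```

In this toy run the two reused clauses stay cacheable, while the one-off
n_rea clauses are rewritten with cache=false. A real implementation would
also need to age the statistics and bound their memory use.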
