Re: Facet Performance

James Bodkin Wed, 17 Jun 2020 07:06:09 -0700

The large majority of the relevant fields have fewer than 20 unique values. We
have two fields over that, with 150 unique values and 5300 unique values
respectively.
At the moment, our filterCache is configured with a maximum size of 8192.
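
For reference, a filterCache of that size is declared in solrconfig.xml roughly
as below. This is only a sketch: the cache class, initialSize and autowarmCount
values are assumptions for illustration, not a copy of the real configuration.
Since facet.method=enum caches one filter per term, the 5300-value field alone
would occupy a large share of the 8192 entries.

```xml
<!-- solrconfig.xml, inside the <query> section -->
<!-- size matches the figure quoted above; class, initialSize and
     autowarmCount are illustrative assumptions -->
<filterCache class="solr.FastLRUCache"
             size="8192"
             initialSize="512"
             autowarmCount="0"/>
```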


From the DocValues documentation 
(https://lucene.apache.org/solr/guide/8_3/docvalues.html), it mentions that 
this approach promises to make lookups for faceting, sorting and grouping much 
faster.
Hence I thought that using DocValues would be better than using Indexed and in 
turn improve our response times and possibly lower memory requirements. It 
sounds like this isn't the case if you are able to allocate enough memory to 
the filterCache.

I haven't yet tried changing the uninvertible setting; I was only looking at
the documentation for it earlier today.
Should we be setting uninvertible="false" if docValues="true" regardless of 
whether indexed is true or false?
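
To make the question concrete, the combination being asked about would sit on
the field definition roughly like this. It is a sketch only: the field name is
taken from the example queries further down the thread, while the "string" type
name and stored="false" are assumptions. indexed="true" is kept here because
facet.method=enum and facet refinement rely on indexed values, as noted in the
reply quoted below.

```xml
<!-- schema fragment (managed-schema or schema.xml) -->
<!-- uninvertible="false" makes Solr fail fast rather than silently
     uninverting the field on the heap when docValues are unavailable -->
<field name="D_Destination" type="string"
       indexed="true" stored="false"
       docValues="true" uninvertible="false"/>
```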

Kind Regards,

James Bodkin

On 17/06/2020, 14:02, "Michael Gibney" <mich...@michaelgibney.net> wrote:

    facet.method=enum works by executing a query (against indexed values)
    for each indexed value in a given field (which, for indexed=false, is
    "no values"). So that explains why facet.method=enum no longer works.
    I was going to suggest that you might not want to set indexed=false on
    the docValues facet fields anyway, since the indexed values are still
    used for facet refinement (assuming your index is distributed).

    What's the number of unique values in the relevant fields? If it's low
    enough, setting docValues=false and indexed=true and using
    facet.method=enum (with a sufficiently large filterCache) is
    definitely a viable option, and will almost certainly be faster than
    docValues-based faceting. (As an aside, noting for future reference:
    high-cardinality facets over high-cardinality DocSet domains might be
    able to benefit from a term facet count cache:
    https://issues.apache.org/jira/browse/SOLR-13807)

    I think you didn't specifically mention whether you acted on Erick's
    suggestion of setting "uninvertible=false" (I think Erick accidentally
    said "uninvertible=true") to fail fast. I'd also recommend doing that,
    perhaps even above all else -- it shouldn't actually *do* anything,
    but will help ensure that things are behaving as you expect them to!

    Michael

    On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
    <james.bod...@loveholidays.com> wrote:
    >
    > Thanks, I've implemented some queries that improve the first-hit
    > execution for faceting.
    >
    > Since turning off indexed on those fields, we've noticed that
    > facet.method=enum no longer returns the facets when used.
    > Using facet.method=fc/fcs is significantly slower compared to
    > facet.method=enum for us. Why do these two differences exist?
    >
    > On 16/06/2020, 17:52, "Erick Erickson" <erickerick...@gmail.com> wrote:
    >
    >     Ok, I see the disconnect... Necessary parts of the index are read
    >     from disk lazily. So your newSearcher or firstSearcher query needs to
    >     do whatever operation causes the relevant parts of the index to be
    >     read. In this case, probably just facet on all the fields you care
    >     about. I'd add sorting too if you sort on different fields.
    >
    >     The *:* query without facets or sorting does virtually nothing due to
    >     some special handling...
    >
    >     On Tue, Jun 16, 2020, 10:48 James Bodkin <james.bod...@loveholidays.com>
    >     wrote:
    >
    >     > I've been trying to build a query that I can use in newSearcher based
    >     > off the information in your previous e-mail. I thought you meant to
    >     > build a *:* query as per Query 1 in my previous e-mail but I'm still
    >     > seeing the first-hit execution.
    >     > Now I'm wondering if you meant to create a *:* query with each of the
    >     > fields as part of the fl query parameters or a *:* query with each of
    >     > the fields and values as part of the fq query parameters.
    >     >
    >     > At the moment I've been running these manually as I expected that I
    >     > would see the first-execution penalty disappear by the time I got to
    >     > query 4, as I thought this would replicate the actions of the
    >     > newSearcher.
    >     > Unfortunately we can't use the autowarm count that is available as
    >     > part of the filterCache/queryResultCache due to the custom deployment
    >     > mechanism we use to update our index.
    >     >
    >     > Kind Regards,
    >     >
    >     > James Bodkin
    >     >
    >     > On 16/06/2020, 15:30, "Erick Erickson" <erickerick...@gmail.com> wrote:
    >     >
    >     >     Did you try the autowarming like I mentioned in my previous e-mail?
    >     >
    >     >     > On Jun 16, 2020, at 10:18 AM, James Bodkin <james.bod...@loveholidays.com> wrote:
    >     >     >
    >     >     > We've changed the schema to enable docValues for these fields and
    >     >     > this led to an improvement in the response time. We found a further
    >     >     > improvement by also switching off indexed as these fields are used
    >     >     > for faceting and filtering only.
    >     >     > Since those changes, we've found that the first-execution penalty
    >     >     > for queries is really noticeable. I thought this would be the
    >     >     > filterCache based on what I saw in NewRelic; however, it is probably
    >     >     > trying to read the docValues from disk. How can we use the
    >     >     > autowarming to improve this?
    >     >     >
    >     >     > For example, I've run the following queries in sequence and each
    >     >     > query has a first-execution penalty.
    >     >     >
    >     >     > Query 1:
    >     >     >
    >     >     > q=*:*
    >     >     > facet=true
    >     >     > facet.field=D_DepartureAirport
    >     >     > facet.field=D_Destination
    >     >     > facet.limit=-1
    >     >     > rows=0
    >     >     >
    >     >     > Query 2:
    >     >     >
    >     >     > q=*:*
    >     >     > fq=D_DepartureAirport:(2660)
    >     >     > facet=true
    >     >     > facet.field=D_Destination
    >     >     > facet.limit=-1
    >     >     > rows=0
    >     >     >
    >     >     > Query 3:
    >     >     >
    >     >     > q=*:*
    >     >     > fq=D_DepartureAirport:(2661)
    >     >     > facet=true
    >     >     > facet.field=D_Destination
    >     >     > facet.limit=-1
    >     >     > rows=0
    >     >     >
    >     >     > Query 4:
    >     >     >
    >     >     > q=*:*
    >     >     > fq=D_DepartureAirport:(2660+OR+2661)
    >     >     > facet=true
    >     >     > facet.field=D_Destination
    >     >     > facet.limit=-1
    >     >     > rows=0
    >     >     >
    >     >     > We've kept the field type as a string, as the value is mapped by
    >     >     > the application that accesses Solr. In the examples above, the
    >     >     > values are mapped to airports and destinations.
    >     >     > Is it possible to prewarm the above queries without having to
    >     >     > define all the potential filters manually in the auto warming?
    >     >     >
    >     >     > At the moment, we update and optimise our index in a different
    >     >     > environment and then copy the index to our production instances by
    >     >     > using a rolling deployment in Kubernetes.
    >     >     >
    >     >     > Kind Regards,
    >     >     >
    >     >     > James Bodkin
    >     >     >
    >     >     > On 12/06/2020, 18:58, "Erick Erickson" <erickerick...@gmail.com> wrote:
    >     >     >
    >     >     >    I question whether filterCache has anything to do with it; I
    >     >     >    suspect what’s really happening is that the first time you’re
    >     >     >    reading the relevant bits from disk into memory. And to double
    >     >     >    check you should have docValues enabled for all these fields. The
    >     >     >    “uninverting” process can be very expensive, and docValues
    >     >     >    bypasses that.
    >     >     >
    >     >     >    As of Solr 7.6, you can define “uninvertible=true” to your
    >     >     >    field(Type) to “fail fast” if Solr needs to uninvert the field.
    >     >     >
    >     >     >    But that’s an aside. In either case, my claim is that first-time
    >     >     >    execution does “something”, either reads the serialized docValues
    >     >     >    from disk or uninverts the field on Solr’s heap.
    >     >     >
    >     >     >    You can have this autowarmed by any combination of
    >     >     >    1> specifying an autowarm count on your queryResultCache. That’s
    >     >     >    hit or miss, as it replays the most recent N queries which may or
    >     >     >    may not contain the sorts. That said, specifying 10-20 for
    >     >     >    autowarm count is usually a good idea, assuming you’re not
    >     >     >    committing more often than, say, every 30 seconds. I’d add the
    >     >     >    same to filterCache too.
    >     >     >
    >     >     >    2> specifying a newSearcher or firstSearcher query in
    >     >     >    solrconfig.xml. The difference is that newSearcher is fired every
    >     >     >    time a commit happens, while firstSearcher is only fired when Solr
    >     >     >    starts, the theory being that there’s no cache autowarming
    >     >     >    available when Solr first powers up. Usually, people don’t bother
    >     >     >    with firstSearcher or just make it the same as newSearcher. Note
    >     >     >    that a query doesn’t have to be “real” at all. You can just add
    >     >     >    all the facet fields to a *:* query in a single go.
    >     >     >
    >     >     >    BTW, Trie fields will stay around for a long time even though
    >     >     >    deprecated. Or at least until we find something to replace them
    >     >     >    with that doesn’t have this penalty, so I’d feel pretty safe using
    >     >     >    those and they’ll be more efficient than strings.
    >     >     >
    >     >     >    Best,
    >     >     >    Erick
    >     >     >
    >     >
    >     >
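
The warming query described in the thread above (a single *:* request that
facets on every field of interest) might be registered in solrconfig.xml along
these lines. This is a sketch, assuming only the two field names from the
example queries. firstSearcher is shown on the assumption that an index copied
into place by a rolling deployment is picked up at startup rather than by a
commit; the same block with event="newSearcher" would cover commits. Any sort
fields in use could be added as a sort parameter in the same way.

```xml
<!-- solrconfig.xml, inside the <query> section -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <!-- match-all query; rows=0 because only the facet work matters -->
      <str name="q">*:*</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
      <str name="facet.limit">-1</str>
      <!-- one entry per facet field that should be read from disk up front -->
      <str name="facet.field">D_DepartureAirport</str>
      <str name="facet.field">D_Destination</str>
    </lst>
  </arr>
</listener>
```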
