We've noticed that the filterCache uses a significant amount of memory, as 
we've assigned an 8GB heap per instance.
In total, we have 32 shards with 2 replicas, hence (8 * 32 * 2) 512GB of heap 
space alone; further memory is required to ensure the index is always 
memory-mapped for performance reasons.

Ideally I would like to reduce the amount of memory assigned to the heap by 
using docValues instead of indexed, but it doesn't seem possible.
The QTime (after warming) for facet.method=enum is around 150-250ms, whereas 
the QTime for facet.method=fc is around 1000-1200ms.
As we require the results in real time for customers searching on our website, 
the latter QTime of 1000-1200ms is too slow for us to use.

Our facet queries change as the customer selects different search criteria, so 
the number of potential queries makes the query result cache largely 
ineffective.
We already have a custom implementation in which we check our Redis cache for 
queries before they are sent to our aggregators, which runs at a 30% hit rate.

Kind Regards,

James Bodkin

On 17/06/2020, 16:21, "Michael Gibney" <mich...@michaelgibney.net> wrote:

    To expand a bit on what Erick said regarding performance: my sense is
    that the RefGuide assertion that "docValues=true" makes faceting
    "faster" could use some qualification/clarification. My take, fwiw:

    First, to reiterate/paraphrase what Erick said: the "faster" assertion
    is not comparing to "facet.method=enum". For low-cardinality fields,
    if you have the heap space, and are very intentional about configuring
    your filterCache (and monitoring it as access patterns might change),
    "facet.method=enum" will likely be as fast as you can get (at least
    for "legacy" facets or whatever -- not sure about "enum" method in
    JSON facets).
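    As a concrete point of reference, the filterCache being discussed here is
    configured in solrconfig.xml. A sketch (the sizes are illustrative, not
    recommendations; on Solr versions before 8.3 the class would be
    solr.FastLRUCache rather than solr.CaffeineCache):

    ```xml
    <!-- filterCache: one entry per cached filter; each entry can be a bitset
         of up to maxDoc bits, so size this against your heap budget. -->
    <filterCache class="solr.CaffeineCache"
                 size="8192"
                 initialSize="512"
                 autowarmCount="32"/>
    ```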

    Even where "docValues=true" arguably does make faceting "faster", the
    main benefit is that the "uninverted" data structures are serialized
    on disk, so you're avoiding the need to uninvert each facet field
    on-heap for every new indexSearcher, which is generally high-latency
    -- user perception of this latency can be mitigated using warming
    queries, but it can still be problematic, esp. for frequent index
    updates. On-heap uninversion also inherently consumes a lot of heap
    space, which has general implications wrt GC, etc ... so in that
    respect even if faceting per se might not be "faster" with
    "docValues=true", your overall system may in many cases perform
    better.

    (and Anthony, I'm pretty sure that tag/ex on facets should be
    orthogonal to the "facet.method=enum"/filterCache discussion, as
    tag/ex only affects the DocSet domain over which facets are calculated
    ... I think that step is pretty cleanly separated from the actual
    calculation of the facets. I'm not 100% sure on that, so proceed with
    caution, but it could definitely be worth evaluating for your use
    case!)

    Michael

    On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson <erickerick...@gmail.com> 
wrote:
    >
    > Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_ use a docValues=false
    > field for faceting/grouping/sorting/function queries. The primary point of docValues=true is twofold:
    >
    > 1> reduce Java heap requirements by using the OS memory to hold it
    >
    > 2> uninverting can be expensive CPU wise too, although not with just a few
    >     unique values (for each term, read the list of docs that have it and flip a bit).
    >
    > It doesn’t really make sense to set it on an index=false field, since uninverting only happens on
    > index=true docValues=false. OTOH, I don’t think it would do any harm either. That said, I frankly
    > don’t know how that interacts with facet.method=enum.
    >
    > As far as speed… yeah, you’re in the edge cases. All things being equal, stuffing these into the
    > filterCache is the fastest way to facet if you have the memory. I’ve seen very few installations
    > where people have that luxury though. Each entry in the filterCache can occupy maxDoc/8 + some
    > overhead bytes. If maxDoc is very large, this’ll chew up an enormous amount of memory. I’m
    > cheating a bit here, since an entry might be smaller if only a few docs match it. But that’s the
    > worst case you have to allow for, ’cause you could theoretically hit the perfect storm where, due
    > to some particular sequence of queries, your entire filterCache fills up with entries of that size.
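    To make that worst case concrete, here is a back-of-the-envelope sketch
    (the maxDoc value is hypothetical, not a figure from this thread; the
    cache size of 8192 is the one mentioned earlier):

    ```python
    # Worst-case filterCache footprint: each entry can be a bitset with one
    # bit per document (maxDoc/8 bytes), and the cache can hold up to its
    # configured number of entries.
    max_doc = 100_000_000          # hypothetical documents per core
    cache_size = 8192              # filterCache size from the thread
    bytes_per_entry = max_doc // 8 # one bit per document
    worst_case = bytes_per_entry * cache_size
    print(f"{bytes_per_entry / 2**20:.1f} MiB per entry, "
          f"{worst_case / 2**30:.1f} GiB worst case")
    ```

    In practice many entries are stored more compactly when few documents
    match, but as Erick notes, the worst case is what you have to budget for.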
    >
    > You’ll have some overhead to keep the cache at that size, but it sounds like it’s worth it.
    >
    > Best,
    > Erick
    >
    >
    >
    > > On Jun 17, 2020, at 10:05 AM, James Bodkin <james.bod...@loveholidays.com> wrote:
    > >
    > > The large majority of the relevant fields have fewer than 20 unique values. We have two fields
    > > over that, with 150 unique values and 5300 unique values respectively.
    > > At the moment, our filterCache is configured with a maximum size of 8192.
    > >
    > > From the DocValues documentation (https://lucene.apache.org/solr/guide/8_3/docvalues.html), it
    > > mentions that this approach promises to make lookups for faceting, sorting and grouping much
    > > faster.
    > > Hence I thought that using DocValues would be better than using Indexed and in turn improve our
    > > response times and possibly lower memory requirements. It sounds like this isn't the case if you
    > > are able to allocate enough memory to the filterCache.
    > >
    > > I haven't yet tried changing the uninvertible setting; I was looking at the documentation for
    > > this field earlier today.
    > > Should we be setting uninvertible="false" if docValues="true", regardless of whether indexed is
    > > true or false?
    > >
    > > Kind Regards,
    > >
    > > James Bodkin
    > >
    > > On 17/06/2020, 14:02, "Michael Gibney" <mich...@michaelgibney.net> wrote:
    > >
    > >    facet.method=enum works by executing a query (against indexed values)
    > >    for each indexed value in a given field (which, for indexed=false, is
    > >    "no values"). So that explains why facet.method=enum no longer works.
    > >    I was going to suggest that you might not want to set indexed=false on
    > >    the docValues facet fields anyway, since the indexed values are still
    > >    used for facet refinement (assuming your index is distributed).
    > >
    > >    What's the number of unique values in the relevant fields? If it's low
    > >    enough, setting docValues=false and indexed=true and using
    > >    facet.method=enum (with a sufficiently large filterCache) is
    > >    definitely a viable option, and will almost certainly be faster than
    > >    docValues-based faceting. (As an aside, noting for future reference:
    > >    high-cardinality facets over high-cardinality DocSet domains might be
    > >    able to benefit from a term facet count cache:
    > >    https://issues.apache.org/jira/browse/SOLR-13807)
    > >
    > >    I think you didn't specifically mention whether you acted on Erick's
    > >    suggestion of setting "uninvertible=false" (I think Erick accidentally
    > >    said "uninvertible=true") to fail fast. I'd also recommend doing that,
    > >    perhaps even above all else -- it shouldn't actually *do* anything,
    > >    but will help ensure that things are behaving as you expect them to!
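    A minimal schema sketch of that recommendation, using one of the facet
    fields named later in this thread (the fieldType name is an assumption;
    adapt to your schema):

    ```xml
    <!-- docValues-backed facet field; indexed stays true so facet refinement
         and facet.method=enum still work. uninvertible="false" makes Solr
         fail fast rather than silently uninverting on the heap. -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <field name="D_Destination" type="string" indexed="true" stored="false"
           docValues="true" uninvertible="false"/>
    ```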
    > >
    > >    Michael
    > >
    > >    On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
    > >    <james.bod...@loveholidays.com> wrote:
    > >>
    > >> Thanks, I've implemented some queries that improve the first-hit execution for faceting.
    > >>
    > >> Since turning off indexed on those fields, we've noticed that facet.method=enum no longer
    > >> returns the facets when used.
    > >> Using facet.method=fc/fcs is significantly slower compared to facet.method=enum for us. Why
    > >> do these two differences exist?
    > >>
    > >> On 16/06/2020, 17:52, "Erick Erickson" <erickerick...@gmail.com> wrote:
    > >>
    > >>    Ok, I see the disconnect... Necessary parts of the index are read from disk
    > >>    lazily. So your newSearcher or firstSearcher query needs to do whatever
    > >>    operation causes the relevant parts of the index to be read. In this case,
    > >>    probably just facet on all the fields you care about. I'd add sorting too
    > >>    if you sort on different fields.
    > >>
    > >>    The *:* query without facets or sorting does virtually nothing due to some
    > >>    special handling...
    > >>
    > >>    On Tue, Jun 16, 2020, 10:48 James Bodkin <james.bod...@loveholidays.com> wrote:
    > >>
    > >>> I've been trying to build a query that I can use in newSearcher based off
    > >>> the information in your previous e-mail. I thought you meant to build a *:*
    > >>> query as per Query 1 in my previous e-mail, but I'm still seeing the
    > >>> first-hit execution penalty.
    > >>> Now I'm wondering if you meant to create a *:* query with each of the
    > >>> fields as part of the fl query parameters, or a *:* query with each of the
    > >>> fields and values as part of the fq query parameters.
    > >>>
    > >>> At the moment I've been running these manually, as I expected that I would
    > >>> see the first-execution penalty disappear by the time I got to query 4,
    > >>> since I thought this would replicate the actions of the newSearcher.
    > >>> Unfortunately we can't use the autowarm count that is available as part of
    > >>> the filterCache/queryResultCache due to the custom deployment mechanism we
    > >>> use to update our index.
    > >>>
    > >>> Kind Regards,
    > >>>
    > >>> James Bodkin
    > >>>
    > >>> On 16/06/2020, 15:30, "Erick Erickson" <erickerick...@gmail.com> wrote:
    > >>>
    > >>>    Did you try the autowarming like I mentioned in my previous e-mail?
    > >>>
    > >>>> On Jun 16, 2020, at 10:18 AM, James Bodkin <james.bod...@loveholidays.com> wrote:
    > >>>>
    > >>>> We've changed the schema to enable docValues for these fields, and
    > >>>> this led to an improvement in the response time. We found a further
    > >>>> improvement by also switching off indexed, as these fields are used for
    > >>>> faceting and filtering only.
    > >>>> Since those changes, we've found that the first-execution penalty for
    > >>>> queries is really noticeable. I thought this would be the filterCache,
    > >>>> based on what I saw in NewRelic; however it is probably trying to read
    > >>>> the docValues from disk. How can we use the autowarming to improve this?
    > >>>>
    > >>>> For example, I've run the following queries in sequence and each
    > >>>> query has a first-execution penalty.
    > >>>>
    > >>>> Query 1:
    > >>>>
    > >>>> q=*:*
    > >>>> facet=true
    > >>>> facet.field=D_DepartureAirport
    > >>>> facet.field=D_Destination
    > >>>> facet.limit=-1
    > >>>> rows=0
    > >>>>
    > >>>> Query 2:
    > >>>>
    > >>>> q=*:*
    > >>>> fq=D_DepartureAirport:(2660)
    > >>>> facet=true
    > >>>> facet.field=D_Destination
    > >>>> facet.limit=-1
    > >>>> rows=0
    > >>>>
    > >>>> Query 3:
    > >>>>
    > >>>> q=*:*
    > >>>> fq=D_DepartureAirport:(2661)
    > >>>> facet=true
    > >>>> facet.field=D_Destination
    > >>>> facet.limit=-1
    > >>>> rows=0
    > >>>>
    > >>>> Query 4:
    > >>>>
    > >>>> q=*:*
    > >>>> fq=D_DepartureAirport:(2660+OR+2661)
    > >>>> facet=true
    > >>>> facet.field=D_Destination
    > >>>> facet.limit=-1
    > >>>> rows=0
    > >>>>
    > >>>> We've kept the field type as a string, as the value is mapped by the
    > >>>> application that accesses Solr. In the examples above, the values are
    > >>>> mapped to airports and destinations.
    > >>>> Is it possible to prewarm the above queries without having to define
    > >>>> all the potential filters manually in the autowarming?
    > >>>>
    > >>>> At the moment, we update and optimise our index in a different
    > >>>> environment and then copy the index to our production instances by
    > >>>> using a rolling deployment in Kubernetes.
    > >>>>
    > >>>> Kind Regards,
    > >>>>
    > >>>> James Bodkin
    > >>>>
    > >>>> On 12/06/2020, 18:58, "Erick Erickson" <erickerick...@gmail.com> wrote:
    > >>>>
    > >>>>   I question whether filterCache has anything to do with it; I
    > >>>>   suspect what’s really happening is that the first time, you’re reading
    > >>>>   the relevant bits from disk into memory. And to double-check, you
    > >>>>   should have docValues enabled for all these fields. The “uninverting”
    > >>>>   process can be very expensive, and docValues bypasses that.
    > >>>>
    > >>>>   As of Solr 7.6, you can define “uninvertible=true” to your
    > >>>>   field(Type) to “fail fast” if Solr needs to uninvert the field.
    > >>>>
    > >>>>   But that’s an aside. In either case, my claim is that first-time
    > >>>>   execution does “something”: either it reads the serialized docValues
    > >>>>   from disk, or it uninverts the field on Solr’s heap.
    > >>>>
    > >>>>   You can have this autowarmed by any combination of
    > >>>>   1> specifying an autowarm count on your queryResultCache. That’s
    > >>>>   hit or miss, as it replays the most recent N queries, which may or may
    > >>>>   not contain the sorts. That said, specifying 10-20 for autowarm count
    > >>>>   is usually a good idea, assuming you’re not committing more than, say,
    > >>>>   every 30 seconds. I’d add the same to filterCache too.
    > >>>>
    > >>>>   2> specifying a newSearcher or firstSearcher query in
    > >>>>   solrconfig.xml. The difference is that newSearcher is fired every time
    > >>>>   a commit happens, while firstSearcher is only fired when Solr starts,
    > >>>>   the theory being that there’s no cache autowarming available when Solr
    > >>>>   first powers up. Usually, people don’t bother with firstSearcher or
    > >>>>   just make it the same as newSearcher. Note that a query doesn’t have
    > >>>>   to be “real” at all. You can just add all the facet fields to a *:*
    > >>>>   query in a single go.
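A minimal solrconfig.xml sketch of that suggestion, using the facet fields from
the queries earlier in this thread (adjust the query list to your schema):

```xml
<!-- Fired on every commit: warm the facet fields so the first real query
     doesn't pay the disk-read/uninversion penalty. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
      <str name="facet.field">D_DepartureAirport</str>
      <str name="facet.field">D_Destination</str>
      <str name="facet.limit">-1</str>
    </lst>
  </arr>
</listener>

<!-- Fired only at startup, when no autowarming data exists yet;
     here it simply reuses the same warming query. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
      <str name="facet.field">D_DepartureAirport</str>
      <str name="facet.field">D_Destination</str>
      <str name="facet.limit">-1</str>
    </lst>
  </arr>
</listener>
```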
    > >>>>
    > >>>>   BTW, Trie fields will stay around for a long time even though
    > >>>>   deprecated. Or at least until we find something to replace them with
    > >>>>   that doesn’t have this penalty, so I’d feel pretty safe using those
    > >>>>   and they’ll be more efficient than strings.
    > >>>>
    > >>>>   Best,
    > >>>>   Erick
    > >>>>
    > >>>
    > >>>
    >
