Re: Processing query clause combinations at indexing time

Michael Froh Tue, 15 Dec 2020 12:19:08 -0800

It's conceptually similar to CommonGrams in the single-field case, though
it doesn't require terms to appear in any particular positions.


It can also match across fields, which is where we get a lot of the
benefit. Various front-end layers add frequently occurring filters to
queries before the queries reach us (the filters vary depending on where
the query comes from). In that regard, it's kind of like Solr's filter
cache, except that we identify the filters offline: we analyze query logs,
find common combinations of filters (especially ones where the
intersection is smaller than the smallest term's postings list), and cache
those combinations in the index the next time we reindex.
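
To make the flow concrete, here is a minimal Python sketch of the idea, not
the actual implementation. It assumes required clauses are plain
"field:value" strings and postings are in-memory sets of doc ids. The
function names, the min_count and max_size thresholds, and the exact shape
of the "opto:" term are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_combinations(query_log, min_count=2, max_size=4):
    """Offline step: count co-occurring sets of required clauses in logged queries."""
    counts = Counter()
    for required in query_log:
        clauses = sorted(set(required))
        for size in range(2, min(max_size, len(clauses)) + 1):
            for combo in combinations(clauses, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_count}


def select_useful(frequent, postings):
    """Keep combinations whose intersection is smaller than the smallest postings list."""
    useful = {}
    for combo in frequent:
        lists = [postings.get(clause, set()) for clause in combo]
        intersection = set.intersection(*lists)
        if len(intersection) < min(len(p) for p in lists):
            # Hypothetical naming scheme for the synthetic term.
            useful[combo] = "opto:" + "+".join(combo)
    return useful


def index_terms(doc_terms, useful):
    """Index time: add the synthetic term if the doc satisfies every clause of a combo."""
    extra = {term for combo, term in useful.items() if set(combo) <= doc_terms}
    return set(doc_terms) | extra


def rewrite_query(required, useful):
    """Query time: swap a known combination of required clauses for its single term."""
    remaining = set(required)
    rewritten = []
    # Try larger combinations first so they replace as many clauses as possible.
    for combo in sorted(useful, key=len, reverse=True):
        if set(combo) <= remaining:
            rewritten.append(useful[combo])
            remaining -= set(combo)
    return sorted(rewritten) + sorted(remaining)
```

In a real Lucene setup the extra term would be added to the document as an
indexed field and the rewritten clause would become a single TermQuery; the
sketch only shows the bookkeeping around that.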

On Tue, Dec 15, 2020 at 9:10 AM Robert Muir <[email protected]> wrote:

> See also commongrams which is a very similar concept:
>
> https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/commongrams
>
> On Tue, Dec 15, 2020 at 12:08 PM Robert Muir <[email protected]> wrote:
> >
> > I wonder if it can be done in a fairly clean way. This sounds similar
> > to using a ShingleFilter to do this optimization, but adding some
> > conditionals so that the index is smaller? Now that we have
> > ConditionalTokenFilter (for branching), can the feature be implemented
> > cleanly?
> >
> > Ideally it wouldn't require a lot of new code, something like checking
> > a "set" + conditionaltokenfilter + shinglefilter?
> >
> > On Mon, Dec 14, 2020 at 2:37 PM Michael Froh <[email protected]> wrote:
> > >
> > > My team at work has a neat feature that we've built on top of Lucene
> > > that has provided a substantial (20%+) increase in maximum qps and some
> > > reduction in query latency.
> > >
> > > Basically, we run a training process that looks at historical queries
> > > to find frequently co-occurring combinations of required clauses, say "+A
> > > +B +C +D". Then at indexing time, if a document satisfies one of these
> > > known combinations, we add a new term to the doc, like "opto:ABCD". At
> > > query time, we can then replace the required clauses with a single
> > > TermQuery for the "optimized" term.
> > >
> > > It adds a little bit of extra work at indexing time and requires the
> > > offline training step, but we've found that it yields a significant boost
> > > at query time.
> > >
> > > We're interested in open-sourcing this feature. Is it something worth
> > > adding to Lucene? Since it doesn't require any core changes, maybe as a
> > > module?
>
