Re: Processing query clause combinations at indexing time

Robert Muir Tue, 15 Dec 2020 19:24:46 -0800

See IndexSearcher.setQueryCache and the related methods for more
details, especially LRUQueryCache.


Maybe we should celebrate a little bit if it's already 80% of the way
there for your use-case, but at the same time, perhaps defaults could
be better. There is a lot going on here, for example decisions about
which data structures to use in the cache (sparse bitsets and so on)
that all have tradeoffs.

But IMO the out-of-box defaults should be as good as possible, since
they have the huge benefit of requiring zero effort from the user.
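
To make the "+A +B +C" inside "+A +B +C +D" question concrete, here is a
minimal sketch of the indexing-time approach described in the quoted
messages, in plain Java with no Lucene dependency. It is not how
LRUQueryCache works and not Michael's actual code: the class name
ClauseCombos, the COMBOS map and both method names are hypothetical, and
terms are plain strings standing in for TermQuery clauses. Dropping the
matched clauses is only safe in the non-scoring case mentioned in the
thread.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Stdlib-only sketch; strings stand in for Lucene terms and required clauses. */
class ClauseCombos {
    // Hypothetical result of the offline training step:
    // combination term -> the required clauses it replaces.
    static final Map<String, Set<String>> COMBOS =
        Map.of("opto:ABC", Set.of("A", "B", "C"));

    /** Indexing time: tag a document that satisfies every clause of a known combination. */
    static Set<String> tagDocument(Set<String> docTerms) {
        Set<String> out = new LinkedHashSet<>(docTerms);
        for (Map.Entry<String, Set<String>> combo : COMBOS.entrySet()) {
            if (docTerms.containsAll(combo.getValue())) {
                out.add(combo.getKey());
            }
        }
        return out;
    }

    /**
     * Query time: if the required clauses contain a known combination as a
     * subset, swap that subset for the single combination term.
     * Assumes no scoring, so the replaced clauses can be removed outright.
     */
    static Set<String> rewriteRequired(Set<String> required) {
        Set<String> out = new LinkedHashSet<>(required);
        for (Map.Entry<String, Set<String>> combo : COMBOS.entrySet()) {
            if (out.containsAll(combo.getValue())) {
                out.removeAll(combo.getValue());
                out.add(combo.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A document containing A, B, C and X also receives the combination term.
        System.out.println(tagDocument(new LinkedHashSet<>(List.of("A", "B", "C", "X"))));
        // "+A +B +C +D" is rewritten to the combination term plus the leftover clause D.
        System.out.println(rewriteRequired(new LinkedHashSet<>(List.of("A", "B", "C", "D"))));
    }
}
```

The subset check in rewriteRequired is the part the question is about: the
lookup is by containment of the clause set, not by equality with the whole
query.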

On Tue, Dec 15, 2020 at 8:05 PM Michael Froh <[email protected]> wrote:
>
> We don't handle positional queries in our use-case, but that's just because 
> we don't happen to have many positional queries. But if we identify documents 
> at indexing time that contain a given phrase/slop/etc. query, then we can tag 
> the documents with a term that indicates that (or, more likely, tag documents 
> that contain that positional query AND some other queries). We can identify 
> documents that match a PhraseQuery, for example, by appending a 
> TokenFilter for the relevant field that "listens" for the given phrase.
>
> Our use-case has only needed TermQuery, numeric range queries, and 
> ToParentBlockJoinQuery clauses so far, though. For TermQuery, we can just 
> listen for individual terms (with a TokenFilter). For range queries, we look 
> at the IndexableField itself (typically an IntPoint) before submitting the 
> Document to the IndexWriter. For a ToParentBlockJoinQuery, we can just apply 
> the matching logic to each child document to detect a match before we get to 
> the parent. The downside is that for each Query type that we want to be able 
> to evaluate at indexing time, we need to add explicit support.
>
> We're not scoring at matching time (relying on a static sort instead), which 
> allows us to remove the matched clauses altogether. That said, if the match 
> set of the conjunction of required clauses is small (at least smaller than 
> the match sets of the individual clauses), adding a "precomputed 
> intersection" filter should advance scorers more efficiently.
>
> Does Lucene's filter caching match on subsets of required clauses? So, for 
> example, if some queries contain (somewhere in a BooleanQuery tree) clauses 
> that flatten to "+A +B +C", can I cache that and also have it kick in for a 
> BooleanQuery containing "+A +B +C +D", turning it into something like 
> "+cached('+A +B +C') +D" without having to explicitly do a cache lookup for 
> "+A +B +C"?
>
> I guess another advantage of our approach is that it's effectively a 
> write-through cache, pushing the filter-matching burden to indexing time. For 
> read-heavy use-cases, that trade-off is worth it.
>
>
>
>
> On Tue, Dec 15, 2020 at 3:42 PM Robert Muir <[email protected]> wrote:
>>
>> What are you doing with positional queries though? And how does the
>> scoring work (it is unclear from your previous reply to me whether you
>> are scoring).
>>
>> Lucene has filter caching too, so if you are doing this for
>> non-scoring cases maybe something is off?
>>
>> On Tue, Dec 15, 2020 at 3:19 PM Michael Froh <[email protected]> wrote:
>> >
>> > It's conceptually similar to CommonGrams in the single-field case, though 
>> > it doesn't require terms to appear in any particular positions.
>> >
>> > It's also able to match across fields, which is where we get a lot of 
>> > benefit. We have frequently-occurring filters that get added by various 
>> > front-end layers before they hit us (which vary depending on where the 
>> > query comes from). In that regard, it's kind of like Solr's filter cache, 
>> > except that we identify the filters offline by analyzing query logs, find 
>> > common combinations of filters (especially ones where the intersection is 
>> > smaller than the smallest term's postings list), and cache the filters in 
>> > the index the next time we reindex.
>> >
>> > On Tue, Dec 15, 2020 at 9:10 AM Robert Muir <[email protected]> wrote:
>> >>
>> >> See also commongrams which is a very similar concept:
>> >> https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/commongrams
>> >>
>> >> On Tue, Dec 15, 2020 at 12:08 PM Robert Muir <[email protected]> wrote:
>> >> >
>> >> > I wonder if it can be done in a fairly clean way. This sounds similar
>> >> > to using a ShingleFilter to do this optimization, but adding some
>> >> > conditionals so that the index is smaller? Now that we have
>> >> > ConditionalTokenFilter (for branching), can the feature be implemented
>> >> > cleanly?
>> >> >
>> >> > Ideally it wouldn't require a lot of new code, something like checking
>> >> > a "set" + conditionaltokenfilter + shinglefilter?
>> >> >
>> >> > On Mon, Dec 14, 2020 at 2:37 PM Michael Froh <[email protected]> wrote:
>> >> > >
>> >> > > My team at work has a neat feature that we've built on top of Lucene 
>> >> > > that has provided a substantial (20%+) increase in maximum qps and 
>> >> > > some reduction in query latency.
>> >> > >
>> >> > > Basically, we run a training process that looks at historical queries 
>> >> > > to find frequently co-occurring combinations of required clauses, say 
>> >> > > "+A +B +C +D". Then at indexing time, if a document satisfies one of 
>> >> > > these known combinations, we add a new term to the doc, like 
>> >> > > "opto:ABCD". At query time, we can then replace the required clauses 
>> >> > > with a single TermQuery for the "optimized" term.
>> >> > >
>> >> > > It adds a little bit of extra work at indexing time and requires the 
>> >> > > offline training step, but we've found that it yields a significant 
>> >> > > boost at query time.
>> >> > >
>> >> > > We're interested in open-sourcing this feature. Is it something worth 
>> >> > > adding to Lucene? Since it doesn't require any core changes, maybe as 
>> >> > > a module?
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>