See IndexSearcher.setQueryCache and the related methods for more details, especially LRUQueryCache.

Maybe we should celebrate a little bit if it's already 80% of the way there
for your use-case, but at the same time, perhaps defaults could be better.
There is a lot going on here, for example decisions about which data
structures to use in the cache (sparse bitsets and so on) that all have
tradeoffs. But IMO the out-of-box defaults should be as good as possible
since it has the huge benefit of requiring zero effort from the user.

On Tue, Dec 15, 2020 at 8:05 PM Michael Froh <[email protected]> wrote:
>
> We don't handle positional queries in our use-case, but that's just because
> we don't happen to have many positional queries. But if we identify documents
> at indexing time that contain a given phrase/slop/etc. query, then we can tag
> the documents with a term that indicates that (or, more likely, tag documents
> that contain that positional query AND some other queries). We can identify
> documents that match a PhraseQuery, for example, by appending a
> TokenFilter for the relevant field that "listens" for the given phrase.
>
> Our use-case has only needed TermQuery, numeric range queries, and
> ToParentBlockJoinQuery clauses so far, though. For TermQuery, we can just
> listen for individual terms (with a TokenFilter). For range queries, we look
> at the IndexableField itself (typically an IntPoint) before submitting the
> Document to the IndexWriter. For a ToParentBlockJoinQuery, we can just apply
> the matching logic to each child document to detect a match before we get to
> the parent. The downside is that for each Query type that we want to be able
> to evaluate at indexing time, we need to add explicit support.
>
> We're not scoring at matching time (relying on a static sort instead), which
> allows us to remove the matched clauses altogether.
> That said, if the match set of the conjunction of required clauses is
> small (at least smaller than the match sets of the individual clauses),
> adding a "precomputed intersection" filter should advance scorers more
> efficiently.
>
> Does Lucene's filter caching match on subsets of required clauses? So, for
> example, if some queries contain (somewhere in a BooleanQuery tree) clauses
> that flatten to "+A +B +C", can I cache that and also have it kick in for a
> BooleanQuery containing "+A +B +C +D", turning it into something like
> "+cached('+A +B +C') +D" without having to explicitly do a cache lookup for
> "+A +B +C"?
>
> I guess another advantage of our approach is that it's effectively a
> write-through cache, pushing the filter-matching burden to indexing time. For
> read-heavy use-cases, that trade-off is worth it.
>
> On Tue, Dec 15, 2020 at 3:42 PM Robert Muir <[email protected]> wrote:
>>
>> What are you doing with positional queries though? And how does the
>> scoring work (it is unclear from your previous reply to me whether you
>> are scoring).
>>
>> Lucene has filter caching too, so if you are doing this for
>> non-scoring cases maybe something is off?
>>
>> On Tue, Dec 15, 2020 at 3:19 PM Michael Froh <[email protected]> wrote:
>> >
>> > It's conceptually similar to CommonGrams in the single-field case, though
>> > it doesn't require terms to appear in any particular positions.
>> >
>> > It's also able to match across fields, which is where we get a lot of
>> > benefit. We have frequently-occurring filters that get added by various
>> > front-end layers before they hit us (which vary depending on where the
>> > query comes from).
>> > In that regard, it's kind of like Solr's filter cache, except that we
>> > identify the filters offline by analyzing query logs, find common
>> > combinations of filters (especially ones where the intersection is
>> > smaller than the smallest term's postings list), and cache the filters in
>> > the index the next time we reindex.
>> >
>> > On Tue, Dec 15, 2020 at 9:10 AM Robert Muir <[email protected]> wrote:
>> >>
>> >> See also commongrams which is a very similar concept:
>> >> https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/commongrams
>> >>
>> >> On Tue, Dec 15, 2020 at 12:08 PM Robert Muir <[email protected]> wrote:
>> >> >
>> >> > I wonder if it can be done in a fairly clean way. This sounds similar
>> >> > to using a ShingleFilter to do this optimization, but adding some
>> >> > conditionals so that the index is smaller? Now that we have
>> >> > ConditionalTokenFilter (for branching), can the feature be implemented
>> >> > cleanly?
>> >> >
>> >> > Ideally it wouldn't require a lot of new code, something like checking
>> >> > a "set" + conditionaltokenfilter + shinglefilter?
>> >> >
>> >> > On Mon, Dec 14, 2020 at 2:37 PM Michael Froh <[email protected]> wrote:
>> >> > >
>> >> > > My team at work has a neat feature that we've built on top of Lucene
>> >> > > that has provided a substantial (20%+) increase in maximum qps and
>> >> > > some reduction in query latency.
>> >> > >
>> >> > > Basically, we run a training process that looks at historical queries
>> >> > > to find frequently co-occurring combinations of required clauses, say
>> >> > > "+A +B +C +D". Then at indexing time, if a document satisfies one of
>> >> > > these known combinations, we add a new term to the doc, like
>> >> > > "opto:ABCD". At query time, we can then replace the required clauses
>> >> > > with a single TermQuery for the "optimized" term.
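
For anyone skimming the thread, the scheme in the quoted paragraph just above can be sketched in plain Java without any Lucene classes. This is only a toy model under stated assumptions, not Michael's implementation and not a Lucene API: clauses are plain strings, the class name ClauseRewriter and its methods are hypothetical, and the known combinations are registered by hand instead of coming from a training step. As noted elsewhere in the thread, dropping the matched clauses like this is only safe when they do not contribute to scoring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the "optimized term" idea; all names here are hypothetical. */
final class ClauseRewriter {
    // Known combinations of required clauses, each mapped to the synthetic
    // term that matching documents are tagged with at indexing time.
    private final Map<Set<String>, String> combos = new LinkedHashMap<>();

    void addCombination(Set<String> clauses, String optimizedTerm) {
        combos.put(new HashSet<>(clauses), optimizedTerm);
    }

    /** Indexing side: synthetic terms for every known combination the doc fully satisfies. */
    List<String> termsToAdd(Set<String> satisfiedClauses) {
        List<String> terms = new ArrayList<>();
        for (Map.Entry<Set<String>, String> e : combos.entrySet()) {
            if (satisfiedClauses.containsAll(e.getKey())) {
                terms.add(e.getValue());
            }
        }
        return terms;
    }

    /** Query side: swap the largest known combination contained in the required clauses. */
    List<String> rewrite(Set<String> required) {
        Set<String> best = null;
        for (Set<String> combo : combos.keySet()) {
            if (required.containsAll(combo) && (best == null || combo.size() > best.size())) {
                best = combo;
            }
        }
        if (best == null) {
            return new ArrayList<>(required); // nothing known applies, leave the query alone
        }
        List<String> out = new ArrayList<>();
        out.add(combos.get(best)); // one synthetic term stands in for the whole combination
        for (String clause : required) {
            if (!best.contains(clause)) {
                out.add(clause); // clauses outside the combination stay as they were
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ClauseRewriter rewriter = new ClauseRewriter();
        rewriter.addCombination(new HashSet<>(Arrays.asList("A", "B", "C")), "opto:ABC");
        Set<String> query = new LinkedHashSet<>(Arrays.asList("A", "B", "C", "D"));
        System.out.println(rewriter.rewrite(query)); // prints [opto:ABC, D]
    }
}
```

The sketch picks the largest matching combination and applies only that one; choosing among overlapping combinations in a real system would presumably depend on how selective each one is, which this toy version does not model.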
>> >> > >
>> >> > > It adds a little bit of extra work at indexing time and requires the
>> >> > > offline training step, but we've found that it yields a significant
>> >> > > boost at query time.
>> >> > >
>> >> > > We're interested in open-sourcing this feature. Is it something worth
>> >> > > adding to Lucene? Since it doesn't require any core changes, maybe as
>> >> > > a module?

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
