Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

Adrien Grand Tue, 13 Oct 2020 04:25:12 -0700

100,000+ requests per core per second is a lot. :) My initial reaction is
that the query is likely so fast on that index that the bottleneck might be
rewriting or the initialization of weights/scorers (which don't get more
costly as the index gets larger) rather than actual query execution, which
means that we can't really conclude that the boolean query is faster than
the TermInSetQuery.
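
To put the numbers in perspective, here is a rough sketch (not part of the
original benchmark; the class and method names are made up for
illustration) that converts the two throughput figures quoted below into
mean time per query. It assumes the JMH scores are plain operations per
second on a single thread.

```java
import java.util.Locale;

public class PerQueryLatency {

    // Converts a throughput in operations per second into the mean
    // time per operation, expressed in microseconds.
    static double microsPerOp(double opsPerSecond) {
        return 1_000_000.0 / opsPerSecond;
    }

    public static void main(String[] args) {
        // Scores copied from the benchmark results quoted in this thread.
        double booleanQueryOps = 190820.510;   // benchmarkTerms
        double termInSetQueryOps = 110548.345; // benchmarkTermsInSet

        double booleanMicros = microsPerOp(booleanQueryOps);
        double termInSetMicros = microsPerOp(termInSetQueryOps);

        System.out.println(String.format(Locale.ROOT,
            "BooleanQuery:   %.2f us/query", booleanMicros));
        System.out.println(String.format(Locale.ROOT,
            "TermInSetQuery: %.2f us/query", termInSetMicros));
        System.out.println(String.format(Locale.ROOT,
            "Difference:     %.2f us/query", termInSetMicros - booleanMicros));
    }
}
```

With the quoted scores this works out to roughly 5.2 and 9.0 microseconds
per query, a gap of under 4 microseconds. That is small enough that fixed
per-query setup could plausibly account for it, though the arithmetic alone
does not prove where the time goes.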


Also beware that IndexSearcher#count will look at index statistics if your
queries have a single term, a shortcut that would no longer work if you use
this query as a filter for another query.

On Tue, Oct 13, 2020 at 12:51 PM Rob Audenaerde <[email protected]>
wrote:

> I reduced the benchmark as far as I could, and now got these results,
> TermsInSet being a lot slower compared to the Terms/SHOULD.
>
>
> BenchmarkOrQuery.benchmarkTerms       thrpt    5  190820.510 ± 16667.411  ops/s
> BenchmarkOrQuery.benchmarkTermsInSet  thrpt    5  110548.345 ±  7490.169  ops/s
>
>
> @Fork(1)
> @Measurement(iterations = 5, time = 10)
> @OutputTimeUnit(TimeUnit.SECONDS)
> @Warmup(iterations = 3, time = 1)
> @Benchmark
> public void benchmarkTerms(final MyState myState) {
>     try {
>         final IndexSearcher searcher =
>             myState.matchedReaders.getIndexSearcher();
>         final BooleanQuery.Builder b = new BooleanQuery.Builder();
>
>         for (final String role : myState.user.getAdditionalRoles()) {
>             b.add(new TermQuery(new Term(roles, new BytesRef(role))),
>                 BooleanClause.Occur.SHOULD);
>         }
>         searcher.count(b.build());
>
>     } catch (final IOException e) {
>         e.printStackTrace();
>     }
> }
>
> @Fork(1)
> @Measurement(iterations = 5, time = 10)
> @OutputTimeUnit(TimeUnit.SECONDS)
> @Warmup(iterations = 3, time = 1)
> @Benchmark
> public void benchmarkTermsInSet(final MyState myState) {
>     try {
>         final IndexSearcher searcher =
>             myState.matchedReaders.getIndexSearcher();
>         final Set<BytesRef> roles = myState.user.getAdditionalRoles().stream()
>             .map(BytesRef::new)
>             .collect(Collectors.toSet());
>         searcher.count(new TermInSetQuery(BenchmarkOrQuery.roles, roles));
>
>     } catch (final IOException e) {
>         e.printStackTrace();
>     }
> }
>
>
> On Tue, Oct 13, 2020 at 11:56 AM Rob Audenaerde <[email protected]>
> wrote:
>
> > Hello Adrien,
> >
> > Thanks for the swift reply. I'll add the details:
> >
> > Lucene version: 8.6.2
> >
> > The restrictionQuery is indeed a conjunction; it allows a document to
> > be a hit if the 'roles' field is empty as well. It's used within a
> > bigger query builder, so maybe I did something else wrong. I'll rewrite
> > the benchmark to just benchmark the TermsInSet and Terms.
> >
> > It never occurred (hah) to me to use Occur.FILTER; that is a good point
> > to check as well.
> >
> > As you put it, I would expect the results to be very similar, as I do not
> > reach the 16 terms in the TermInSet. I'll let you know what I find.
> >
> > On Tue, Oct 13, 2020 at 11:48 AM Adrien Grand <[email protected]> wrote:
> >
> >> Can you give us a few more details:
> >>  - What version of Lucene are you testing?
> >>  - Are you benchmarking "restrictionQuery" on its own, or its
> >>    conjunction with another query?
> >>
> >> You mentioned that you combine your "restrictionQuery" and the user
> >> query with Occur.MUST. Occur.FILTER feels more appropriate for
> >> "restrictionQuery" since it should not contribute to scoring.
> >>
> >> TermInSetQuery automatically executes like a BooleanQuery when the
> >> number of clauses is less than 16, so I would not expect major
> >> performance differences between a TermInSetQuery over less than 16
> >> terms and a BooleanQuery wrapped in a ConstantScoreQuery.
> >>
> >> On Tue, Oct 13, 2020 at 11:35 AM Rob Audenaerde <[email protected]>
> >> wrote:
> >>
> >> > Hello,
> >> >
> >> > I'm benchmarking an application which implements security on Lucene by
> >> > adding a multi-value field "roles". If the user has one of these roles,
> >> > he can find the document.
> >> >
> >> > I implemented this as a Boolean AND query: I added the original query
> >> > and the restriction with Occur.MUST.
> >> >
> >> > I'm having some performance issues when counting the index (>60M docs),
> >> > so I thought about tweaking this restriction implementation.
> >> >
> >> > I set up a benchmark like this:
> >> >
> >> > I generate 2M documents. Each document has a multi-value "roles" field.
> >> > The "roles" field in each document has 4 values, taken from
> >> > (2,2,1000,100) unique values.
> >> > The user has (1,1,2,1) values for roles (so 1 out of 2 for the first
> >> > role, 1 out of 2 for the second, 2 out of 1000 for the third, and
> >> > 1 out of 100 for the fourth).
> >> >
> >> > I got a somewhat unexpected performance difference. At first, I
> >> > implemented the restriction query like this:
> >> >
> >> > for (final String role : roles) {
> >> >     restrictionQuery.add(
> >> >         new TermQuery(new Term("roles", new BytesRef(role))),
> >> >         Occur.SHOULD);
> >> > }
> >> >
> >> > I then switched to a TermInSetQuery, which I thought would be faster
> >> > as it is using constant scores.
> >> >
> >> > final Set<BytesRef> rolesSet =
> >> >     roles.stream().map(BytesRef::new).collect(Collectors.toSet());
> >> > restrictionQuery.add(new TermInSetQuery("roles", rolesSet),
> >> >     Occur.SHOULD);
> >> >
> >> >
> >> > However, the TermInSetQuery has about 25% fewer ops/s. Is that to
> >> > be expected? I did not expect it, as I thought constant scoring would
> >> > be faster.
> >> >
> >>
> >>
> >> --
> >> Adrien
> >>
> >
>


-- 
Adrien
