Ah. That makes sense. Thanks! (I might re-run on a larger index just to learn how it works in more detail)
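For what it's worth, converting the JMH throughput scores quoted below into time per call makes the point concrete. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, the error margins are ignored, and it assumes a single benchmark thread (the JMH default).

```java
// Hypothetical helper, not part of the benchmark: turns JMH "thrpt" scores
// (ops/s) into mean time per operation and a relative slowdown.
class ThroughputToLatency {

    /** Mean microseconds per operation for a throughput given in ops/s. */
    public static double microsPerOp(double opsPerSecond) {
        return 1_000_000.0 / opsPerSecond;
    }

    /** Throughput lost by the slower variant, as a fraction of the faster one. */
    public static double relativeSlowdown(double fasterOps, double slowerOps) {
        return (fasterOps - slowerOps) / fasterOps;
    }

    public static void main(String[] args) {
        // Scores copied from the JMH output quoted in this thread.
        double termsOps = 190820.510;      // BooleanQuery of SHOULD TermQuery clauses
        double termsInSetOps = 110548.345; // TermInSetQuery

        double termsMicros = microsPerOp(termsOps);
        double termsInSetMicros = microsPerOp(termsInSetOps);

        System.out.println("Terms       us/op: " + termsMicros);
        System.out.println("TermsInSet  us/op: " + termsInSetMicros);
        System.out.println("Gap         us/op: " + (termsInSetMicros - termsMicros));
        System.out.println("Slowdown fraction: " + relativeSlowdown(termsOps, termsInSetOps));
    }
}
```

That works out to roughly 5 µs versus 9 µs per count() call, a gap of about 4 µs per query. A fixed per-query cost such as rewriting or weight/scorer setup could plausibly account for a difference that small, which is why a run on a larger index seems like the more telling test.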

On Tue, Oct 13, 2020 at 1:24 PM Adrien Grand <jpou...@gmail.com> wrote:

> 100,000+ requests per core per second is a lot. :) My initial reaction is
> that the query is likely so fast on that index that the bottleneck might be
> rewriting or the initialization of weights/scorers (which don't get more
> costly as the index gets larger) rather than actual query execution, which
> means that we can't really conclude that the boolean query is faster than
> the TermInSetQuery.
>
> Also beware that IndexSearcher#count will look at index statistics if your
> queries have a single term, which would no longer work if you use this
> query as a filter for another query.
>
> On Tue, Oct 13, 2020 at 12:51 PM Rob Audenaerde <rob.audenae...@gmail.com>
> wrote:
>
> > I reduced the benchmark as far as I could, and now got these results,
> > TermsInSet being a lot slower compared to the Terms/SHOULD.
> >
> >
> > BenchmarkOrQuery.benchmarkTerms       thrpt  5  190820.510 ± 16667.411  ops/s
> > BenchmarkOrQuery.benchmarkTermsInSet  thrpt  5  110548.345 ±  7490.169  ops/s
> >
> >
> > @Fork(1)
> > @Measurement(iterations = 5, time = 10)
> > @OutputTimeUnit(TimeUnit.SECONDS)
> > @Warmup(iterations = 3, time = 1)
> > @Benchmark
> > public void benchmarkTerms(final MyState myState) {
> >     try {
> >         final IndexSearcher searcher = myState.matchedReaders.getIndexSearcher();
> >         final BooleanQuery.Builder b = new BooleanQuery.Builder();
> >
> >         for (final String role : myState.user.getAdditionalRoles()) {
> >             b.add(new TermQuery(new Term(roles, new BytesRef(role))),
> >                     BooleanClause.Occur.SHOULD);
> >         }
> >         searcher.count(b.build());
> >
> >     } catch (final IOException e) {
> >         e.printStackTrace();
> >     }
> > }
> >
> > @Fork(1)
> > @Measurement(iterations = 5, time = 10)
> > @OutputTimeUnit(TimeUnit.SECONDS)
> > @Warmup(iterations = 3, time = 1)
> > @Benchmark
> > public void benchmarkTermsInSet(final MyState myState) {
> >     try {
> >         final IndexSearcher searcher =
> >                 myState.matchedReaders.getIndexSearcher();
> >         final Set<BytesRef> roles = myState.user.getAdditionalRoles().stream()
> >                 .map(BytesRef::new).collect(Collectors.toSet());
> >         searcher.count(new TermInSetQuery(BenchmarkOrQuery.roles, roles));
> >
> >     } catch (final IOException e) {
> >         e.printStackTrace();
> >     }
> > }
> >
> >
> > On Tue, Oct 13, 2020 at 11:56 AM Rob Audenaerde <rob.audenae...@gmail.com>
> > wrote:
> >
> > > Hello Adrien,
> > >
> > > Thanks for the swift reply. I'll add the details:
> > >
> > > Lucene version: 8.6.2
> > >
> > > The restrictionQuery is indeed a conjunction; it allows for a document to
> > > be a hit if the 'roles' field is empty as well. It's used within a
> > > bigger query builder, so maybe I did something else wrong. I'll rewrite the
> > > benchmark to just benchmark the TermsInSet and Terms.
> > >
> > > It never occurred (hah) to me to use Occur.FILTER; that is a good point to
> > > check as well.
> > >
> > > As you put it, I would expect the results to be very similar, as I do not
> > > reach the 16 terms in the TermInSet. I'll let you know what I find.
> > >
> > > On Tue, Oct 13, 2020 at 11:48 AM Adrien Grand <jpou...@gmail.com> wrote:
> > >
> > >> Can you give us a few more details:
> > >>  - What version of Lucene are you testing?
> > >>  - Are you benchmarking "restrictionQuery" on its own, or its conjunction
> > >>    with another query?
> > >>
> > >> You mentioned that you combine your "restrictionQuery" and the user query
> > >> with Occur.MUST. Occur.FILTER feels more appropriate for "restrictionQuery"
> > >> since it should not contribute to scoring.
> > >>
> > >> TermInSetQuery automatically executes like a BooleanQuery when the number
> > >> of clauses is less than 16, so I would not expect major performance
> > >> differences between a TermInSetQuery over less than 16 terms and a
> > >> BooleanQuery wrapped in a ConstantScoreQuery.
> > >>
> > >> On Tue, Oct 13, 2020 at 11:35 AM Rob Audenaerde <rob.audenae...@gmail.com>
> > >> wrote:
> > >>
> > >> > Hello,
> > >> >
> > >> > I'm benchmarking an application which implements security on Lucene by
> > >> > adding a multi-value field "roles". If the user has one of these roles, he
> > >> > can find the document.
> > >> >
> > >> > I implemented this as a Boolean AND query, adding the original query and the
> > >> > restriction with Occur.MUST.
> > >> >
> > >> > I'm having some performance issues when counting the index (>60M docs), so
> > >> > I thought about tweaking this restriction implementation.
> > >> >
> > >> > I set up a benchmark like this:
> > >> >
> > >> > I generate 2M documents. Each document has a multi-value "roles" field. The
> > >> > "roles" field in each document has 4 values, taken from (2,2,1000,100)
> > >> > unique values.
> > >> > The user has (1,1,2,1) values for roles (so, 1 out of the 2 for the first
> > >> > role, 1 out of the 2 for the second, 2 out of the 1000 for the third value,
> > >> > and 1 out of the 100 for the fourth).
> > >> >
> > >> > I got a somewhat unexpected performance difference. At first, I implemented
> > >> > the restriction query like this:
> > >> >
> > >> > for (final String role : roles) {
> > >> >     restrictionQuery.add(new TermQuery(new Term("roles", new BytesRef(role))),
> > >> >             Occur.SHOULD);
> > >> > }
> > >> >
> > >> > I then switched to a TermInSetQuery, which I thought would be faster
> > >> > as it is using constant scores.
> > >> >
> > >> > final Set<BytesRef> rolesSet =
> > >> >         roles.stream().map(BytesRef::new).collect(Collectors.toSet());
> > >> > restrictionQuery.add(new TermInSetQuery("roles", rolesSet), Occur.SHOULD);
> > >> >
> > >> >
> > >> > However, the TermInSetQuery gets about 25% fewer ops/s. Is that to
> > >> > be expected? I did not expect it, as I thought the constant scoring would
> > >> > be faster.
> > >> >
> > >>
> > >>
> > >> --
> > >> Adrien
> > >>
> > >
> >
>
> --
> Adrien
>