unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

2020-10-13 Thread Rob Audenaerde
Hello,

I'm benchmarking an application that implements security on Lucene by
adding a multi-valued field "roles". A user who has one of these roles
can find the document.

I implemented this as a BooleanQuery conjunction: I add the original query
and the restriction, both with Occur.MUST.

I'm having some performance issues when counting hits on the index (>60M
docs), so I thought about tweaking this restriction implementation.

I set up a benchmark like this:

I generate 2M documents. Each document has a multi-valued "roles" field with
4 values, drawn from pools of (2, 2, 1000, 100) unique values respectively.
The user has (1, 1, 2, 1) values for roles: 1 out of the 2 for the first
role, 1 out of 2 for the second, 2 out of the 1000 for the third, and
1 out of 100 for the fourth.
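
As a rough sanity check on this setup, the fraction of documents the user can see can be estimated in closed form. The sketch below assumes (this is not stated in the description) that each of the four role slots is drawn uniformly and independently from its own pool; the class and method names are made up for illustration.

```java
public class RoleMatchEstimate {

    /**
     * Probability that a document shares at least one role with the user,
     * assuming each slot is drawn uniformly and independently from its pool.
     */
    static double matchProbability(int[] poolSizes, int[] userRolesPerSlot) {
        double missAll = 1.0;
        for (int i = 0; i < poolSizes.length; i++) {
            // Chance that the document's value in slot i is not one of the user's roles.
            missAll *= (double) (poolSizes[i] - userRolesPerSlot[i]) / poolSizes[i];
        }
        return 1.0 - missAll;
    }

    public static void main(String[] args) {
        double p = matchProbability(new int[] {2, 2, 1000, 100}, new int[] {1, 1, 2, 1});
        System.out.printf("match probability: %.4f%n", p);
        System.out.printf("expected matches in 2M docs: %.0f%n", p * 2_000_000);
    }
}
```

Under these assumptions roughly three quarters of the documents match, so the restriction on its own is not very selective.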

I got a somewhat unexpected performance difference. At first, I implemented
the restriction query like this:

for (final String role : roles) {
    restrictionQuery.add(
        new TermQuery(new Term("roles", new BytesRef(role))), Occur.SHOULD);
}

I then switched to a TermInSetQuery, which I thought would be faster
since it uses constant scoring:

final Set<BytesRef> rolesSet =
    roles.stream().map(BytesRef::new).collect(Collectors.toSet());
restrictionQuery.add(new TermInSetQuery("roles", rolesSet), Occur.SHOULD);


However, the TermInSetQuery achieves about 25% fewer ops/s. Is that to
be expected? I did not expect it, as I thought constant scoring would be
faster.


Re: Deduplication of search result with custom with custom sort

2020-10-13 Thread Dmitry Emets
I studied the Las Vegas patch and it gave me one simple idea:
FirstPassGroupingCollector collects CollectedSearchGroup objects internally,
and each CollectedSearchGroup holds a docId and sortValues. This is exactly
what I need. Thanks for the help!
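
The underlying idea, keeping only the best-ranked document per duplicate key while walking hits in sort order, can be sketched in plain Java. This is a stand-in for illustration only, not Lucene's collector implementation; `Hit`, `groupKey`, and `dedupe` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {

    /** A search hit with a duplicate key (e.g. a video hash); hypothetical type. */
    static final class Hit {
        final int docId;
        final String groupKey;

        Hit(int docId, String groupKey) {
            this.docId = docId;
            this.groupKey = groupKey;
        }
    }

    /**
     * Keeps the first (best-ranked) hit per group key, up to limit results.
     * The input must already be in the desired sort order.
     */
    static List<Hit> dedupe(List<Hit> sortedHits, int limit) {
        // LinkedHashMap preserves insertion order, so the ranking is kept.
        Map<String, Hit> firstPerGroup = new LinkedHashMap<>();
        for (Hit hit : sortedHits) {
            if (firstPerGroup.size() >= limit) {
                break;
            }
            firstPerGroup.putIfAbsent(hit.groupKey, hit);
        }
        return new ArrayList<>(firstPerGroup.values());
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(7, "a"));
        hits.add(new Hit(3, "b"));
        hits.add(new Hit(9, "a")); // duplicate of docId 7, dropped
        hits.add(new Hit(4, "c"));
        for (Hit hit : dedupe(hits, 2)) {
            System.out.println(hit.docId + " " + hit.groupKey);
        }
    }
}
```

The same routine also covers the over-fetch approach suggested in the quoted reply below: retrieve more hits than needed without grouping, then deduplicate and truncate.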

On Mon, Oct 12, 2020 at 17:38, Diego Ceccarelli (BLOOMBERG/ LONDON) <
dceccarel...@bloomberg.net> wrote:

> > https://issues.apache.org/jira/browse/SOLR-11831
>
> I collaborated on the Las Vegas patch. I don't think it will be merged, since
> it modifies too many things in the core; we ended up reimplementing it as a
> standalone plugin. Also keep in mind that the patch makes a difference only
> if you are using SolrCloud, while it seems that you are using Lucene.
>
> Do you really need to return 1000 results to the user? Is this for paging
> purposes?
>
> Do you know how frequent the groups are? If they are not too frequent and
> you are not strict about 1000, you might retrieve more, say 2000, without
> grouping and then do the deduplication afterwards.
>
> Cheers,
> Diego
>
>
> From: java-user@lucene.apache.org At: 10/12/20 13:02:46 To:
> java-user@lucene.apache.org
> Subject: Re: Deduplication of search result with custom with custom sort
>
> Thank you very much for helping!
>
> There isn't much I can add about my use case. I have user-generated video
> titles and hash codes by which I can understand that these are the same
> videos. Users search videos by title and I should return the top 1000
> unique videos to them.
>
> I will try to use grouping without counting groups. Otherwise I'll look
> here https://issues.apache.org/jira/browse/SOLR-11831 or here
> https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
>
> Thanks again!
>
> On Fri, Oct 9, 2020 at 18:57, Jigar Shah wrote:
>
> > What I learned dealing with this problem:
> >
> > We faced a similar problem before, and did the following things:
> >
> > 1) Don't request totalGroupCount if you can live without it; computing
> > the group count is expensive, and the response was fast without it.
> > You can still approximate pagination: page up to the total count, and
> > since the group count will be lower, stop paginating when you get empty
> > results.
> > 2) Have more shards, so you can get the best out of parallel execution.
> >
> > I have seen use cases with 60M total documents, deduplicating on a doc
> > values field, with 4 shards.
> >
> > Query time SLA is around 5-6 seconds. Not unbearable for users.
> >
> > Let me know if you find a better solution.
> >
> > On Fri, Oct 9, 2020 at 11:45 AM Diego Ceccarelli (BLOOMBERG/ LONDON) <
> > dceccarel...@bloomberg.net> wrote:
> >
> > > As Erick said, can you tell us a bit more about the use case?
> > > There might be another way to achieve the same result.
> > >
> > > What are these documents?
> > > Why do you need 1000 docs per user?
> > >
> > >
> > > From: java-user@lucene.apache.org At: 10/09/20 14:25:02 To:
> > > java-user@lucene.apache.org
> > > Subject: Re: Deduplication of search result with custom with custom sort
> > >
> > > 6_500_000 is the total count of groups in the entire collection. I only
> > > return the top 1000 to users.
> > > I use Lucene, where I have documents that can share the same docvalue, and
> > > I want to deduplicate these documents by this docvalue during search.
> > > Also, I sort my documents by multiple fields, and because of this I can't
> > > use DiversifiedTopDocsCollector, which works with relevance score only.
> > >
> > > On Fri, Oct 9, 2020 at 16:02, Erick Erickson wrote:
> > >
> > > > This is going to be fairly painful. You need to keep a list 6.5M
> > > > items long, sorted.
> > > >
> > > > Before diving in there, I’d really back up and ask what the use-case
> > > > is. Returning 6.5M docs to a user is useless, so are you doing
> > > > some kind of analytics maybe? In which case, and again
> > > > assuming you’re using Solr, Streaming Aggregation might
> > > > be a better option.
> > > >
> > > > This really sounds like an XY problem. You’re trying to solve problem X
> > > > and asking how to accomplish it with Y. What I’m questioning
> > > > is whether Y (grouping) is a good approach or not. Perhaps if
> > > > you explained X there’d be a better suggestion.
> > > >
> > > > Best,
> > > > Erick
> > > >
> > > > > On Oct 9, 2020, at 8:19 AM, Dmitry Emets wrote:
> > > > >
> > > > > I have 12_000_000 documents, 6_500_000 groups
> > > > >
> > > > > With sort: It takes around 1 sec without grouping, 2 sec with
> > > > > grouping and 12 sec with setAllGroups(true)
> > > > > Without sort: It takes around 0.2 sec without grouping, 0.6 sec with
> > > > > grouping and 10 sec with setAllGroups(true)
> > > > >
> > > > > Thank you, Erick, I will look into it
> > > > >
> > > > > On Fri, Oct 9, 2020 at 14:32, Erick Erickson <erickerick...@gmail.com> wrote:
> > > > >
> > > > >> At the Solr level, see CollapsingQParserPlugin:
> > > > >> https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
> > > > >>

Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

2020-10-13 Thread Adrien Grand
Can you give us a few more details:
 - What version of Lucene are you testing?
 - Are you benchmarking "restrictionQuery" on its own, or its conjunction
with another query?

You mentioned that you combine your "restrictionQuery" and the user query
with Occur.MUST. Occur.FILTER feels more appropriate for "restrictionQuery",
since it should not contribute to scoring.

TermInSetQuery automatically executes like a BooleanQuery when the number
of clauses is less than 16, so I would not expect major performance
differences between a TermInSetQuery over fewer than 16 terms and a
BooleanQuery wrapped in a ConstantScoreQuery.


-- 
Adrien


Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

2020-10-13 Thread Rob Audenaerde
Hello Adrien,

Thanks for the swift reply. I'll add the details:

Lucene version: 8.6.2

The restrictionQuery is indeed a conjunction; it allows a document to
be a hit if the 'roles' field is empty as well. It's used within a
bigger query builder, so maybe I did something else wrong. I'll rewrite the
benchmark to compare just the TermInSetQuery and the TermQuery variants.

It never occurred (hah) to me to use Occur.FILTER; that is a good point to
check as well.

As you put it, I would expect the results to be very similar, as I do not
reach 16 terms in the TermInSetQuery. I'll let you know what I find.


Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

2020-10-13 Thread Rob Audenaerde
I reduced the benchmark as far as I could, and now got these results, with
the TermInSetQuery being a lot slower than the TermQuery/SHOULD variant.


Benchmark                             Mode   Cnt       Score        Error  Units
BenchmarkOrQuery.benchmarkTerms       thrpt    5  190820.510 ± 16667.411  ops/s
BenchmarkOrQuery.benchmarkTermsInSet  thrpt    5  110548.345 ±  7490.169  ops/s


@Fork(1)
@Measurement(iterations = 5, time = 10)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 1)
@Benchmark
public void benchmarkTerms(final MyState myState) {
    try {
        final IndexSearcher searcher = myState.matchedReaders.getIndexSearcher();
        final BooleanQuery.Builder b = new BooleanQuery.Builder();

        // One SHOULD clause per role: a document matches if it has any of them.
        for (final String role : myState.user.getAdditionalRoles()) {
            b.add(new TermQuery(new Term(roles, new BytesRef(role))),
                    BooleanClause.Occur.SHOULD);
        }
        searcher.count(b.build());
    } catch (final IOException e) {
        e.printStackTrace();
    }
}

@Fork(1)
@Measurement(iterations = 5, time = 10)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 1)
@Benchmark
public void benchmarkTermsInSet(final MyState myState) {
    try {
        final IndexSearcher searcher = myState.matchedReaders.getIndexSearcher();
        // Same roles, expressed as a single TermInSetQuery over the field.
        final Set<BytesRef> roles = myState.user.getAdditionalRoles().stream()
                .map(BytesRef::new)
                .collect(Collectors.toSet());
        searcher.count(new TermInSetQuery(BenchmarkOrQuery.roles, roles));
    } catch (final IOException e) {
        e.printStackTrace();
    }
}



Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

2020-10-13 Thread Adrien Grand
100,000+ requests per core per second is a lot. :) My initial reaction is
that the query is likely so fast on that index that the bottleneck might be
rewriting or the initialization of weights/scorers (which don't get more
costly as the index gets larger) rather than actual query execution, which
means that we can't really conclude that the boolean query is faster than
the TermInSetQuery.

Also beware that IndexSearcher#count uses index statistics when the query
has a single term, a shortcut that no longer applies once you use this
query as a filter for another query.
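
To put the reported throughput in perspective, it helps to convert ops/s into time per query. The snippet below is only arithmetic on the two JMH scores posted earlier in the thread; the class and method names are illustrative.

```java
public class ThroughputToLatency {

    /** Converts a throughput in operations per second to microseconds per operation. */
    static double microsPerOp(double opsPerSecond) {
        return 1_000_000.0 / opsPerSecond;
    }

    public static void main(String[] args) {
        double terms = microsPerOp(190820.510);      // BooleanQuery of TermQuery clauses
        double termsInSet = microsPerOp(110548.345); // TermInSetQuery
        System.out.printf("benchmarkTerms:      %.2f us/op%n", terms);
        System.out.printf("benchmarkTermsInSet: %.2f us/op%n", termsInSet);
        System.out.printf("difference:          %.2f us/op%n", termsInSet - terms);
    }
}
```

That works out to roughly 5.2 versus 9.0 microseconds per query, a gap of under 4 microseconds, which is small enough to be plausibly explained by fixed per-query setup cost rather than by matching work.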

Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

2020-10-13 Thread Rob Audenaerde
Ah. That makes sense. Thanks!

(I might re-run on a larger index just to learn how it works in more detail)
