Re: An interesting case

2021-06-08 Thread baris . kazar
May I please make a suggestion again?

The Lucene Javadocs need to be enhanced. They should give more information,
explain the parameters and, more importantly, explain why these two classes
(TopScoreDocCollector vs IndexSearcher) differ in performance.



Thanks


Re: An interesting case

2021-06-08 Thread baris . kazar

Yes, I sometimes see 4000+ and sometimes 3000+ hits from totalHits.

So TopScoreDocCollector is working underneath the IndexSearcher.search API,
right?

In other words, using TopScoreDocCollector directly will save time, right?

Thanks


Re: An interesting case

2021-06-08 Thread Adrien Grand
Yes, for instance if you care about the top 10 hits only, you could call
TopScoreDocCollector.create(10, null, 10). By default, IndexSearcher is
configured to count at least 1,000 hits, and (for a top-10 search) creates its
top docs collector with TopScoreDocCollector.create(10, null, 1000).
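
To illustrate why the third argument matters, here is a minimal, library-free
sketch in Java. It is a toy model and not Lucene code: ThresholdSketch, Outcome
and collect are hypothetical names, the scores are made up, and the model
assumes that every non-competitive hit can be skipped perfectly once the
threshold has been reached. It only shows the idea that hits are counted
exactly up to the threshold and that the count becomes a lower bound afterwards.

```java
import java.util.PriorityQueue;

// Toy model of top-n collection with a hit-count threshold; NOT Lucene code.
// All names (ThresholdSketch, Outcome, collect) are hypothetical.
final class ThresholdSketch {

    static final class Outcome {
        final int counted;   // hits that were actually visited and counted
        final boolean exact; // false once at least one hit was skipped

        Outcome(int counted, boolean exact) {
            this.counted = counted;
            this.exact = exact;
        }
    }

    // matchScores: the score of every matching document, in visiting order.
    static Outcome collect(double[] matchScores, int numHits, int threshold) {
        if (numHits < 1) {
            throw new IllegalArgumentException("numHits must be at least 1");
        }
        // Min-heap holding the best numHits scores seen so far.
        PriorityQueue<Double> best = new PriorityQueue<Double>();
        int counted = 0;
        boolean exact = true;
        for (double score : matchScores) {
            boolean full = best.size() == numHits;
            // Assumption of this model: once the threshold is reached, a hit
            // that cannot enter the top numHits is skipped without counting.
            if (full && counted >= threshold && score <= best.peek()) {
                exact = false;
                continue;
            }
            counted++;
            if (!full) {
                best.add(score);
            } else if (score > best.peek()) {
                best.poll();
                best.add(score);
            }
        }
        return new Outcome(counted, exact);
    }

    public static void main(String[] args) {
        // 5000 matches with decreasing scores, so late hits are never competitive.
        double[] scores = new double[5000];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = 5000 - i;
        }
        Outcome low = collect(scores, 10, 10);
        Outcome dflt = collect(scores, 10, 1000);
        System.out.println("threshold 10:   " + low.counted + (low.exact ? "" : "+"));
        System.out.println("threshold 1000: " + dflt.counted + (dflt.exact ? "" : "+"));
    }
}
```

In this model a threshold equal to numHits stops counting as early as possible,
while a threshold of 1000 keeps counting for longer. Real skipping is presumably
less precise than in this sketch, which would be consistent with the 3000+ and
4000+ values reported in this thread.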

Re: An interesting case

2021-06-08 Thread baris . kazar

OK, I think you meant something else here.

You are not referring to the calculation of the total number of hits or to the
mismatch, right?

So to make Lucene do the minimum work to reach the matched docs,
TopScoreDocCollector should be used, right?

Let me check this class.

Thanks


Re: An interesting case

2021-06-08 Thread baris . kazar

Adrien, my concern is not actually the number mismatch. As I mentioned, it is
the performance.

Seeing those numbers mismatch, it seems that Lucene is still doing the same
amount of work to get results, no matter how many results you ask for in the
IndexSearcher search API.

I thought I was clear on that.

Lucene should not spend any energy on the count, as scoreDocs already has
that.

But seeing such a high totalHits number worries me, as I explained above.


Best regards


Re: An interesting case

2021-06-08 Thread Adrien Grand
If you don't need any information about the total hit count, you could
create a TopScoreDocCollector that has the same value for numHits
and totalHitsThreshold. This way Lucene will spend as little energy as
possible computing the number of matches of the query.


-- 
Adrien


Re: An interesting case

2021-06-08 Thread baris . kazar
I am currently happy with Lucene's performance, but I want to understand it
and speed it up further by limiting the results concretely. So I still do not
know why totalHits and scoreDocs report different numbers of hits.


Best regards


Re: An interesting case

2021-06-07 Thread Baris Kazar
My worry is actually about Lucene's performance.

If Lucene collects thousands of hits instead of just n hits (where n <<< a
couple of thousand), then this creates a performance issue.

The ScoreDoc array is OK, as I mentioned, i.e. it has size n.
I will check the count API.

Best regards

From: Adrien Grand 
Sent: Tuesday, June 8, 2021 2:46 AM
To: Lucene Users Mailing List
Cc: Baris Kazar
Subject: Re: An interesting case

When you call IndexSearcher#search(Query query, int n), there are two cases:
 - either your query matches n hits or more, and the TopDocs object will have a 
ScoreDoc[] array that contains the n best scoring hits sorted by descending 
score,
 - or your query matches fewer than n hits, and then the TopDocs object will have 
all matches in the ScoreDoc[] array, sorted by descending score.

In both cases, TopDocs#totalHits gives information about the total number of 
matches of the query. On older versions of Lucene (<7.0) this is an integer 
that is always accurate, while on more recent versions of Lucene (>= 8.0) it is 
a lower bound of the total number of matches. It typically returns the number 
of collected documents indeed, though this is an implementation detail that 
might change in the future.

If you want to count the number of matches of a Query precisely, you can use 
IndexSearcher#count.

On Tue, Jun 8, 2021 at 7:51 AM 
mailto:baris.ka...@oracle.com>> wrote:
https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search<https://urldefense.com/v3/__https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search__;!!GqivPVa7Brio!JjLGw8TaYQcqSC7BtpPSZl5dl-WqgwwcgGFhOqHSUKIsCaTSNpoDvOJjq0BbkQhfpw$>

looks like someone else also had this problem, too.

Any suggestions please?

Best regards


On 6/8/21 1:36 AM, baris.ka...@oracle.com<mailto:baris.ka...@oracle.com> wrote:
> Hi,-
>
>  I use IndexSearcher.search API with two parameters like Query and int
> number (i set as 20).
>
> However, when i look at the TopDocs object which is the result of this
> above API call
>
> i see thousands of hits from totalhits. Is this inaccurate or Lucene
> is doing actually search based on that many results?
>
> But when i iterate over result of above API call's scoreDocs object i
> get int number of hits (ie, 20 hits).
>
>
> I am trying to find out why org.apache.lucene.search.Topdocs.TotalHits
> report a number of collected results than
>
> the actual number of results. I see on the order of couple of
> thousands vs 20.
>
>
> Best regards
>
>
>

-
To unsubscribe, e-mail: 
java-user-unsubscr...@lucene.apache.org<mailto:java-user-unsubscr...@lucene.apache.org>
For additional commands, e-mail: 
java-user-h...@lucene.apache.org<mailto:java-user-h...@lucene.apache.org>



--
Adrien


Re: An interesting case

2021-06-07 Thread Adrien Grand
When you call IndexSearcher#search(Query query, int n), there are two cases:
 - either your query matches n hits or more, and the TopDocs object will
have a ScoreDoc[] array that contains the n best scoring hits sorted by
descending score,
 - or your query matches fewer than n hits, and then the TopDocs object will
have all matches in the ScoreDoc[] array, sorted by descending score.

In both cases, TopDocs#totalHits gives information about the total number
of matches of the query. On older versions of Lucene (<7.0) this is an
integer that is always accurate, while on more recent versions of Lucene
(>= 8.0) it is a lower bound of the total number of matches. It typically
returns the number of collected documents indeed, though this is an
implementation detail that might change in the future.

If you want to count the number of matches of a Query precisely, you can
use IndexSearcher#count.
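
To make the two cases and the lower-bound behaviour concrete, here is a small
toy model in plain Java. It is only a sketch of the semantics described above,
not Lucene code: the class TopNSketch, its Result type, and the countLimit
parameter are all made up for illustration. Real collectors can also skip
non-competitive documents, which this sketch does not attempt to model.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Toy model only: none of these names exist in Lucene.
final class TopNSketch {

    /** Hypothetical stand-in for the object returned by a top-n search. */
    static final class Result {
        final List<Double> topScores; // plays the role of the ScoreDoc[] array
        final long hitCount;          // plays the role of totalHits
        final boolean exact;          // false means hitCount is only a lower bound

        Result(List<Double> topScores, long hitCount, boolean exact) {
            this.topScores = topScores;
            this.hitCount = hitCount;
            this.exact = exact;
        }
    }

    /**
     * Keeps the n best scores and counts hits only up to countLimit
     * (an assumed cap, chosen by the caller for this illustration).
     */
    static Result search(double[] matchScores, int n, int countLimit) {
        PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap of best scores
        long counted = 0;
        for (double score : matchScores) {
            if (counted < countLimit) {
                counted++; // stop counting once the cap is reached
            }
            heap.offer(score);
            if (heap.size() > n) {
                heap.poll(); // evict the current worst score
            }
        }
        List<Double> top = new ArrayList<>(heap);
        top.sort(Collections.reverseOrder()); // descending score
        boolean exact = matchScores.length <= countLimit;
        return new Result(top, counted, exact);
    }

    /** Exact number of matches, analogous in spirit to a dedicated count call. */
    static int count(double[] matchScores) {
        return matchScores.length;
    }

    public static void main(String[] args) {
        double[] scores = new double[5000];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = i;
        }
        Result r = search(scores, 20, 1000);
        System.out.println("returned hits: " + r.topScores.size());
        System.out.println("hit count: " + r.hitCount + (r.exact ? "" : " (lower bound)"));
        System.out.println("exact count: " + count(scores));
    }
}
```

In this sketch, 5000 matching scores with n = 20 and countLimit = 1000 give a
result list of 20 entries, while hitCount stops at 1000 and is flagged as a
lower bound; count(...) still reports all 5000. With fewer than n matches, the
list simply contains every match and the count is exact.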


-- 
Adrien


Re: An interesting case

2021-06-07 Thread baris . kazar

https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search

It looks like someone else has run into this problem, too.

Any suggestions, please?

Best regards





An interesting case

2021-06-07 Thread baris . kazar

Hi,

I use the IndexSearcher.search API with two parameters, a Query and an int
number (which I set to 20).

However, when I look at the TopDocs object that this call returns, I see
thousands of hits in totalHits. Is this inaccurate, or is Lucene actually
searching over that many results?

But when I iterate over the scoreDocs array of the same result, I get exactly
that int number of hits (i.e., 20 hits).

I am trying to find out why org.apache.lucene.search.TopDocs.totalHits
reports a larger number of collected results than the actual number of
results returned. I see on the order of a couple of thousand vs. 20.



Best regards




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org