Re: An interesting case
May I suggest again that the Lucene Javadocs be enhanced? They need more information, including an explanation of the parameters and, more importantly, of why these two classes (TopScoreDocCollector vs. IndexSearcher) differ in performance.

Thanks
Re: An interesting case
From: baris.ka...@oracle.com
Sent: 6/8/21 2:07 PM

Yes, I see sometimes 4000+ and sometimes 3000+ hits from totalHits.

So TopScoreDocCollector is working underneath the IndexSearcher.search API, right? In other words, TopScoreDocCollector will save time, right?

Thanks
Re: An interesting case
From: Adrien Grand
Sent: 6/8/21 1:27 PM

Yes, for instance if you care about the top 10 hits only, you could call TopScoreDocCollector.create(10, null, 10). By default, IndexSearcher is configured to count at least 1,000 hits, and for a top-10 search it creates its top docs collector with TopScoreDocCollector.create(10, null, 1000).
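To make the effect of the third argument more concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class `ThresholdTopNModel`, its `collect` method, and the `demoScores` data are all hypothetical names invented for illustration. It assumes a simplified reading of the behaviour described above: hits are counted exactly until the threshold is passed, and after that, hits that cannot enter the top n are no longer collected, so the reported count becomes a lower bound.

```java
import java.util.PriorityQueue;

/**
 * Toy model only: this is NOT Lucene code. It imitates a top-n collector that
 * counts hits exactly until a threshold is passed and afterwards ignores hits
 * that cannot enter the top n.
 */
final class ThresholdTopNModel {

    /** Outcome of one simulated search. */
    static final class Result {
        final float[] topScores; // best scores, highest first
        final int countedHits;   // hits the collector actually saw
        final boolean exact;     // false: countedHits is only a lower bound

        Result(float[] topScores, int countedHits, boolean exact) {
            this.topScores = topScores;
            this.countedHits = countedHits;
            this.exact = exact;
        }
    }

    static Result collect(float[] matchScores, int numHits, int totalHitsThreshold) {
        if (numHits < 1) {
            throw new IllegalArgumentException("numHits must be >= 1");
        }
        // Min-heap: the weakest score that is still in the top n sits on top.
        PriorityQueue<Float> heap = new PriorityQueue<>(numHits);
        int counted = 0;
        boolean skipping = false;
        for (float score : matchScores) {
            if (skipping && score <= heap.peek()) {
                continue; // non-competitive hit: skipped without being counted
            }
            counted++;
            if (heap.size() < numHits) {
                heap.add(score);
            } else if (score > heap.peek()) {
                heap.poll();
                heap.add(score);
            }
            // Once more than totalHitsThreshold hits were counted and the heap
            // is full, stop counting hits that cannot change the result.
            if (!skipping && counted > totalHitsThreshold && heap.size() == numHits) {
                skipping = true;
            }
        }
        float[] top = new float[heap.size()];
        for (int i = top.length - 1; i >= 0; i--) {
            top[i] = heap.poll(); // heap yields ascending order, so fill from the end
        }
        return new Result(top, counted, !skipping);
    }

    /** 5,000 distinct hypothetical scores in a fixed, scrambled order. */
    static float[] demoScores() {
        float[] scores = new float[5000];
        for (int i = 0; i < scores.length; i++) {
            // 7919 is coprime with 5000, so every value 0..4999 appears exactly once.
            scores[i] = (i * 7919) % 5000;
        }
        return scores;
    }

    public static void main(String[] args) {
        float[] scores = demoScores();
        for (int threshold : new int[] {10, 1000, Integer.MAX_VALUE}) {
            Result r = collect(scores, 10, threshold);
            System.out.println("threshold=" + threshold
                + " counted=" + r.countedHits + (r.exact ? "" : "+")
                + " returned=" + r.topScores.length);
        }
    }
}
```

In this model the returned top 10 is the same for every threshold; only the number of counted hits changes. That mirrors the point of the message: a lower threshold does not change which hits come back, it only lets the collector stop counting earlier. How much real work Lucene saves depends on the query and the index, which this sketch does not model.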
Re: An interesting case
From: baris.ka...@oracle.com
Sent: Tue, Jun 8, 2021 at 7:19 PM

OK, I think you meant something else here. You are not referring to the total-hits calculation or the mismatch, right?

So to make Lucene do the minimum work needed to reach the matched docs, TopScoreDocCollector should be used, right?

Let me check this class.

Thanks
Re: An interesting case
From: baris.ka...@oracle.com
Sent: 6/8/21 1:16 PM

Adrien, my concern is not actually the number mismatch; as I mentioned, it is the performance.

Seeing those numbers mismatch, it seems that Lucene is still doing the same amount of work to get results no matter how many results you ask for in the IndexSearcher search API. I thought I was clear on that.

Lucene should not spend any energy on the count, as scoreDocs already has that. But seeing such a high totalHits number worries me, as I explained above.

Best regards
Re: An interesting case
From: Adrien Grand
Sent: 6/8/21 1:12 PM

If you don't need any information about the total hit count, you could create a TopScoreDocCollector that has the same value for numHits and totalHitsThreshold. This way Lucene will spend as little energy as possible computing the number of matches of the query.

-- Adrien
Re: An interesting case
From: baris.ka...@oracle.com
Sent: Tue, Jun 8, 2021 at 6:28 PM

I am currently happy with Lucene's performance, but I want to understand it and speed it up further by limiting the results concretely. So I still do not know why totalHits and scoreDocs report different numbers of hits.

Best regards
Re: An interesting case
From: Baris Kazar
Sent: 6/8/21 2:52 AM

My worry is actually about Lucene's performance. If Lucene collects thousands of hits instead of just n hits (where n is far smaller than a couple of thousand), then this creates a performance issue.

The ScoreDoc array is OK, as I mentioned, i.e. it has size n. I will check the count API.

Best regards
Re: An interesting case
When you call IndexSearcher#search(Query query, int n), there are two cases:
- either your query matches n hits or more, and the TopDocs object will have a ScoreDoc[] array that contains the n best-scoring hits, sorted by descending score,
- or your query matches fewer than n hits, and then the TopDocs object will have all matches in the ScoreDoc[] array, sorted by descending score.

In both cases, TopDocs#totalHits gives information about the total number of matches of the query. On older versions of Lucene (<7.0) this is an integer that is always accurate, while on more recent versions of Lucene (>= 8.0) it is a lower bound of the total number of matches. It does typically return the number of collected documents, though this is an implementation detail that might change in the future.

If you want to count the number of matches of a Query precisely, you can use IndexSearcher#count.

On Tue, Jun 8, 2021 at 7:51 AM <baris.ka...@oracle.com> wrote:

> https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search
>
> Looks like someone else also had this problem.
>
> Any suggestions, please?
>
> Best regards

--
Adrien
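The two cases described above can be illustrated with a small self-contained sketch. This is a toy model in plain Java, not Lucene code: `TopNModel`, `Result`, and `search` are hypothetical names, and the scores are made up. It assumes every match is visited and only the n best are kept in a bounded heap, which is one way the hit count can be in the thousands while only 20 hits come back. On Lucene >= 8.0 the reported count may be only a lower bound, as noted above, so IndexSearcher#count remains the way to get an exact number.

```java
import java.util.PriorityQueue;

/** Toy model of a top-n search. Illustration only; this is NOT Lucene's implementation. */
public class TopNModel {

    /** Hypothetical stand-in for TopDocs: a total match count plus the best hits. */
    public static final class Result {
        public final long totalHits;     // number of documents that matched
        public final float[] topScores;  // at most n scores, best first

        Result(long totalHits, float[] topScores) {
            this.totalHits = totalHits;
            this.topScores = topScores;
        }
    }

    /** Counts every matching score but retains only the n best in a min-heap. */
    public static Result search(float[] matchingScores, int n) {
        PriorityQueue<Float> heap = new PriorityQueue<>(); // natural order: smallest on top
        long totalHits = 0;
        for (float score : matchingScores) {
            totalHits++;          // every match is counted
            heap.offer(score);
            if (heap.size() > n) {
                heap.poll();      // evict the weakest hit so at most n are kept
            }
        }
        float[] top = new float[heap.size()];
        for (int i = top.length - 1; i >= 0; i--) {
            top[i] = heap.poll(); // heap pops ascending, so fill from the back
        }
        return new Result(totalHits, top);
    }

    public static void main(String[] args) {
        float[] scores = new float[3000];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = (i * 37 % 3000) / 3000f; // arbitrary made-up scores
        }

        Result many = search(scores, 20);
        // prints "3000 matches, 20 returned"
        System.out.println(many.totalHits + " matches, " + many.topScores.length + " returned");

        Result few = search(new float[] {0.3f, 0.9f, 0.5f}, 20);
        // prints "3 matches, 3 returned"
        System.out.println(few.totalHits + " matches, " + few.topScores.length + " returned");
    }
}
```

In this model the size of the returned array is min(n, number of matches), while the count reflects everything that matched, which mirrors the thousands-versus-20 observation in the original question.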
Re: An interesting case
https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search

Looks like someone else also had this problem.

Any suggestions, please?

Best regards

On 6/8/21 1:36 AM, baris.ka...@oracle.com wrote:
> Hi,
>
> I use the IndexSearcher.search API with two parameters, a Query and an int number (which I set to 20).
>
> However, when I look at the TopDocs object that this API call returns, I see thousands of hits in totalHits. Is this inaccurate, or is Lucene actually doing the search based on that many results?
>
> But when I iterate over the scoreDocs array of that result, I get the int number of hits (i.e., 20 hits).
>
> I am trying to find out why org.apache.lucene.search.TopDocs.totalHits reports a larger number of collected results than the actual number of results. I see on the order of a couple of thousand vs. 20.
>
> Best regards
An interesting case
Hi,

I use the IndexSearcher.search API with two parameters, a Query and an int number (which I set to 20).

However, when I look at the TopDocs object that this API call returns, I see thousands of hits in totalHits. Is this inaccurate, or is Lucene actually doing the search based on that many results?

But when I iterate over the scoreDocs array of that result, I get the int number of hits (i.e., 20 hits).

I am trying to find out why org.apache.lucene.search.TopDocs.totalHits reports a larger number of collected results than the actual number of results. I see on the order of a couple of thousand vs. 20.

Best regards