Thanks, Mark, that clears things up a bit. No need to apologise - I am
quite a novice with Lucene.

To explain my concern a bit, assume that your inverted index is queried
with an 'OR' query for the most 'common' terms (i.e., after excluding
such common denominators as 'a', 'the', etc.). Let's say you have the
following terms:

- 'work': occurs in 200M documents
- 'java': occurs in 100M documents
- '.net': occurs in 100M documents

Now, if I run the query 'work OR java OR .net', the total result set
should be somewhere between 200M and 400M documents, right? But to get
the exact number you actually need to compute the union of ALL the
document IDs, which means iterating over roughly 400M postings. For
more complex queries the cost would be higher still. The sorting step
should be quite expensive across the 'whole' data set as well. An
intersection (AND) should be cheaper, because each step eliminates some
number of documents. In the implementations I saw/did in the past, the
inability to build this kind of 'approximation' and chunk out the 'most
significant' results was the main limiting factor of all the
algorithms.
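
To make that cost concrete, here is a minimal sketch (plain Java with
made-up, toy-sized posting lists standing in for the 200M/100M/100M
ones above - not Lucene's actual data structures):

    import java.util.*;

    public class UnionCount {
        public static void main(String[] args) {
            // Toy inverted index: term -> sorted doc IDs containing it.
            Map<String, int[]> postings = new LinkedHashMap<>();
            postings.put("work", new int[]{1, 2, 3, 5, 8});
            postings.put("java", new int[]{2, 3, 9});
            postings.put(".net", new int[]{3, 7, 9});

            // Exact count for 'work OR java OR .net': every posting of
            // every term is visited once, so the cost is proportional to
            // the SUM of the list lengths (5 + 3 + 3 = 11 here, ~400M in
            // the example above), regardless of how much they overlap.
            Set<Integer> union = new HashSet<>();
            int postingsVisited = 0;
            for (int[] list : postings.values()) {
                for (int docId : list) {
                    union.add(docId);
                    postingsVisited++;
                }
            }
            System.out.println("matching docs: " + union.size()
                    + ", postings visited: " + postingsVisited);
        }
    }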

Thanks!

Vlad

PS: by 'computationally expensive' I mean 'scalability' as well - all
these operations are very CPU-intensive, so if one query returns within
1-2 seconds while taking up 100% of the CPU for that time, then for 20
concurrent users the response time for the 'most unlucky' one would be
somewhere between 20 and 40 seconds.
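
To spell out that arithmetic (a toy calculation assuming a single CPU
core and perfectly serialized queries - real servers with more cores or
idle time would do better):

    public class WorstCaseLatency {
        public static void main(String[] args) {
            int concurrentUsers = 20;
            double minCpuSeconds = 1.0; // per-query CPU cost, best case
            double maxCpuSeconds = 2.0; // per-query CPU cost, worst case

            // One core at 100% utilisation serializes the work, so the
            // last user in line waits for every query ahead of theirs.
            System.out.printf("unluckiest user waits %.0f-%.0f seconds%n",
                    concurrentUsers * minCpuSeconds,
                    concurrentUsers * maxCpuSeconds);
        }
    }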
 

-----Original Message-----
From: markharw00d [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 26, 2006 6:35 PM
To: java-user@lucene.apache.org
Subject: Re: how to get results without getting total number of found
documents?

>> - get the top 1000 results WITHOUT executing query across whole
>> data set

(Apologies if this is telling you something you are already fully aware
of.) Counting matches doesn't involve scanning the text of all the
docs, so it may be less expensive than you think for a single index.
Lucene very quickly looks up and ranks only the docs containing your
search terms, so a total match count is not an expensive by-product of
this operation - see a description of inverted indexes for more
details:
http://en.wikipedia.org/wiki/Inverted_index
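
For completeness, roughly what that looks like through the Lucene
2.x-era API - the index path and field name below are invented for
illustration; check the javadocs of your version for the exact
signatures:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class CountDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical index path and field name.
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            QueryParser parser = new QueryParser("body", new StandardAnalyzer());
            Query query = parser.parse("work OR java OR .net");

            // The total hit count falls out of walking the posting lists
            // of the query terms; the stored text is never scanned.
            Hits hits = searcher.search(query);
            System.out.println("total matches: " + hits.length());
            searcher.close();
        }
    }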

If you're aware of all that and are considering larger-scale problems
(billions of docs) where multiple machines/indexes must be queried in
parallel, things are more complex. The cost of combining result scores
from multiple machines is typically why you can't page beyond 1000
results. Some of these large distributed architectures divide content
into popular/recent content and older/less popular content.
Approximations for the total number of matching docs are calculated
from queries executed solely on the subset of popular stuff. Only
queries with insufficient matches in the popular content resort to
querying the older stuff.
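
Purely to illustrate that tiering idea, here's a sketch with invented
interfaces (an architectural pattern, not code from any particular
engine):

    import java.util.ArrayList;
    import java.util.List;

    public class TieredSearch {
        /** Stand-in for one index/shard holding a slice of the collection. */
        interface Tier {
            List<Integer> topDocs(String query, int n); // best n matching doc IDs
        }

        private final Tier popular; // small, hot tier: recent/popular docs
        private final Tier archive; // large, cold tier: everything else

        TieredSearch(Tier popular, Tier archive) {
            this.popular = popular;
            this.archive = archive;
        }

        List<Integer> search(String query, int wanted) {
            // Serve from the cheap popular tier first; an approximate
            // grand total can be extrapolated from its results alone.
            List<Integer> hits = new ArrayList<>(popular.topDocs(query, wanted));
            if (hits.size() < wanted) {
                // Only queries with too few popular matches pay for a
                // pass over the much larger archive tier.
                hits.addAll(archive.topDocs(query, wanted - hits.size()));
            }
            return hits;
        }
    }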

Cheers
Mark


Vladimir Olenin wrote:
> Hi.
>  
> I couldn't find the answer to this question in the mailing list
> archive. In case I missed it, please let me know the keyword phrase I
> should be looking for, if not a direct link.
>  
> All the 'Lucene'-powered implementations I saw (well, primarily those
> utilizing Solr) return an exact count of the number of documents
> found. It means that the query is resolved across the whole data set
> in a precise fashion. If the number of searched documents is huge
> (e.g., > 1 billion), this should present quite a problem. I wonder if
> that's the default behaviour of Lucene or rather of the frameworks
> that utilize it? Is it possible to:
>  
> - get the top 1000 results WITHOUT executing the query across the
> whole data set
> - in other words, can Lucene:
>   - chunk out the top X results by an 'approximate' fast search,
> which will return an _approximate_ total number of found documents,
> similar to Google's total-pages-found count
>   - and perform a more accurate search within that chunk
>  
> Is such functionality built in, or does it have to be customized? If
> it's built-in, what algorithms are used to 'chunk out' the results
> and get an approximate docs count? What classes should I look at?
>  
> Thanks!
>  
> Vlad
>  
> PS: it's pretty much the functionality Google has - you can't get
> more than 1000 matches per query (meaning, it can report even '10M'
> documents found, but if you try to browse beyond the first 1000
> results, you'll get an error page).
>
>   



                

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

