>>- get the top 1000 results WITHOUT executing query across whole data set

(Apologies if this is telling you something you're already fully aware of.) Counting matches doesn't involve scanning the text of all the docs, so it may be less expensive than you think for a single index. The index very quickly looks up and ranks only the docs containing your search terms, so a total match count is not an expensive by-product of this operation - see the description of inverted indexes for more details: http://en.wikipedia.org/wiki/Inverted_index
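To make that concrete, here's a toy inverted index in Java - purely an illustrative sketch of the data structure, nothing here is Lucene's actual API:

import java.util.*;

// Toy inverted index (an illustrative sketch, NOT Lucene's real classes).
public class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Docs containing ALL query terms; the total hit count is just the
    // size of the intersected posting lists - no document text is scanned.
    public SortedSet<Integer> search(String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> docs =
                postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.add(1, "lucene inverted index search");
        idx.add(2, "distributed search at scale");
        idx.add(3, "lucene search quality");
        SortedSet<Integer> hits = idx.search("lucene", "search");
        System.out.println("total matches: " + hits.size() + " -> " + hits);
    }
}

Note that search() only ever touches the posting lists for "lucene" and "search" - the count of matching docs falls out of the lookup for free.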

If you're aware of all that and are considering larger-scale problems (billions of docs) where multiple machines/indexes must be queried in parallel, things are more complex. The cost of combining result scores from multiple machines is typically why you can't page beyond 1000 results. Some of these large distributed architectures divide content into popular/recent content and older/less popular content. Approximations of the total number of matching docs are calculated from queries executed solely on the popular subset; only queries with insufficient matches in the popular content resort to querying the older stuff.
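Here's a rough sketch of that tiered fallback in Java. The Tier interface, the tier names, and the minHits threshold are all assumptions for illustration, not any particular engine's design:

import java.util.*;

public class TieredSearch {
    // A tier is anything that can answer a query with up to 'limit' doc IDs.
    interface Tier { List<Integer> search(String query, int limit); }

    private final Tier popular;   // small index of popular/recent content
    private final Tier archive;   // much larger index of older content
    private final int minHits;    // fallback threshold (an assumed tunable)

    TieredSearch(Tier popular, Tier archive, int minHits) {
        this.popular = popular;
        this.archive = archive;
        this.minHits = minHits;
    }

    List<Integer> search(String query, int limit) {
        List<Integer> hits = popular.search(query, limit);
        if (hits.size() >= minHits) {
            // Enough matches in the popular tier; an approximate total
            // can be extrapolated from this tier alone.
            return hits;
        }
        // Too few popular matches - resort to the bigger, slower tier.
        List<Integer> merged = new ArrayList<>(hits);
        merged.addAll(archive.search(query, limit - hits.size()));
        return merged;
    }

    public static void main(String[] args) {
        Tier popular = (q, n) -> Arrays.asList(1, 2);    // stubbed results
        Tier archive = (q, n) -> Arrays.asList(7, 8, 9); // stubbed results
        TieredSearch searcher = new TieredSearch(popular, archive, 3);
        System.out.println(searcher.search("lucene", 5)); // [1, 2, 7, 8, 9]
    }
}

The design point is that the expensive archive tier is only consulted when the cheap tier can't satisfy the query, so most queries never pay the full distributed cost.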

Cheers
Mark


Vladimir Olenin wrote:
Hi.
I couldn't find the answer to this question in the mailing list archive.
In case I missed it, please let me know the keyword phrase I should be
looking for, if not a direct link.
All the 'Lucene'-powered implementations I saw (well, primarily those
utilizing Solr) return the exact count of the number of documents found. It
means that the query is resolved across the whole data set in a precise
fashion. If the number of searched documents is huge (e.g., > 1 billion),
this should present quite a problem. I wonder if that's the default
behaviour of Lucene or rather of the frameworks that utilize it? Is it
possible to:
- get the top 1000 results WITHOUT executing query across whole data set
- in other words, can Lucene:
  - chunk out the top X results via an 'approximate' fast search, which would
return the _approximate_ total number of found documents, similar to
Google's total pages found count
  - and perform a more accurate search within that chunk
Is such functionality built in, or does it have to be customized? If it's
built in, what algorithms are used to 'chunk out' the results and get the
approximate doc count? What classes should I look at?
Thanks!
Vlad

PS: it's pretty much the functionality Google has - you can't get more
than 1000 matches per query (meaning, the count may report even '10M'
documents found, but if you try to browse beyond the first 1000 results,
you'll get an error page).




                