Vlad,

Please check the published papers on sampling inverted indexes and multi-level caching; this is most probably what Google and other major search engines use.

You can see a simple implementation of this principle in Nutch. The index is sorted in decreasing order by a PageRank-like score (the logic for this is in IndexSorter.java). Then, when running a query, we collect only the top-N results and extrapolate the total number of hits over the whole collection, assuming a certain model of term distribution (see LuceneQueryOptimizer.LimitedCollector).
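
To make the idea concrete, here is a minimal sketch in plain Java. It is not the Nutch code: the class LimitedScanSketch, its search method and the boolean "matches" array are hypothetical stand-ins for a posting list walked in score-sorted order. I also assume the simplest possible distribution model, namely that matching documents are spread uniformly over the index, so the hit density seen in the scanned prefix is scaled up to the whole collection.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: early-terminated scan over a score-sorted index. */
public class LimitedScanSketch {

    /** Hits collected before the cutoff, plus an extrapolated total. */
    public static final class Result {
        public final List<Integer> topDocs;
        public final long estimatedTotal;

        Result(List<Integer> topDocs, long estimatedTotal) {
            this.topDocs = topDocs;
            this.estimatedTotal = estimatedTotal;
        }
    }

    /**
     * matches[i] says whether document i matches the query. Documents are
     * assumed to be numbered in decreasing score order (best document first),
     * so the first maxHits matches are also the best-scoring ones.
     */
    public static Result search(boolean[] matches, int maxHits) {
        List<Integer> top = new ArrayList<>();
        int scanned = 0;
        // Stop as soon as enough hits are collected; the rest is never read.
        while (scanned < matches.length && top.size() < maxHits) {
            if (matches[scanned]) {
                top.add(scanned);
            }
            scanned++;
        }

        long estimate;
        if (scanned == matches.length || scanned == 0) {
            // Whole index scanned (or nothing scanned): the count is exact.
            estimate = top.size();
        } else {
            // Assumption: uniform hit density, so scale the prefix count
            // by (collection size / number of documents scanned).
            estimate = Math.round((double) top.size() * matches.length / scanned);
        }
        return new Result(top, estimate);
    }

    public static void main(String[] args) {
        boolean[] matches = new boolean[1000];
        for (int i = 0; i < matches.length; i += 4) {
            matches[i] = true; // every 4th document matches
        }
        Result r = search(matches, 10);
        System.out.println(r.topDocs + " estimated total: " + r.estimatedTotal);
    }
}
```

With this toy data the scan stops after 37 documents and reports an estimate that is close to, but not equal to, the true count of 250. That is the trade-off: you read a small prefix of the index and accept an approximate total. A real index would need a better model, because sorting by score can correlate with which terms appear near the top.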

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


