Vlad,
Please check published papers on sampling inverted indexes and
multi-level caching - this is most probably what Google and other major
search engines use.
You can see a simple implementation of this principle in Nutch: the
index is sorted in decreasing order by a PageRank-like score (the logic
for this is in IndexSorter.java). When running a query we then collect
only the top-N results and extrapolate the total hit count over the
whole collection, assuming a certain model of term distributions
(see LuceneQueryOptimizer.LimitedCollector).
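
To make the idea concrete, here is a rough sketch of early termination
over a score-sorted index. It is not the Nutch code: the class and
method names (LimitedSearchSketch, search) are made up for illustration,
and the extrapolation assumes the simplest possible model, namely that
matching documents are spread uniformly over the index. A real
implementation would likely use a more careful distribution model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Nutch's LimitedCollector.
public class LimitedSearchSketch {

    /** Hits collected before stopping, plus an estimated total hit count. */
    static final class Result {
        final List<Integer> hits;
        final long estimatedTotal;

        Result(List<Integer> hits, long estimatedTotal) {
            this.hits = hits;
            this.estimatedTotal = estimatedTotal;
        }
    }

    /**
     * @param postings  ascending doc ids that match the query; doc ids are
     *                  assumed to be assigned in decreasing static score,
     *                  so doc 0 is the highest-ranked document
     * @param topN      stop after collecting this many hits
     * @param totalDocs number of documents in the whole index
     */
    static Result search(int[] postings, int topN, int totalDocs) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        while (i < postings.length && hits.size() < topN) {
            hits.add(postings[i]);
            i++;
        }
        if (i == postings.length) {
            // The whole posting list was consumed, so the count is exact.
            return new Result(hits, hits.size());
        }
        // We stopped early: only docs 0..lastDoc were examined.
        long scanned = hits.get(hits.size() - 1) + 1L;
        // Assumption: uniform density of matches across the index.
        long estimate = Math.round((double) hits.size() * totalDocs / scanned);
        return new Result(hits, estimate);
    }

    public static void main(String[] args) {
        int[] postings = {1, 4, 9, 12, 20, 33, 47, 58, 71, 90};
        Result r = search(postings, 3, 100);
        System.out.println(r.hits + " estimated total: " + r.estimatedTotal);
    }
}
```

Because the index is sorted by static score, the first N matches are
also the best-ranked ones by that score, which is why stopping early
is acceptable. The estimate can be well off when matches cluster near
the top of the index, as they do in the sample data above.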
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com