Doug Cutting wrote:

Byron Miller wrote:

On optimizing performance, does anyone know whether Google
indexes its entire dataset or only somehow indexes the
top N% (since they only show the first 1000 or so
results anyway)?


Both. The highest-scoring pages are kept in separate indexes that are searched first. When a query matches fewer than 1000 or so documents in the high-scoring indexes, the entire dataset is searched. In general there can be multiple levels, e.g. high-, mid- and low-scoring indexes, with the vast majority of pages in the last category and the vast majority of queries resolved by consulting only the first.
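The fall-through scheme described above can be sketched roughly like this (illustrative Python, not Nutch code; the ListIndex class and search signature are stand-ins for a real index):

```python
WANTED = 1000  # typical cut-off: only the first ~1000 results are ever shown

class ListIndex:
    """Stand-in for a real index: documents pre-sorted by score."""
    def __init__(self, docs):
        self.docs = docs

    def search(self, query, limit):
        # A real index would match `query`; here we just return top docs.
        return self.docs[:limit]

def tiered_search(query, tiers, wanted=WANTED):
    """Search tiers in order (high-, mid-, low-scoring); stop as soon as
    the tiers consulted so far yield enough hits."""
    hits = []
    for index in tiers:          # highest-scoring tier first
        hits.extend(index.search(query, wanted - len(hits)))
        if len(hits) >= wanted:  # enough hits: skip the remaining tiers
            break
    return hits
```

Most queries would be answered from the first tier alone, which is what makes the scheme pay off.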

What I have implemented so far for Nutch is a single-index version of this. The current index-sorting implementation is a proof of concept and does not yet scale well to indexes larger than ~50M URLs.

A better long-term approach is to introduce another MapReduce pass that collects Lucene documents (or equivalent) as values, and page scores as keys. Then the indexing MapReduce pass can partition and sort by score before creating indexes. The distributed search code will also need to be modified to search high-score indexes first.
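A toy sketch of that extra pass (illustrative Python, not the proposed Nutch code; tier boundaries and the dict-based "documents" are assumptions): emit (score, document) pairs, partition by score range, and sort each partition by descending score before index creation.

```python
def map_phase(documents):
    # key = page score, value = the document (here just a dict)
    for doc in documents:
        yield doc["score"], doc

def partition(score, boundaries):
    # tier 0 gets the highest scores; boundaries are assumed thresholds
    for tier, threshold in enumerate(boundaries):
        if score >= threshold:
            return tier
    return len(boundaries)

def score_sorted_tiers(documents, boundaries=(0.8, 0.3)):
    tiers = [[] for _ in range(len(boundaries) + 1)]
    for score, doc in map_phase(documents):
        tiers[partition(score, boundaries)].append((score, doc))
    # "reduce": each tier sorted by descending score, ready for indexing
    return [[doc for _, doc in sorted(t, key=lambda p: -p[0])]
            for t in tiers]
```

In the real MapReduce setting the partition function would route pairs to reducers and the framework would do the sorting; the sketch only shows the data flow.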


The WWW2005 conference (http://www2005.org) featured a couple of interesting papers on the subject, among them:

1. http://www2005.org/cdrom/docs/p235.pdf
2. http://www2005.org/cdrom/docs/p245.pdf
3. http://www2005.org/cdrom/docs/p257.pdf

The techniques described in the first paper are not too difficult to implement, especially Carmel's method of index pruning, which gives satisfactory results at moderate cost.
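The core of the term-based pruning idea can be sketched as follows (a hedged sketch, not Carmel et al.'s actual implementation; k and eps are illustrative parameters): for each term, keep only postings whose score is within a fraction eps of the k-th best score for that term.

```python
def prune_index(postings, k=10, eps=0.7):
    """postings: {term: [(doc_id, score), ...]} -> pruned copy."""
    pruned = {}
    for term, plist in postings.items():
        scores = sorted((s for _, s in plist), reverse=True)
        if len(scores) <= k:
            pruned[term] = list(plist)   # short lists are kept whole
            continue
        threshold = eps * scores[k - 1]  # eps times the k-th highest score
        pruned[term] = [(d, s) for d, s in plist if s >= threshold]
    return pruned
```

Raising eps prunes more aggressively; the paper's results suggest a good trade-off between index size and result quality is reachable this way.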

The third paper, by Long & Suel, presents the concept of caching intersections of posting lists for multi-term queries, which we already use in a limited form with CachingFilters; they propose storing the intersections on disk instead of limiting the cache to a relatively small number of filters kept in RAM.
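A rough sketch of such an on-disk intersection cache (in the spirit of Long & Suel, but everything here is illustrative: the key format, the stdlib shelve store, and the absence of any eviction policy are my assumptions, not their design):

```python
import shelve  # stdlib disk-backed dict, a stand-in for a real on-disk store

def cached_intersection(cache, term_a, term_b, postings):
    """Return the doc ids matching both terms, caching the result."""
    key = "|".join(sorted((term_a, term_b)))  # order-insensitive cache key
    if key in cache:
        return cache[key]                     # hit: skip the list merge
    docs = sorted(set(postings[term_a]) & set(postings[term_b]))
    cache[key] = docs                         # persist for later queries
    return docs

# usage sketch:
# with shelve.open("/tmp/intersections") as cache:
#     hits = cached_intersection(cache, "lucene", "nutch", postings)
```

Any mapping with dict-like semantics works as the cache, so the same code runs against an in-memory dict for testing or a shelve file for persistence.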

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
