Byron Miller wrote:
On optimizing performance, does anyone know if Google
is exporting its entire dataset as an index or only
somehow indexing the top N% (since they only show the
first 1000 or so results anyway)?

Both. The highest-scoring pages are kept in separate indexes that are searched first. When a query fails to match 1000 or so documents in the high-scoring indexes, the entire dataset is searched. In general there can be multiple levels, e.g., high-scoring, mid-scoring and low-scoring indexes, with the vast majority of pages in the last category and the vast majority of queries resolved by consulting only the first.
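
To make the tiered lookup above concrete, here is a minimal sketch in Python. It is a toy, not Google's or Nutch's code: the names build_index and search_tiers, the in-memory term-to-documents maps, and the AND-only query matching are all assumptions made for illustration. Tiers are consulted from highest-scoring to lowest, and the search stops as soon as enough hits have been collected.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy index: maps each term to the set of document ids containing it.
Index = Dict[str, Set[str]]


def build_index(docs: Dict[str, str]) -> Index:
    """Build a toy inverted index from {doc_id: text}."""
    index: Index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def search_tiers(
    tiers: List[Index], scores: Dict[str, float], query: str, wanted: int
) -> Tuple[List[str], int]:
    """Search tiers in order (highest-scoring first) and stop early.

    Returns the top `wanted` doc ids by score and the number of tiers consulted.
    """
    terms = query.lower().split()
    hits: Set[str] = set()
    consulted = 0
    for index in tiers:
        consulted += 1
        postings = [index.get(term, set()) for term in terms]
        if postings:
            # A document matches only if it contains every query term.
            hits |= set.intersection(*postings)
        if len(hits) >= wanted:
            break  # enough results; lower tiers are never touched
    ranked = sorted(hits, key=lambda d: (-scores[d], d))
    return ranked[:wanted], consulted
```

With a two-tier setup, a common query is answered from the first tier alone, while a rare query falls through to the large low-scoring tier. In this sketch "enough" is the parameter wanted, standing in for the "1000 or so" threshold mentioned above.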

What I have implemented so far for Nutch is a single-index version of this. It is a proof of concept: the current index-sorting implementation does not yet scale well to indexes larger than ~50M URLs.

A better long-term approach is to introduce another MapReduce pass that collects Lucene documents (or equivalent) as values, and page scores as keys. Then the indexing MapReduce pass can partition and sort by score before creating indexes. The distributed search code will also need to be modified to search high-score indexes first.
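
The proposed pass can be sketched in plain Python, without any MapReduce framework. This is only an illustration under stated assumptions: map_phase, partition, reduce_phase and the fixed score cutoffs are hypothetical names and choices, not Nutch or Hadoop APIs, and a real job would choose partition boundaries from the actual score distribution. The map step emits (score, document) pairs, the partitioner routes each pair to a score band, and each band is sorted by descending score before it would be written out as its own index.

```python
from typing import Dict, List, Tuple

Doc = Dict[str, str]
Pair = Tuple[float, Doc]


def map_phase(pages: List[Tuple[str, float, str]]) -> List[Pair]:
    """Emit (score, document) pairs: score is the key, the document the value."""
    return [(score, {"url": url, "text": text}) for url, score, text in pages]


def partition(score: float, cutoffs: List[float]) -> int:
    """Pick a partition for a score; cutoffs are in descending order.

    With cutoffs [0.8, 0.4]: score >= 0.8 -> 0, score >= 0.4 -> 1, else 2.
    """
    for i, cutoff in enumerate(cutoffs):
        if score >= cutoff:
            return i
    return len(cutoffs)


def reduce_phase(pairs: List[Pair], cutoffs: List[float]) -> List[List[Pair]]:
    """Group pairs by partition and sort each partition by descending score."""
    parts: List[List[Pair]] = [[] for _ in range(len(cutoffs) + 1)]
    for score, doc in pairs:
        parts[partition(score, cutoffs)].append((score, doc))
    for part in parts:
        part.sort(key=lambda pair: -pair[0])
    return parts  # parts[0] would become the high-score index, and so on
```

Partition 0 then corresponds to the high-score index that the distributed search code would consult first.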

Doug
