Re: IndexSorter optimizer

Doug Cutting Wed, 04 Jan 2006 09:36:39 -0800

Byron Miller wrote:

On optimizing performance, does anyone know if google
is exporting its entire dataset as an index or only
somehow indexing the topN % (since they only show the
first 1000 or so results anyway)

Both. The highest-scoring pages are kept in separate indexes that are searched first. When a query fails to match 1000 or so documents in the high-scoring indexes then the entire dataset is searched. In general there can be multiple levels, e.g.: high-scoring, mid-scoring and low-scoring indexes, with the vast majority of pages in the last category, and the vast majority of queries resolved consulting only the first category.
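To make the fall-through concrete, here is a rough sketch of tiered lookup. It is only an illustration, not Google's or Nutch's code: the in-memory `Index` type and the names `search_index` and `tiered_search` are made up for this example, and it assumes each document lives in exactly one tier.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for an index: document id -> set of terms it contains.
Index = Dict[str, Set[str]]


def search_index(index: Index, query: Set[str]) -> List[str]:
    """Return the ids of documents that contain every query term."""
    return sorted(doc_id for doc_id, terms in index.items() if query <= terms)


def tiered_search(tiers: List[Index], query: Set[str], wanted: int = 1000) -> List[str]:
    """Search tiers from highest-scoring to lowest-scoring.

    Lower tiers are consulted only while fewer than `wanted` hits
    have been found. Assumes tiers are disjoint (an assumption of
    this sketch), so no de-duplication is done.
    """
    hits: List[str] = []
    for index in tiers:
        hits.extend(search_index(index, query))
        if len(hits) >= wanted:
            break  # enough results; skip the larger, lower-scoring tiers
    return hits[:wanted]


# Two tiers: a small high-scoring index and a low-scoring one.
tiers = [
    {"a": {"nutch", "search"}, "b": {"nutch"}},
    {"c": {"nutch"}, "d": {"search"}},
]
top_two = tiered_search(tiers, {"nutch"}, wanted=2)    # first tier is enough
top_three = tiered_search(tiers, {"nutch"}, wanted=3)  # falls through to tier two
```

With most queries satisfied by the first tier, the large low-scoring index is touched only for rare queries.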

What I have implemented so far for Nutch is a single-index version of this. The current index-sorting implementation does not yet scale well to indexes larger than ~50M urls. It is a proof-of-concept.

A better long-term approach is to introduce another MapReduce pass that collects Lucene documents (or equivalent) as values, and page scores as keys. Then the indexing MapReduce pass can partition and sort by score before creating indexes. The distributed search code will also need to be modified to search high-score indexes first.
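The partition-and-sort step might look roughly like the sketch below. It is plain Python rather than actual MapReduce or Nutch code, and the `Record` type, the `partition_by_score` name and the score boundaries are all assumptions made up for illustration. Records are (score, document) pairs, matching the idea of scores as keys and documents as values.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical record: (page score, document id). In the real pass the
# value would be a Lucene document (or equivalent) rather than an id.
Record = Tuple[float, str]


def partition_by_score(records: List[Record], boundaries: List[float]) -> List[List[Record]]:
    """Split records into score bands and sort each band by score, descending.

    `boundaries` must be ascending. Partition 0 receives the highest
    scores, the last partition the lowest, so partition 0 would become
    the index that is searched first.
    """
    parts: List[List[Record]] = [[] for _ in range(len(boundaries) + 1)]
    for score, doc in records:
        # Number of boundaries at or below this score; more means a higher band.
        rank = bisect_right(boundaries, score)
        parts[len(boundaries) - rank].append((score, doc))
    for part in parts:
        part.sort(key=lambda record: record[0], reverse=True)
    return parts


records = [(0.2, "c"), (0.9, "a"), (0.5, "b"), (0.95, "d"), (0.1, "e")]
high, mid, low = partition_by_score(records, boundaries=[0.3, 0.8])
```

Each resulting partition would then be written out as its own index, with the search side consulting the high-score partition first.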


Doug
