[EMAIL PROTECTED] wrote:
> smart and clear interface to propose at present. But I guess the interface
> will slowly appear if we start to separate the indexer and the crawler.
>  If we cannot imagine an interface between the crawler and the indexer
> regarding the link information at present, we can have the indexer and
> crawler tightly bound for that purpose.
>  I'll be more verbose on the interface after reading the existing code.

Yes, this is probably true. As for the existing crawler/indexer code,
IMHO, it's a bit crufty in places. Thus the limitation that we cycle
through the servers one by one, the strange split between Document and
Retriever, and the problems with hop_counts. (Hop_counts get rather
complex in a multi-server situation unless we do a level order traversal
by hop_count.)
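A level-order traversal by hop_count is just a breadth-first crawl: every URL gets visited at the smallest hop_count it can be reached at, which stays consistent even when links cross servers. A minimal sketch (not htdig code; `fetch_links` is a hypothetical callback returning the out-links of a page):

```python
from collections import deque

def crawl_by_hop_count(start_urls, fetch_links, max_hops):
    """Breadth-first crawl: URLs are dequeued in hop_count order,
    so each URL is assigned the minimum hop_count that reaches it."""
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    order = []
    while queue:
        url, hops = queue.popleft()
        order.append((url, hops))
        if hops >= max_hops:
            continue  # don't follow links past the hop limit
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, hops + 1))
    return order
```

With a per-server depth-first crawl (the current cycle-through-servers scheme), a URL reachable at hop 1 from one server can first be found at hop 5 from another, which is where the complexity Geoff mentions comes from.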

>  [ in : normalized syntax tree
>    -> Query transform { Fuzzy, Thesaurus } ( may be Fuzzy + Thesaurus )
>    out: normalized syntax tree ]

Unless I misunderstand you, Thesaurus = "Synonym Fuzzy" as it presently
exists.

>  [ in : normalized syntax tree
>    -> Query solver (relevance ranked and sorted)
>    out : list of document identifiers ]
> 
>  [ in : list of document identifiers
>    -> Document retrieval (extract document information)
>    out : list of document information ]

>  I'm not sure where the document filtering should go. Since my main concern

Filtering, by necessity, will have to go *after* document retrieval is
performed, which is a great complication. It's impossible to restrict by
date w/o retrieving the date stamps from the document db. But if you
filter afterwards, the "query solver" doesn't know how many to return
such that the query will fulfill the user's request for, e.g. the top
10.
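The problem can be sketched like this (hypothetical names; `get_date` stands in for a document-db lookup): the solver yields candidates best-first, but the date filter can only run after each candidate's date stamp has been retrieved, so there is no way to know up front how many candidates must be pulled to fill the user's N slots.

```python
def top_n_with_filter(ranked_doc_ids, get_date, cutoff, n):
    """Post-retrieval filtering sketch: walk the ranked candidate
    stream, retrieve each document's date stamp, and keep only those
    passing the cutoff until n results are collected. The number of
    candidates consumed is unbounded in the worst case."""
    results = []
    for doc_id in ranked_doc_ids:
        if get_date(doc_id) >= cutoff:  # filter needs retrieved data
            results.append(doc_id)
            if len(results) == n:
                break
    return results
```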

>            2) The resolution of the query must not return a list of all
>               possible documents matching the query. It must perform the
>               relevance ranking and sorting while searching the index and
>               only return the first N documents.

This is something I already have outlined in my head. The problem is
that the "sorting" depends on what we're sorting. :-) However, given a
list of *possible* documents, the "sorting" occurs by a min-heap of size
N. Basically, you check the next document in the list, if it's bigger
than the smallest so far, you put it in the heap and drop off the
smallest.

The problem is defining 'N' given filtering as above. :-(
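The min-heap scheme above can be sketched in a few lines (illustrative only, not htdig code). The heap holds the N best documents seen so far, with the smallest of them at the root, so each new candidate costs at most one O(log N) heap operation:

```python
import heapq

def top_n(doc_scores, n):
    """Select the n highest-scoring documents from an iterable of
    (score, doc_id) pairs using a min-heap of size n, without
    sorting the whole candidate list."""
    heap = []  # min-heap; heap[0] is the smallest score kept so far
    for score, doc_id in doc_scores:
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # candidate beats the smallest kept so far: swap it in
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)  # best-first
```

This is O(M log N) for M candidates rather than O(M log M) for a full sort, which is the point when N is 10 and M is the whole posting list.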

>   These restrictions are essential for a large scale index that is
> used under high stress. The prerequisite is quite clear: the index
> must be organized to allow this (document lists for each word must be
> sorted to allow ranking/sorting without reading the whole document list). If it
> is not, the whole list must be read.

Strangely enough, this isn't quite true. Going into the details would
require a rather lengthy explanation, but anyone keen on the details
should pick up a copy of the 2nd edition of _Managing Gigabytes_ by
Witten, Moffat, and Bell. I'm finding it well worth the investment.

>   This seems a bit idealistic but is not. All full text indexes have a
> defined usage pattern (95% of the queries on internet search engines only

I think this is a bit separate. We probably want to *cache* frequent
queries. For example, something like 90% of the queries on AltaVista are
from a list of about 1000 queries. So they just spit off a pre-computed
answer from the cache. IMHO, that doesn't mean we should necessarily be
storing wordlists in some ranked order when that ranking may change.
AFAIK, AltaVista doesn't rank by date, but we offer it (admittedly at a
performance penalty).
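A query cache of that sort is straightforward; a minimal LRU sketch (hypothetical, keyed on the normalized query string, not anything in htdig today):

```python
from collections import OrderedDict

class QueryCache:
    """Tiny LRU cache for frequent queries. Assumes the key is a
    normalized query string and the value is the precomputed
    result list; capacity ~1000 matches the AltaVista anecdote."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, query):
        if query not in self._cache:
            return None
        self._cache.move_to_end(query)  # mark as recently used
        return self._cache[query]

    def put(self, query, results):
        self._cache[query] = results
        self._cache.move_to_end(query)
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
```

The nice property is that caching is orthogonal to index layout: frequent queries get their precomputed answer, while the wordlists themselves stay in whatever order the indexer needs, so rankings that change (e.g. by date) don't invalidate the on-disk organization.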

-Geoff
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the SUBJECT of the message.
