OK, this week I'm going to give a brief overview of what goes on for a
search. This side is obviously very important because fast response is
critical--yet the whole process in htsearch is fairly complex. I'll
briefly touch on the new "collection" support too.

Obviously the start is quite similar to htdig: the CGI must parse its
input, not only from the config files but from the CGI environment as well.
The CGI input is folded into the configuration object, so that features
such as allow_in_form can override the defaults. This also includes forming
the list of config files and the database for the active collection (if any).
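
(Just to make that folding concrete, here's a rough sketch with made-up
names, not the real htsearch classes: a CGI parameter only wins over the
config default if its attribute appears in allow_in_form.)

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    // Hypothetical sketch: fold CGI parameters into the configuration,
    // letting only attributes named in allow_in_form override defaults.
    std::map<std::string, std::string> foldInput(
        std::map<std::string, std::string> config,           // defaults + config file
        const std::map<std::string, std::string>& cgi,       // parsed query string
        const std::set<std::string>& allowInForm)            // allow_in_form list
    {
        for (const auto& [name, value] : cgi)
            if (allowInForm.count(name))
                config[name] = value;                        // CGI wins for allowed attributes
        return config;
    }

    int main() {
        auto merged = foldInput({{"template_name", "builtin-long"}},
                                {{"template_name", "builtin-short"}},
                                {"template_name"});
        std::cout << merged["template_name"] << "\n";        // builtin-short
    }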

Next, the CGI parses the "words" input into a set of WeightWords objects in
htsearch::setupWords. This transforms boolean terms into '&' and '|' and
so on, marking them to be ignored. It also performs some limited cleanup
such as removing spaces and normalizing the word. If the expression is
boolean, it checks the syntax in parser::checkSyntax. Otherwise, it
converts the query into a boolean expression. Finally, it runs through the
original query terms and adds any relevant fuzzy terms at the appropriate
weighting. The weighting is determined by the search_algorithm attribute
and each fuzzy object generates its own alternatives.
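
(To give a feel for what setupWords ends up with, here's a toy sketch with
hypothetical names, not the actual WeightWords or Fuzzy interfaces: each
original term gets weight 1, and each fuzzy algorithm contributes its
alternatives at the weight given for it in search_algorithm, e.g.
"exact:1 endings:0.1".)

    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    // A query term plus the weight its matches contribute to the score.
    struct WeightWord { std::string word; double weight; };

    struct FuzzyAlgorithm {
        double weight;                                        // from search_algorithm
        std::function<std::vector<std::string>(const std::string&)> alternatives;
    };

    std::vector<WeightWord> setupWords(const std::vector<std::string>& terms,
                                       const std::vector<FuzzyAlgorithm>& fuzzies)
    {
        std::vector<WeightWord> out;
        for (const auto& t : terms) {
            out.push_back({t, 1.0});                          // the original term
            for (const auto& f : fuzzies)
                for (const auto& alt : f.alternatives(t))
                    out.push_back({alt, f.weight});           // fuzzy expansion
        }
        return out;
    }

    int main() {
        FuzzyAlgorithm endings{0.1, [](const std::string& w) {
            return std::vector<std::string>{w + "s"};         // toy stand-in for "endings"
        }};
        for (const auto& ww : setupWords({"search"}, {endings}))
            std::cout << ww.word << " " << ww.weight << "\n";
    }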

After this, it has constructed what will become the LOGICAL_WORDS variable.
It still needs to retrieve the documents matching the query, which it does
in parser::parse.  (This is probably a misnomer, but we'll ignore that for
the time being.)  This functions as a simple binary expression parser,
looking up terms from left to right and performing any boolean (or phrase)
operations.  The retrieval itself is fairly naive, grabbing all records
matching a certain word from the appropriate WordDB (iterating through a
collection) and then scoring them on the fly. The word database
essentially uses the word as the key to return a list of HtWordReference
objects for every occurrence of that word in all indexed documents (yes,
this can be a fair amount of memory). The documents are keyed by DocID,
which is why it's useful to have the DocDB keyed by DocID as outlined last
week.
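
(Roughly, the retrieval and scoring amounts to something like the sketch
below, with stand-in names; a real HtWordReference carries more fields than
just a weight.)

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct WordRef { int docID; double weight; };             // stand-in for HtWordReference

    // The word is the key; the value is a reference for every occurrence.
    using WordDB = std::map<std::string, std::vector<WordRef>>;

    std::map<int, double> retrieve(const WordDB& db, const std::string& word,
                                   double termWeight)
    {
        std::map<int, double> scores;                         // DocID -> running score
        auto it = db.find(word);
        if (it != db.end())
            for (const auto& ref : it->second)
                scores[ref.docID] += termWeight * ref.weight; // score on the fly
        return scores;
    }

    int main() {
        WordDB db{{"dig", {{1, 2.0}, {7, 1.0}}}};
        for (const auto& [doc, score] : retrieve(db, "dig", 1.0))
            std::cout << "doc " << doc << " score " << score << "\n";
    }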

Some speedup could come from more sophisticated retrieval and scoring
mechanisms.  Certainly performing all boolean operations at once
(rather than pairwise) could speed up multiple term queries. Limiting large
searches could also help, along the lines of what is described in
_Managing Gigabytes_, using either a frequency-sorted or a uniform
distribution of words. Result or score caches would obviously also help.
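
(For instance, and this is only a sketch of the idea, not anything in the
current code: a multi-term AND could be evaluated in one pass starting from
the rarest term, so the intermediate lists stay small.)

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <vector>

    using Postings = std::vector<int>;                        // sorted DocIDs for one term

    Postings intersectAll(std::vector<Postings> lists)
    {
        if (lists.empty()) return {};
        // Start with the shortest (rarest) list so later intersections are cheap.
        std::sort(lists.begin(), lists.end(),
                  [](const Postings& a, const Postings& b) { return a.size() < b.size(); });
        Postings result = lists[0];
        for (size_t i = 1; i < lists.size() && !result.empty(); ++i) {
            Postings next;
            std::set_intersection(result.begin(), result.end(),
                                  lists[i].begin(), lists[i].end(),
                                  std::back_inserter(next));
            result.swap(next);
        }
        return result;
    }

    int main() {
        auto docs = intersectAll({{1, 3, 5, 9}, {3, 9}, {2, 3, 8, 9}});
        for (int d : docs) std::cout << d << " ";             // 3 9
        std::cout << "\n";
    }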

Once parser::parse returns, htsearch has a list of matching, mostly scored
documents. These are in the form of DocMatch objects bound into a
ResultList, created by the parser::score() method.
This list is now passed off to Display (which currently does more than just
display) for further processing.

The Display object starts off by weeding out documents that should not be
included (by exclude or restrict directives) or those that for some reason
don't exist (e.g. the WordDB and the DocDB are out of sync). Final scoring
is done for the date_factor and backlink_factor attributes and then the
list is sorted. Again, the sorting is a bit naive since it sorts the whole
list even though the template will only show a small number of results.
(Remember that most searches only look at the first page!)
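
(For example, just a sketch and not the current code, something like
std::partial_sort would only order the matches_per_page entries that will
actually appear on the page.)

    #include <algorithm>
    #include <iostream>
    #include <vector>

    struct Match { int docID; double score; };

    // Order only the top n matches; the rest can stay unsorted.
    void topN(std::vector<Match>& matches, size_t n)
    {
        n = std::min(n, matches.size());
        std::partial_sort(matches.begin(), matches.begin() + n, matches.end(),
                          [](const Match& a, const Match& b) { return a.score > b.score; });
    }

    int main() {
        std::vector<Match> m{{1, 0.2}, {2, 0.9}, {3, 0.5}, {4, 0.7}};
        topN(m, 2);                                           // only the first two are ordered
        std::cout << m[0].docID << " " << m[1].docID << "\n"; // 2 4
    }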

Finally, the Display class starts assembling the template, reading in the
header and footer or wrapper files, filling out the template variables
that may be included, etc. (An interesting question is whether it's faster
to derive all the variables or to figure out what variables are needed for
a particular template set-up and only create those. A result caching
scheme might use this sort of information.) I skimmed over one detail:
preparing the actual excerpt variable is a bit more complex, but I'll go
into that another time.
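
(As a sketch of the "only create what's needed" idea, with a hypothetical
helper and assuming the $(NAME) style of template variable: each variable
could get a generator that only runs if the template actually references it.)

    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>

    // Derive only the variables the template text actually mentions.
    std::map<std::string, std::string> neededVariables(
        const std::string& templateText,
        const std::map<std::string, std::function<std::string()>>& generators)
    {
        std::map<std::string, std::string> vars;
        for (const auto& [name, make] : generators)
            if (templateText.find("$(" + name + ")") != std::string::npos)
                vars[name] = make();                          // derive only what is referenced
        return vars;
    }

    int main() {
        auto vars = neededVariables("Found $(MATCHES) documents",
                                    {{"MATCHES",    [] { return std::string("42"); }},
                                     {"PAGEHEADER", [] { return std::string("..."); }}});
        std::cout << vars.count("PAGEHEADER") << " " << vars["MATCHES"] << "\n"; // 0 42
    }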
