[htdig3-dev] Architecture Overview: htsearch parsing revisited

Geoff Hutchison Wed, 29 Mar 2000 21:07:38 -0800
Well, I've put this off for a while, especially as I try to wrap up 
some of the loose ends for 3.2.0b2. But since I can't connect to the 
PPP server (the line's busy), I have an excuse to write offline. :-)

Let's take a look at the code in htsearch. To be perfectly honest, 
the division of labor isn't exactly fair. Most of it is done in three 
files: htsearch.cc, parser.cc and Display.cc. Since the first isn't a 
class, it should give up some of its code and maybe we can split the 
others into a few additional classes. Please forgive me, the article 
is going to be less of an overview of the current code and more of a 
loose proposal of where htsearch should (?) go.

For the time being, we'll stick to the idea of one query per process. 
To extend the current model of three methods (and/all, or/any, 
boolean), I agree with a suggestion and propose a fourth: exact (i.e. 
the whole query is treated as a phrase search). Now however the query 
is formatted, we will need to do some transforming and parsing. I 
think the current process of turning everything into a boolean 
expression is the right way. So I propose instead of the "" operators 
used now (which I consider a kudge that I wrote) we add a "near" 
operator with a parameter defining how close the words must be to 
match.

Some examples:
Method: And, Query: For score and seven => ((For & score) & and) & seven
Method: Or, Query: "Geoff Hutchison" code => (Geoff near[1] Hutchison) or code
Method: Exact, Query: search engine codebase =>
                 (search near[1] engine) near[1] codebase
Method: Boolean, Query: (Gilles near Geoff) and ht://Dig => [same]

I'll also suggest that the transforming will allow a unary not 
operator that functions transforms as follows:
not Microsoft => * not Microsoft

So in our parsed, transformed expression, we'll have a few 
"operators," namely (, ), &, |, !, *, and ~ (the last three being 
NOT, ALL, and NEAR respectively).

Now each word token at a minimum also needs to keep track of its 
"fuzzy factor," and if necessary the "field" to search. (We can treat 
this simply as a mask for the flags.)

What I've just described is IMHO some requirements for a Parser 
class--it transforms the query into an expression tree. Then a 
Searcher class would walk the tree, iterating over a Collection to 
return a ResultList. IMHO, the scoring, sorting and "winnowing" 
(removing deleted documents, those not matching restrict or exclude 
clauses, etc.) should be done inside this ResultList class, with some 
of the work like initial scoring being done by the Searcher class.

This would leave the work of filling in the templates themselves to 
the Display class, which seems fitting. That's not to say there isn't 
a lot of work to be done for this, what with hilighting, finding 
anchors, SGML and URL encoding, etc.

You'll note that I've been rather hazy on some important issues. 
IMHO, it's very tricky to walk the expression tree optimally. After 
all, you would rather not waste the memory on having the results of 
all the searches at one time. However, if you do the searches 
pairwise, you waste time against comparing the lists all at once. But 
that's something we can talk about later.

-Geoff

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] 
You will receive a message to confirm this.
[htdig3-dev] Architecture Overview: htsearch parsing revisited

Reply via email to