Unfortunately I have not had anywhere near the time I wanted for 
actual coding over the last few weeks. I'm leaving for three weeks on 
Saturday and I do not expect that I'll have e-mail access during that 
time.

So I uploaded the ParseTree code and a test program as they stand. 
I'd qualify it as a buggy pile of half-started ideas personally. If 
anyone is willing to look through the code and do some cleanup and 
fleshing out, it would be greatly appreciated. Otherwise I will pick 
up after I return and finish it for a 3.2.0b3 release.

I'll quickly outline what it's *supposed* to do. The ideas are 
similar to those outlined by myself and Andrew earlier. The ParseTree 
class is the general boolean parser, with subclasses for individual 
methods. Right now I've implemented AND, OR, and EXACT. Once these are 
better fleshed out, it won't be too hard to implement others. 
Performing a boolean parse is a bit strange--you set up the top node 
and then it performs a bottom-up parse of the query and inherits that 
as its child. The main code can then use the root node for top-down 
queries and so on.

For consistency's sake, I followed two rules. First, a generic 
ParseTree functions essentially as parentheses in a boolean query, 
so a generic ParseTree node will have at most one child. Also, all leaves 
of any type of ParseTree will be generic ParseTree nodes. I could 
have implemented a WordParseTree or something, but this seemed to be 
easy enough. This latter point is to make query caching easier.
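
To make the two rules concrete, here is a rough sketch of the shape I 
have in mind. This is NOT the uploaded code: Adopt, AddOperand, 
Describe and the OperatorParseTree helper class are names made up for 
this illustration only.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch; member names are assumptions, not the uploaded code.
class ParseTree {
public:
    ParseTree() = default;
    explicit ParseTree(std::string word) : word_(std::move(word)) {}
    virtual ~ParseTree() = default;

    // Rule 1: a generic node acts like parentheses and wraps at most one child.
    void Adopt(std::unique_ptr<ParseTree> child) { child_ = std::move(child); }

    // Prints a leaf as its word and a wrapping node as "(child)".
    virtual std::string Describe() const {
        if (child_) return "(" + child_->Describe() + ")";
        return word_;
    }

private:
    std::string word_;
    std::unique_ptr<ParseTree> child_;
};

// Operator nodes hold several operands.
// Rule 2: the leaves handed to AddOperand are plain generic ParseTree nodes.
class OperatorParseTree : public ParseTree {
public:
    explicit OperatorParseTree(std::string op) : op_(std::move(op)) {}

    void AddOperand(std::unique_ptr<ParseTree> operand) {
        operands_.push_back(std::move(operand));
    }

    // Joins the operands with the operator name, e.g. "a AND b".
    std::string Describe() const override {
        std::string out;
        for (std::size_t i = 0; i < operands_.size(); ++i) {
            if (i > 0) out += " " + op_ + " ";
            out += operands_[i]->Describe();
        }
        return out;
    }

private:
    std::string op_;
    std::vector<std::unique_ptr<ParseTree>> operands_;
};

class AndParseTree : public OperatorParseTree {
public:
    AndParseTree() : OperatorParseTree("AND") {}
};

class OrParseTree : public OperatorParseTree {
public:
    OrParseTree() : OperatorParseTree("OR") {}
};
```

The idea is that you create an empty root ParseTree, build the operator 
node bottom-up, and then hand it to the root with Adopt, so the root 
ends up with exactly one child.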

So what needs to be done in order to make this fully functional 
(besides fixing all the bugs)? First off, I use the HtWordToken 
function, though I probably should be using a special function for 
the boolean parse itself that makes sure key characters are never 
ignored and "(" and the like are treated as individual tokens. I 
probably also want a separate tokenizer for the derived classes too 
since this will make AltaVista-style queries easier (e.g. break 
"+word" into "+ word").
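
As a rough idea of the kind of tokenizer I mean (a sketch only, not a 
replacement for HtWordToken): the function name TokenizeBoolean and the 
default set of key characters "()+" are assumptions for illustration.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical tokenizer: key characters are never dropped and always
// come out as their own one-character tokens; whitespace separates words.
std::vector<std::string> TokenizeBoolean(const std::string& query,
                                         const std::string& keyChars = "()+") {
    std::vector<std::string> tokens;
    std::string current;

    // Emit the word collected so far, if any.
    auto flush = [&]() {
        if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    };

    for (char c : query) {
        if (keyChars.find(c) != std::string::npos) {
            flush();
            tokens.push_back(std::string(1, c));  // key char is its own token
        } else if (std::isspace(static_cast<unsigned char>(c))) {
            flush();
        } else {
            current += c;
        }
    }
    flush();
    return tokens;
}
```

With this, "+word" comes out as the two tokens "+" and "word", and a 
derived class could pass its own set of key characters.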

I also haven't spent any time working on actual query-retrieval. 
Ultimately, I'd like to put this code into the WeightWord class 
itself and outside the ParseTree hierarchy. Then an individual 
ParseTree class will have the logic to combine results and cache as 
necessary. The reason for making the nodes all of one type is that a 
query for word1 AND word2 vs. word1 OR word2 will still use cached 
queries.
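
A minimal sketch of that caching idea, assuming the per-word lookup can 
be treated as a function returning a set of document ids. The names 
CachedWordLookup, DocSet, CombineAnd and CombineOr are invented for this 
example and do not exist in WeightWord or anywhere else in the code.

```cpp
#include <algorithm>
#include <functional>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <utility>

using DocSet = std::set<int>;  // assumed stand-in for a list of matching docs

// Hypothetical per-word cache: each word hits the backend at most once.
class CachedWordLookup {
public:
    explicit CachedWordLookup(std::function<DocSet(const std::string&)> backend)
        : backend_(std::move(backend)) {}

    const DocSet& Find(const std::string& word) {
        auto it = cache_.find(word);
        if (it == cache_.end()) {
            ++misses_;  // only count real backend lookups
            it = cache_.emplace(word, backend_(word)).first;
        }
        return it->second;
    }

    int Misses() const { return misses_; }

private:
    std::function<DocSet(const std::string&)> backend_;
    std::map<std::string, DocSet> cache_;
    int misses_ = 0;
};

// AND combines the cached word results by intersection.
DocSet CombineAnd(const DocSet& a, const DocSet& b) {
    DocSet out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}

// OR combines the same cached word results by union.
DocSet CombineOr(const DocSet& a, const DocSet& b) {
    DocSet out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.begin()));
    return out;
}
```

Since the leaves are the same kind of node in both trees, running 
word1 AND word2 followed by word1 OR word2 only needs two backend 
lookups; only the combining step differs.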

As part of putting the query logic into the WeightWord class, I'll 
also have this class be responsible for field-restricted queries. 
Basically it should recognize "field:word" and parse out the field 
and turn it into a bit mask. This puts the field code at the same 
level as the query code, where I think it belongs.
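
Something along these lines is what I'm picturing for the field 
parsing. It's a sketch under assumptions: the field names, the bit 
values, and the names FieldQuery and ParseFieldQuery are all made up 
here and are not the flags the code actually uses.

```cpp
#include <map>
#include <string>

// Hypothetical field bits; the real flag values are not taken from the code.
const unsigned int kFieldTitle    = 1u << 0;
const unsigned int kFieldKeywords = 1u << 1;
const unsigned int kFieldBody     = 1u << 2;
const unsigned int kFieldAll      = ~0u;  // no restriction

struct FieldQuery {
    unsigned int mask;  // which fields the word may match in
    std::string word;   // the word with any recognized "field:" prefix removed
};

// Splits "field:word" into a bit mask and the bare word. Tokens without a
// recognized field prefix are left whole and match in every field.
FieldQuery ParseFieldQuery(const std::string& token) {
    static const std::map<std::string, unsigned int> fields = {
        {"title", kFieldTitle},
        {"keywords", kFieldKeywords},
        {"body", kFieldBody},
    };

    std::string::size_type colon = token.find(':');
    if (colon != std::string::npos) {
        auto it = fields.find(token.substr(0, colon));
        if (it != fields.end()) {
            return FieldQuery{it->second, token.substr(colon + 1)};
        }
    }
    return FieldQuery{kFieldAll, token};
}
```

So "title:linux" would give the title bit and the word "linux", while a 
plain "linux" (or an unknown prefix) searches all fields.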

Anyway, I don't know if I'll have time to work on this on Friday. 
However, I'll try to respond to questions, queries and complaints 
before I leave and if I can, I'll check e-mail while I'm gone.

-Geoff
