Unfortunately I have not had anywhere near the time I wanted for
actual coding over the last few weeks. I'm leaving for three weeks on
Saturday and I do not expect that I'll have e-mail access during that
time.
So I uploaded the ParseTree code and a test program as they stand.
I'd qualify it as a buggy pile of half-started ideas, personally. If
anyone is willing to look through the code and do some cleanup and
fleshing out, it would be greatly appreciated. Otherwise I will pick
up after I return and finish it for a 3.2.0b3 release.
I'll quickly outline what it's *supposed* to do. The ideas are
similar to those outlined by myself and Andrew earlier. The ParseTree
class is the general boolean parser, with subclasses for individual
methods. So far I have implemented AND, OR, and EXACT. Once these are
better fleshed out, it won't be too hard to implement others.
Performing a boolean parse is a bit strange--you set up the top node
and then it performs a bottom-up parse of the query and inherits that
as its child. The main code can then use the root node for top-down
queries and so on.
For consistency's sake, I followed two rules. First, a generic
ParseTree functions essentially as parentheses in a boolean query, so
a generic ParseTree node will have at most one child. Also, all leaves
of any type of ParseTree will be generic ParseTree nodes. I could
have implemented a WordParseTree or something, but this seemed to be
easy enough. This latter point is to make query caching easier.
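
As a rough illustration of the two rules above (this is not the uploaded code; the member names addChild, isLeaf, toString and join, and the subclass names AndParseTree and OrParseTree, are made up for the sketch), the hierarchy could look something like this:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: all member names here are illustrative only.
class ParseTree {
public:
    ParseTree() = default;
    explicit ParseTree(std::string word) : word_(std::move(word)) {}
    virtual ~ParseTree() = default;

    // Rule 1: a generic node acts like parentheses, so it takes one child.
    virtual void addChild(std::unique_ptr<ParseTree> child) {
        assert(children_.empty() && "generic ParseTree holds at most one child");
        children_.push_back(std::move(child));
    }

    // Rule 2: leaves are always generic nodes that just carry a word.
    bool isLeaf() const { return children_.empty(); }

    // Render the subtree back into query text.
    virtual std::string toString() const {
        if (isLeaf()) return word_;
        return "(" + children_.front()->toString() + ")";
    }

protected:
    // Shared by the operator subclasses: join children with an operator name.
    std::string join(const std::string& op) const {
        std::string out;
        for (std::size_t i = 0; i < children_.size(); ++i) {
            if (i > 0) out += " " + op + " ";
            out += children_[i]->toString();
        }
        return out;
    }

    std::string word_;
    std::vector<std::unique_ptr<ParseTree>> children_;
};

// Operator nodes may hold any number of children.
class AndParseTree : public ParseTree {
public:
    void addChild(std::unique_ptr<ParseTree> child) override {
        children_.push_back(std::move(child));
    }
    std::string toString() const override { return join("AND"); }
};

class OrParseTree : public ParseTree {
public:
    void addChild(std::unique_ptr<ParseTree> child) override {
        children_.push_back(std::move(child));
    }
    std::string toString() const override { return join("OR"); }
};
```

Because every leaf is a plain ParseTree carrying a word, the same leaf type turns up under AND, OR, or any later operator node.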
So what needs to be done in order to make this fully functional
(besides fixing all the bugs)? First off, I use the HtWordToken
function, though I probably should be using a special function for
the boolean parse itself that makes sure key characters are never
ignored and "(" and the like are treated as individual tokens. I
probably also want a separate tokenizer for the derived classes too
since this will make AltaVista-style queries easier (i.e. break
"+word" into "+ word").
I also haven't spent any time working on actual query-retrieval.
Ultimately, I'd like to put this code into the WeightWord class
itself and outside the ParseTree hierarchy. Then an individual
ParseTree class will have the logic to combine results and cache as
necessary. The reason for making the nodes all of one type is that a
query for "word1 AND word2" vs. "word1 OR word2" will still use cached
queries.
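
To make the caching point concrete, here is a small sketch under stated assumptions: WordCache, combineAnd and combineOr are hypothetical names, a result is simplified to a sorted list of document IDs, and the fetch callback stands in for whatever WeightWord would actually do against the word database:

```cpp
#include <algorithm>
#include <functional>
#include <iterator>
#include <map>
#include <string>
#include <utility>
#include <vector>

using DocList = std::vector<int>;  // sorted document IDs (an assumption)

// Hypothetical per-word cache: each distinct word is fetched only once.
class WordCache {
public:
    explicit WordCache(std::function<DocList(const std::string&)> fetch)
        : fetch_(std::move(fetch)) {}

    const DocList& lookup(const std::string& word) {
        auto it = cache_.find(word);
        if (it == cache_.end()) {
            ++misses_;  // only a miss goes to the underlying index
            it = cache_.emplace(word, fetch_(word)).first;
        }
        return it->second;
    }

    int misses() const { return misses_; }

private:
    std::function<DocList(const std::string&)> fetch_;
    std::map<std::string, DocList> cache_;
    int misses_ = 0;
};

// An AND node would intersect its children's results...
DocList combineAnd(const DocList& a, const DocList& b) {
    DocList out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// ...and an OR node would take their union.
DocList combineOr(const DocList& a, const DocList& b) {
    DocList out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}
```

Running "word1 AND word2" and then "word1 OR word2" through the same WordCache fetches each word once; only the combining step differs between the two queries.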
As part of putting the query logic into the WeightWord class, I'll
also have this class be responsible for field-restricted queries.
Basically it should recognize "field:word" and parse out the field
and turn it into a bit mask. This puts the field code at the same
level as the query code, where I think it belongs.
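
A rough sketch of the "field:word" handling, with everything in it assumed for illustration: the field names, the bit values, and the parseFieldQuery function are hypothetical and do not correspond to the existing ht://Dig flags.

```cpp
#include <map>
#include <string>

// Hypothetical bit values; the real field flags may well differ.
constexpr unsigned FIELD_TITLE   = 1u << 0;
constexpr unsigned FIELD_HEADING = 1u << 1;
constexpr unsigned FIELD_BODY    = 1u << 2;
constexpr unsigned FIELD_ALL     = FIELD_TITLE | FIELD_HEADING | FIELD_BODY;

struct FieldQuery {
    unsigned mask;     // which fields the word may match in
    std::string word;  // the word with any field prefix removed
};

// Hypothetical parser: recognize "field:word" and turn the field name into
// a bit mask. Tokens without a known field prefix are left unrestricted.
FieldQuery parseFieldQuery(const std::string& token) {
    static const std::map<std::string, unsigned> fields = {
        {"title", FIELD_TITLE},
        {"heading", FIELD_HEADING},
        {"body", FIELD_BODY},
    };

    const std::string::size_type colon = token.find(':');
    if (colon != std::string::npos) {
        auto it = fields.find(token.substr(0, colon));
        if (it != fields.end()) {
            return FieldQuery{it->second, token.substr(colon + 1)};
        }
    }
    // No colon, or an unknown prefix: search every field for the whole token.
    return FieldQuery{FIELD_ALL, token};
}
```

In this sketch "title:linux" yields the title bit and the word "linux", while a token with an unrecognized prefix is passed through whole so that it is not silently dropped.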
Anyway, I don't know if I'll have time to work on this on Friday.
However, I'll try to respond to questions, queries and complaints
before I leave and if I can, I'll check e-mail while I'm gone.
-Geoff