RE: [htdig3-dev] Re: ParseTree implementation

Geoff Hutchison Tue, 22 Aug 2000 09:32:55 -0700
Thanks for your message--I've been hoping for this discussion for a 
while, but it never seemed to come up. (Yet there are 100+ people on 
this list, don't you have opinions?)

At 12:07 PM +0200 8/22/00, Quim Sanmarti wrote:
>If parser and resulting query are independent, different parsers can be
>written to generate queries composed by predefined query elements. This is
>customary in compiler/interpreter construction, where the target object
>code is defined independently from the parser (usually before :).

I actually never said the queries had to be done in the ParseTrees--I 
simply wanted them as a "holding place" for results so that 
intermediate results can be cached. The actual querying of the terms 
is probably better done elsewhere. That code hasn't been written 
because I'm still finding bugs in the ParseTree classes. :-(

That said, I'm always looking for feedback. Since the original 
discussions in February, I've received little (if any) feedback. I 
haven't had any since I actually started putting up code. So I'm glad 
to see your message.

>2. I observe that a ParseTree query is implemented by a n-ary tree.
>Wouldn't it be clearer and simpler to use just unary and binary nodes?
>These are in fact the ones that will be used by query strings. Moreover,
>when evaluating a query, result sets will actually be filterered operator
>by operator in binary/unary fashion.

Not really. Remember an "and" query is really an n-ary operation. 
Furthermore, it's often more efficient to walk several results at 
once rather than pairwise. Think about walking an AND query of 4-5 
queries with a table of each document. To combine all at once, you 
just check to make sure it's in each document and add it to the 
global result set. This is faster than log(n) combinations since you 
don't re-traverse the intermediate values. But I think this is a 
detail.

>I'd like to propose the following (not necessarily complete) class
>hierarchy:

Things like host: url: link: and (of course) field: are to be taken 
care of in the query structure. My feeling was that this should go 
into WeightWord, which would separate any sort of OP:query structure 
and set field bits accordingly.

>operators And, Except, Near, Next, there's no need of evaluating the second
>operand if the first yields no results. This will be a major optimization.
>I suppose that this is already the idea and has been discussed elsewhere,
>but since the related code is not yet written when I write this, I felt i'd
>have  to remark it.

Yes. The other major optimization is that each node in the ParseTree 
can cache the results using an appropriate key (using the 
initialQuery and type perhaps).

>To be able to implement these queries, I propose to modify the DocMatch
>class, in order to include a list (vector?) containing the match locations
>for the corresponding docId. All operators will have to deal with these
>lists.

This would be fine--it's probably more efficient to do phrase 
searching this way. Plus we could implement Google-style proximity 
weighting if we desire (i.e. later).

>5. I see the application of Fuzzy filters to lookups as an open issue.
>Current fuzzy algos can be modeled as operators that result in a 'or' query
>of single words. Is this optimum? Wouldn't it be perferable to have 'leaf'
>queries (inherited from WordQuery) that do 'specialized' lookups according
>to each fuzzy algo, then merged with a single 'or'? Dunno.

My idea was that Fuzzy would be passed down the tree so that 
individual types (i.e. phrase/Exact) can ignore all or some fuzzies. 
When they hit the leafs, these would become OrParseTree objects and 
add the appropriate alternates. I'm open to suggestions on that.

>6. (I'm not sure whether the following applies only to htsearch)
>Altavista disposes of a powerful hypenated-word query similar to the phrase
>query, so that:
>   foo-bar is equivalent to "foo bar" or "foo-bar" or "foobar"
>but also:

This could be a useful fuzzy algorithm. Right now they're not 
phrase-aware. my original idea is that Fuzzy objects can return a 
ParseTree of alternatives, so they can return phrases, normal 
OrParseTrees, etc. This would help for the synonym fuzzy too.

>Is this feasible for our dear digger?

It's not the digger in question, is it? :-)

-Geoff


------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] 
You will receive a message to confirm this.
RE: [htdig3-dev] Re: ParseTree implementation

Reply via email to