RE: [htdig3-dev] Re: ParseTree implementation

Quim Sanmarti Tue, 22 Aug 2000 09:09:52 -0700
> -----Mensaje original-----
> De:   Geoff Hutchison [SMTP:[EMAIL PROTECTED]]
> Enviado el:   miercoles 2 de agosto de 2000 6:15
> Para: htdig3-dev
> Asunto:       [htdig3-dev] Re: ParseTree implementation
>
> OK, I'm finally getting to a point where I have free time to code
> again. Hopefully people will see code going into the tree and become
> willing to contribute too? :-)
>

I'm delighted to see that the ParseTree implementation is coming. Really, 
the htsearch module needs a good deal of rework.

I'll be testing the framework these days, so I'll try to give feedback 
and/or contributions ASAP.

In the meanwhile, I'd like to express some design issues about the 
framework. Sorry if they are late; I just begun recently to work with 
ht://dig. Sorry too if some of them sound radical. I wouldn't like to seem 
an aggressive newbie. So attack them without pity :)

1. I find a major advance the fact that the new framework works in two 
phases: parse once first generating a query tree, then evaluate it if 
parsing was ok.

However, I can't see why the responsibilities of parsing a query string and 
evaluating the resulting query are mixed up in the same class(es). This was 
already a major flaw of the old parser class. By doing like that, the 
modularity and maintainability of the code becomes rather low.

I notice that the idea is to be able to use the same code to parse the 
'classic' ht://dig 'and' 'or' and 'boolean' strings. But IMHO this is 
restrictive and puts serious barriers to further development of query 
syntaxes, (e.g. altavista-style or whatever).

If parser and resulting query are independent, different parsers can be 
written to generate queries composed by predefined query elements. This is 
customary in compiler/interpreter construction, where the target object 
code is defined independently from the parser (usually before :).

2. I observe that a ParseTree query is implemented by a n-ary tree. 
Wouldn't it be clearer and simpler to use just unary and binary nodes? 
These are in fact the ones that will be used by query strings. Moreover, 
when evaluating a query, result sets will actually be filterered operator 
by operator in binary/unary fashion.

I'd like to propose the following (not necessarily complete) class 
hierarchy:

        class Query - abstract
                class WordQuery - leaf for word lookups, but see 5.
                class BinaryOpQuery - abstract
                        class AndQuery
                        class OrQuery
                        class ExceptQuery -- binary Not ('except') ht://dig style
                        class NearQuery
                        class NextQuery -- used to construct phrases, see 4.
                        ...
                class UnaryOpQuery - abstract
                        class NotQuery -- everything but x (deprecated?)
                        class HostQuery -- host:x altavista-style
                        class UrlQuery -- url:x altavista-style
                        class LinkQuery -- link:x altavista-style
                        ...

Well, if someone does need to implement operators that actually need more 
than two operands and that can't be reduced to unary/binary, the class 
hierarchy could be extended :)

BTW, somewere in the code we should comment the 'truth tables' for every 
operator, for input values <no_results> <results> <ignore> of the operands.

3. What makes me most happy about the tree-structured queries is the fact 
that we won't use a stack for query evaluation. This was a major burden in 
the old parser, since it forced to perform a lookup for each word in the 
query *before* it could begin to apply filtering operators. Now it becomes 
no more necessary. Query evaluation should take advantage of this. For 
operators And, Except, Near, Next, there's no need of evaluating the second 
operand if the first yields no results. This will be a major optimization. 
I suppose that this is already the idea and has been discussed elsewhere, 
but since the related code is not yet written when I write this, I felt i'd 
have  to remark it.

4. Note that i've included a 'Next' and 'Near' operators in my whish list 
above.

'Next' would be used for phrase construction so that queries would be pa  
rsed:
"a" -> WordQuery("a")
"a b" -> NextQuery(WordQuery("a"), WordQuery("b"))
and so on.

'Near' is the canonical proximity operator, which could have a parameter 
giving the maximum distance between the two operand queries.

To be able to implement these queries, I propose to modify the DocMatch 
class, in order to include a list (vector?) containing the match locations 
for the corresponding docId. All operators will have to deal with these 
lists.

Since a possible query using 'near' could be '<phrase> near <phrase>', each 
element in the location list would contain both the initial and final 
positions of the match. These would be identical for single words.

5. I see the application of Fuzzy filters to lookups as an open issue. 
Current fuzzy algos can be modeled as operators that result in a 'or' query 
of single words. Is this optimum? Wouldn't it be perferable to have 'leaf' 
queries (inherited from WordQuery) that do 'specialized' lookups according 
to each fuzzy algo, then merged with a single 'or'? Dunno.

6. (I'm not sure whether the following applies only to htsearch)
Altavista disposes of a powerful hypenated-word query similar to the phrase 
query, so that:
  foo-bar is equivalent to "foo bar" or "foo-bar" or "foobar"
but also:
  foo*-bar*-baz would give results using suffix fuzzy for all (?) the 
combinations above.

Personally, I find this syntax most useful, especially for english. I use 
it also for phrase searching without quotes. I spare the shift key :)

Is this feasible for our dear digger?

########

Actually, I already begun to sketch an implementation of some of the ideas 
above; I hope to be able to show them soon.

Saludos,

//  Joaquim Sanmarti
//    GTD Ingenieria de sistemas y software industrial, S.A.
//        c/Rosa Sensat 9-11
//        08005 Barcelona SPAIN
//        Tel. +34 93 225 77 00
//        Fax. +34 93 225 77 08
//    mailto:[EMAIL PROTECTED]
//    http://www.gtd.es



------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] 
You will receive a message to confirm this.
RE: [htdig3-dev] Re: ParseTree implementation

Reply via email to