> -----Mensaje original-----
> De: Geoff Hutchison [SMTP:[EMAIL PROTECTED]]
> Enviado el: miercoles 2 de agosto de 2000 6:15
> Para: htdig3-dev
> Asunto: [htdig3-dev] Re: ParseTree implementation
>
> OK, I'm finally getting to a point where I have free time to code
> again. Hopefully people will see code going into the tree and become
> willing to contribute too? :-)
>
I'm delighted to see that the ParseTree implementation is coming. Really,
the htsearch module needs a good deal of rework.
I'll be testing the framework these days, so I'll try to give feedback
and/or contributions ASAP.
In the meanwhile, I'd like to express some design issues about the
framework. Sorry if they are late; I just begun recently to work with
ht://dig. Sorry too if some of them sound radical. I wouldn't like to seem
an aggressive newbie. So attack them without pity :)
1. I find a major advance the fact that the new framework works in two
phases: parse once first generating a query tree, then evaluate it if
parsing was ok.
However, I can't see why the responsibilities of parsing a query string and
evaluating the resulting query are mixed up in the same class(es). This was
already a major flaw of the old parser class. By doing like that, the
modularity and maintainability of the code becomes rather low.
I notice that the idea is to be able to use the same code to parse the
'classic' ht://dig 'and' 'or' and 'boolean' strings. But IMHO this is
restrictive and puts serious barriers to further development of query
syntaxes, (e.g. altavista-style or whatever).
If parser and resulting query are independent, different parsers can be
written to generate queries composed by predefined query elements. This is
customary in compiler/interpreter construction, where the target object
code is defined independently from the parser (usually before :).
2. I observe that a ParseTree query is implemented by a n-ary tree.
Wouldn't it be clearer and simpler to use just unary and binary nodes?
These are in fact the ones that will be used by query strings. Moreover,
when evaluating a query, result sets will actually be filterered operator
by operator in binary/unary fashion.
I'd like to propose the following (not necessarily complete) class
hierarchy:
class Query - abstract
class WordQuery - leaf for word lookups, but see 5.
class BinaryOpQuery - abstract
class AndQuery
class OrQuery
class ExceptQuery -- binary Not ('except') ht://dig style
class NearQuery
class NextQuery -- used to construct phrases, see 4.
...
class UnaryOpQuery - abstract
class NotQuery -- everything but x (deprecated?)
class HostQuery -- host:x altavista-style
class UrlQuery -- url:x altavista-style
class LinkQuery -- link:x altavista-style
...
Well, if someone does need to implement operators that actually need more
than two operands and that can't be reduced to unary/binary, the class
hierarchy could be extended :)
BTW, somewere in the code we should comment the 'truth tables' for every
operator, for input values <no_results> <results> <ignore> of the operands.
3. What makes me most happy about the tree-structured queries is the fact
that we won't use a stack for query evaluation. This was a major burden in
the old parser, since it forced to perform a lookup for each word in the
query *before* it could begin to apply filtering operators. Now it becomes
no more necessary. Query evaluation should take advantage of this. For
operators And, Except, Near, Next, there's no need of evaluating the second
operand if the first yields no results. This will be a major optimization.
I suppose that this is already the idea and has been discussed elsewhere,
but since the related code is not yet written when I write this, I felt i'd
have to remark it.
4. Note that i've included a 'Next' and 'Near' operators in my whish list
above.
'Next' would be used for phrase construction so that queries would be pa
rsed:
"a" -> WordQuery("a")
"a b" -> NextQuery(WordQuery("a"), WordQuery("b"))
and so on.
'Near' is the canonical proximity operator, which could have a parameter
giving the maximum distance between the two operand queries.
To be able to implement these queries, I propose to modify the DocMatch
class, in order to include a list (vector?) containing the match locations
for the corresponding docId. All operators will have to deal with these
lists.
Since a possible query using 'near' could be '<phrase> near <phrase>', each
element in the location list would contain both the initial and final
positions of the match. These would be identical for single words.
5. I see the application of Fuzzy filters to lookups as an open issue.
Current fuzzy algos can be modeled as operators that result in a 'or' query
of single words. Is this optimum? Wouldn't it be perferable to have 'leaf'
queries (inherited from WordQuery) that do 'specialized' lookups according
to each fuzzy algo, then merged with a single 'or'? Dunno.
6. (I'm not sure whether the following applies only to htsearch)
Altavista disposes of a powerful hypenated-word query similar to the phrase
query, so that:
foo-bar is equivalent to "foo bar" or "foo-bar" or "foobar"
but also:
foo*-bar*-baz would give results using suffix fuzzy for all (?) the
combinations above.
Personally, I find this syntax most useful, especially for english. I use
it also for phrase searching without quotes. I spare the shift key :)
Is this feasible for our dear digger?
########
Actually, I already begun to sketch an implementation of some of the ideas
above; I hope to be able to show them soon.
Saludos,
// Joaquim Sanmarti
// GTD Ingenieria de sistemas y software industrial, S.A.
// c/Rosa Sensat 9-11
// 08005 Barcelona SPAIN
// Tel. +34 93 225 77 00
// Fax. +34 93 225 77 08
// mailto:[EMAIL PROTECTED]
// http://www.gtd.es
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.