RE: [htdig3-dev] htsearch rewrite

Quim Sanmarti Thu, 31 Aug 2000 08:22:18 -0700

Hola,

>
> I don't know. Personally...

OK. Let's suspend there this stimulating discussion :)

> ... If someone wants to
> write out a user query parser, I'd be glad to take it...
>

<verbose=on>

I'd like to announce a first -working- version of the Query 
parsing/evaluation framework I menaced to implement last week. It is 
annexed to this message.

Currently the framework is composed by
        - Parsers for
                And/Or expressions
                Boolean expressions with predefined operator precedence
                Boolean expressions, left-to right associativity
        - Query objects:
                ExactWord
                Operators:
                        Phrase
                        And
                        Or
                        Near
                        Not
        - A modified version of DocMatch and ResultList, prepared for dealing with 
near/phrase operators.

Four parsers are initially there. They descend from an abstract Parser 
class, which contains common code dealing with single word/phrase parsing.

The syntax below corresponds to 'all'/'any' parsers.

expr == factor { factor }
factor == word | '"' phrase '"'
phrase == word { word }
word == well, you know :)

A BooleanQueryParser uses the syntax below. It assigns to each operator a 
different precedence.

expr == andlist { 'or' andlist }
andlist == notlist { 'and' notlist }
notlist == nearlist { 'not' nearlist }
nearlist == factor { 'near' [ '/' number ]  factor }
factor == word | '"' phrase '"' | '(' expr ')'

Another parser called GQueryParser defines a left-to-right equal-precedence 
syntax:

expr == factor { filter }
filter == 'and'  factor | 'or' factor | 'not'  factor | 'near' factor
factor == word | '"' phrase '"' | '(' expr ')'


Query objects are a 'family' too:

WordQuery objects return the corresponding ResultLists by calling a wrapper 
WordSearcher class. Database access is roughly based on existing code. It 
is done anyway by the wrapper class as yous suggested, so it's fairly 
independent from the rest.

OperatorQuery objects work on ResultLists. They are all evaluated in n-ary 
fashion, (n is restricted to 2 for 'near', right now).  As you signalled, 
this is clearly better than binary ops, especially for 'and'. Consequently, 
The parsing is done so that

        a and b and c or d or e not "f g" not h near i

would be translated to a Query object tree like

        Or(And(a, b, c), d, Not(e, Phrase(f, g), Near(h, i))

(or
         Near(Not(Or(And(a, b, c), d, e), Phrase(f, g), h), i)
depending of the chosen boolean parser :)

'AndParser', 'OrParser'  generate a single Operator on the top of a 
sequence of words/phrases. A 'NearParser' could be included also.

Each parser may have its own lexer. Initially, a family of lexers based on 
a hacked version of yours in ParseTree is included.

Result caching is supported, but is not persistent between eventual program 
invocations. I guess it's feasible to support a cache between invocations, 
too. The cache is basically a Dictionary of ResultLists indexed by the 
corresponding parsed query string (GetLogicalWords), as you pointed. Each 
Query object will lookup the cache before trying to evaluate itself. 
(problem: the order of factors is important here. Query strings 'foo and 
bar' and 'bar and foo' will store results two different cache entries.) 
Caching is done at *all* query levels, so
        '(a and (b or c)) not (d near x))'
will cache results for
        a, b, c, d, x, (b or c), (d near x), (a and (b or c)), and the full thing.

The work of composing fuzzy combinations is done by an extensible family of 
query factories (FuzzyExpander). The currently implemented Fuzzy expander 
performs currently an 'Or' of the results from the different configured 
fuzzy algos. I haven't studied enough Gilles' suggestion about fuzzy 
chaining, but it can be down by deriving a class from FuzzyExpander. Fuzzy 
expansion is performed at the moment of query parse.

No weighted scoring is performed. During query evaluation, only an ordered 
list with match locations, flags and weights from word references is stored 
with each DocMatch. Thus, weighted scoring per match can be done after 
evaluation. Some special manipulation of flags and weights is done by 
operators Phrase and Near.

Further optimisations are to be studied. For instance, it would be nice to 
be able to simplify things such as 'a or a = a',  'a not a = 0', etc... 
before or during evaluation.

Further extensions might include "fuzzy phrases" or an altavista/google 
style parser. The first implies only a syntax extension. The second new 
operators also -- never mind :)

The fact of having all the match locations could be useful also to perform 
*proximity* scoring after evaluation, as you remarked.

Well, it's already under the first tests, but IMHO the results are 
promising.
Phrase search is *quite* faster. Testing phrase "las del" takes about 2 
secs on the GTD intranet's digger database (match counts for each word and 
the phrase are resp. 70270, 112379 and 82). This query makes the current 
version to time out.

I'll be testing/cleaning/commenting the code next week(s?); I even promise 
a nice UML diagram with all the stuff. I'd love to have feedback on the 
architecture and/or the code itself.

Saludos,




//  Joaquim Sanmarti
//    GTD Ingenieria de sistemas y software industrial, S.A.
//        c/Rosa Sensat 9-11
//        08005 Barcelona SPAIN
//        Tel. +34 93 225 77 00
//        Fax. +34 93 225 77 08
//    mailto:[EMAIL PROTECTED]
//    http://www.gtd.es

QueryParser.tar.gz

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] 
You will receive a message to confirm this.

RE: [htdig3-dev] htsearch rewrite

Reply via email to