Hola,
>
> I don't know. Personally...
OK. Let's suspend there this stimulating discussion :)
> ... If someone wants to
> write out a user query parser, I'd be glad to take it...
>
<verbose=on>
I'd like to announce a first -working- version of the Query
parsing/evaluation framework I menaced to implement last week. It is
annexed to this message.
Currently the framework is composed by
- Parsers for
And/Or expressions
Boolean expressions with predefined operator precedence
Boolean expressions, left-to right associativity
- Query objects:
ExactWord
Operators:
Phrase
And
Or
Near
Not
- A modified version of DocMatch and ResultList, prepared for dealing with
near/phrase operators.
Four parsers are initially there. They descend from an abstract Parser
class, which contains common code dealing with single word/phrase parsing.
The syntax below corresponds to 'all'/'any' parsers.
expr == factor { factor }
factor == word | '"' phrase '"'
phrase == word { word }
word == well, you know :)
A BooleanQueryParser uses the syntax below. It assigns to each operator a
different precedence.
expr == andlist { 'or' andlist }
andlist == notlist { 'and' notlist }
notlist == nearlist { 'not' nearlist }
nearlist == factor { 'near' [ '/' number ] factor }
factor == word | '"' phrase '"' | '(' expr ')'
Another parser called GQueryParser defines a left-to-right equal-precedence
syntax:
expr == factor { filter }
filter == 'and' factor | 'or' factor | 'not' factor | 'near' factor
factor == word | '"' phrase '"' | '(' expr ')'
Query objects are a 'family' too:
WordQuery objects return the corresponding ResultLists by calling a wrapper
WordSearcher class. Database access is roughly based on existing code. It
is done anyway by the wrapper class as yous suggested, so it's fairly
independent from the rest.
OperatorQuery objects work on ResultLists. They are all evaluated in n-ary
fashion, (n is restricted to 2 for 'near', right now). As you signalled,
this is clearly better than binary ops, especially for 'and'. Consequently,
The parsing is done so that
a and b and c or d or e not "f g" not h near i
would be translated to a Query object tree like
Or(And(a, b, c), d, Not(e, Phrase(f, g), Near(h, i))
(or
Near(Not(Or(And(a, b, c), d, e), Phrase(f, g), h), i)
depending of the chosen boolean parser :)
'AndParser', 'OrParser' generate a single Operator on the top of a
sequence of words/phrases. A 'NearParser' could be included also.
Each parser may have its own lexer. Initially, a family of lexers based on
a hacked version of yours in ParseTree is included.
Result caching is supported, but is not persistent between eventual program
invocations. I guess it's feasible to support a cache between invocations,
too. The cache is basically a Dictionary of ResultLists indexed by the
corresponding parsed query string (GetLogicalWords), as you pointed. Each
Query object will lookup the cache before trying to evaluate itself.
(problem: the order of factors is important here. Query strings 'foo and
bar' and 'bar and foo' will store results two different cache entries.)
Caching is done at *all* query levels, so
'(a and (b or c)) not (d near x))'
will cache results for
a, b, c, d, x, (b or c), (d near x), (a and (b or c)), and the full thing.
The work of composing fuzzy combinations is done by an extensible family of
query factories (FuzzyExpander). The currently implemented Fuzzy expander
performs currently an 'Or' of the results from the different configured
fuzzy algos. I haven't studied enough Gilles' suggestion about fuzzy
chaining, but it can be down by deriving a class from FuzzyExpander. Fuzzy
expansion is performed at the moment of query parse.
No weighted scoring is performed. During query evaluation, only an ordered
list with match locations, flags and weights from word references is stored
with each DocMatch. Thus, weighted scoring per match can be done after
evaluation. Some special manipulation of flags and weights is done by
operators Phrase and Near.
Further optimisations are to be studied. For instance, it would be nice to
be able to simplify things such as 'a or a = a', 'a not a = 0', etc...
before or during evaluation.
Further extensions might include "fuzzy phrases" or an altavista/google
style parser. The first implies only a syntax extension. The second new
operators also -- never mind :)
The fact of having all the match locations could be useful also to perform
*proximity* scoring after evaluation, as you remarked.
Well, it's already under the first tests, but IMHO the results are
promising.
Phrase search is *quite* faster. Testing phrase "las del" takes about 2
secs on the GTD intranet's digger database (match counts for each word and
the phrase are resp. 70270, 112379 and 82). This query makes the current
version to time out.
I'll be testing/cleaning/commenting the code next week(s?); I even promise
a nice UML diagram with all the stuff. I'd love to have feedback on the
architecture and/or the code itself.
Saludos,
// Joaquim Sanmarti
// GTD Ingenieria de sistemas y software industrial, S.A.
// c/Rosa Sensat 9-11
// 08005 Barcelona SPAIN
// Tel. +34 93 225 77 00
// Fax. +34 93 225 77 08
// mailto:[EMAIL PROTECTED]
// http://www.gtd.es
QueryParser.tar.gz
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.