RE: [htdig3-dev] htsearch rewrite

Quim Sanmarti Mon, 04 Sep 2000 03:46:48 -0700
> -----Original Message-----
> From:   Geoff Hutchison [SMTP:[EMAIL PROTECTED]]
> Sent:   Thursday, 31 August 2000 21:02
> .. I'll put this in...

Thanks a lot. I'm happy and proud to be of some help.
Expect a more thoroughly commented and cleaner release real soon.
There are pending issues, too: I suspect that there's a memory leak
somewhere, and that the lexer stuff needs some review. Well, *at least*
these :)

BTW, is there any project standard concerning comments, indentation,
naming, etc. that I should follow?

>
> One of the popular requests is for a method=exact, which should be added
> but shouldn't be hard.
>
It isn't. A specific parser can be derived from QueryParser, which
already implements phrase parsing. That would be elegant but overkill,
since the work can be done just by quoting the query string before passing
it to one of the existing parsers. The resulting Query tree will be
identical.
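
A minimal sketch of the quoting approach, assuming the existing parsers
treat a double-quoted string as a phrase. MakeExactQuery is a hypothetical
helper name, and dropping embedded quotes is my own assumption about how
to keep the result a single phrase:

```cpp
#include <string>

// Hypothetical helper (not part of htsearch): turn a raw query string into
// a single phrase query by dropping any embedded double quotes and wrapping
// what remains in one pair of quotes.
std::string MakeExactQuery(const std::string &raw)
{
    std::string quoted = "\"";
    for (char c : raw) {
        if (c != '"')
            quoted += c;
    }
    quoted += '"';
    return quoted;
}
```

The returned string would then be handed unchanged to whichever parser
is already in use.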

> I think the idea is to allow people to turn on a small BerkeleyDB (or
> other indexed file) for pre-parsed results. Querying for a word is only
> going to be faster if it has pre-computed scores, but the biggies would
> be for already computed And/Or/Near/Not levels and of course for the query
> itself. The latter is obviously a big win when you go to Page 2 of the
> results. :-)
>
Yessir. The question now is to define the cache policy and its parameters. 
Size? Expiration by age, LRU? What else?
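
To make the parameters concrete, here is a rough sketch of one possible
policy: a size-bounded LRU cache with expiration by age. Everything in it
is an assumption for discussion (the ResultCache name, string values
standing in for result lists, the clock passed in as 'now'); it is not
existing htsearch code.

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <string>

// Hypothetical result cache keyed by query signature. Values are plain
// strings here; a real cache would hold result lists.
class ResultCache {
    struct Entry { std::string key; std::string value; long stored; };

public:
    ResultCache(std::size_t maxEntries, long maxAgeSeconds)
        : maxEntries_(maxEntries), maxAge_(maxAgeSeconds) {}

    // Store (or replace) an entry and evict the least recently used one
    // if the size limit is exceeded.
    void Put(const std::string &key, const std::string &value, long now)
    {
        Erase(key);
        order_.push_front(Entry{key, value, now});
        index_[key] = order_.begin();
        if (order_.size() > maxEntries_) {
            std::string victim = order_.back().key;
            Erase(victim);
        }
    }

    // Returns true and fills 'value' on a hit that has not expired.
    bool Get(const std::string &key, long now, std::string &value)
    {
        auto it = index_.find(key);
        if (it == index_.end())
            return false;
        if (now - it->second->stored > maxAge_) {
            Erase(key);                      // too old: drop it
            return false;
        }
        // Move the entry to the front: it is now the most recently used.
        order_.splice(order_.begin(), order_, it->second);
        value = it->second->value;
        return true;
    }

private:
    void Erase(const std::string &key)
    {
        auto it = index_.find(key);
        if (it == index_.end())
            return;
        order_.erase(it->second);
        index_.erase(it);
    }

    std::list<Entry> order_;                                   // front = newest
    std::map<std::string, std::list<Entry>::iterator> index_;  // key -> entry
    std::size_t maxEntries_;
    long maxAge_;
};
```

With this shape the policy reduces to two configuration values, a maximum
number of entries and a maximum age in seconds.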

> > (problem: the order of factors is important here. Query strings 'foo and
> > bar' and 'bar and foo' will store results in two different cache entries.)
>
> I think this may be easier, esp. if we're scoring proximity. Often a user
> entering foo and bar wants things scored slightly differently than bar and
> foo.
>
Well, the 'Near' operator implementation is symmetric right now, so 'foo 
near bar' yields the same results as 'bar near foo'. Isn't this OK?

My original concern, anyway, is finding unique cache keys. I'm thinking
that a possible solution is to implement OperatorQuery::Signature slightly
differently from OperatorQuery::GetLogicalWords, so that the operands of
symmetric operators are lexicographically sorted. Thus,

Signature of (b and c and a) --> "OperatorQuery:(a and b and c)"
Signature of (z not b not a) --> "OperatorQuery:(z not a not b)"
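
A rough sketch of that sorting rule, written as a free function rather
than the real OperatorQuery::Signature (MakeSignature and its parameters
are hypothetical). It assumes that for 'not' only the first operand is
position-sensitive, which is what the second example implies:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for OperatorQuery::Signature.
// 'op' is the operator name; 'operands' are the child signatures.
std::string MakeSignature(const std::string &op,
                          std::vector<std::string> operands)
{
    // For 'not', the first operand is the base set and must stay first;
    // only the excluded operands are interchangeable.
    std::size_t firstSortable = (op == "not" && !operands.empty()) ? 1 : 0;
    std::sort(operands.begin() + firstSortable, operands.end());

    std::string sig = "OperatorQuery:(";
    for (std::size_t i = 0; i < operands.size(); ++i) {
        if (i > 0)
            sig += " " + op + " ";
        sig += operands[i];
    }
    sig += ")";
    return sig;
}
```

Called with "and" and the operands b, c, a it produces the first
signature above; with "not" and z, b, a it produces the second.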


> > ...fuzzy algos. I haven't studied enough Gilles' suggestion...
>
> I think we'll want to hold off on this a bit...
>
OK. Just wanted to remark that new fuzzy combination policies can be 
implemented without touching the existing code.

> > Further optimisations are to be studied. For instance, it would be nice to
> > be able to simplify things such as 'a or a = a',  'a not a = 0', etc...
> > before or during evaluation.
>
> I don't know how often this would kick in. Maybe if people are entering
> things like "what is the weight of the moon" -> "what or is or the or
> weight or of or moon" Plus, if you're doing query caching, you're going to
> get a cache hit for the second item.
>
Yes. I was just thinking about avoiding the CPU-intensive work done by
operators *after* the cache hit. When you evaluate 'a and a' you must
iterate over the results of the first 'a' and look up each DocMatch in the
results of the second 'a' to see if it is there. Hmm, they should be :)
You would notice this only in special cases, e.g. when words are repeated
in a query, which can be induced by fuzzy algorithms.

The result cache can be a help here, since the evaluation of 'a' will
always yield the *same* pointer to a result list. Thus, the optimisation
can be based on very fast pointer comparisons.
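
A sketch of the idea for 'and', assuming the cache hands out shared
pointers to immutable result lists. ResultList, ResultPtr and EvaluateAnd
are hypothetical stand-ins, and summing the two scores is only a
placeholder for whatever the real operator does:

```cpp
#include <map>
#include <memory>

// Hypothetical stand-ins: a result list maps a document id to a score.
typedef std::map<int, double> ResultList;
typedef std::shared_ptr<const ResultList> ResultPtr;

// Evaluate 'left and right'. Both pointers are assumed to be non-null.
ResultPtr EvaluateAnd(const ResultPtr &left, const ResultPtr &right)
{
    // 'a and a': both operands are the same cached list, so the answer
    // is that list. Only a pointer comparison is needed.
    if (left == right)
        return left;

    // General case: keep the documents present in both lists.
    auto out = std::make_shared<ResultList>();
    for (const auto &entry : *left) {
        auto hit = right->find(entry.first);
        if (hit != right->end())
            (*out)[entry.first] = entry.second + hit->second;  // placeholder score
    }
    return out;
}
```

The same test would let 'a or a' return 'a' directly and 'a not a'
return an empty list.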

Never mind, this is a question of detail. I'll try to work on it if I find
some extra time.

>
> It looks like I'll want to work on WordQuery et al. I promised long ago
> to allow returning all documents for a query of "*" as well as restricting
> words to specific flags like "title:Foo."
>
Great. Let me offer some hints:

'title:' and the like (i.e. flag restrictions) can be implemented either as
result generators, as ExactWordQuery is, or as (unary) filtering operators
derived from OperatorQuery. I'd suggest the latter, because
1.- A generator (title:<word>) implies iterating over all WordReferences
to the word anyway.
2.- An operator (title:<expression>) is more flexible. Phrases or full
boolean expressions can be filtered by flags. This way, you might write
title:foo
title:"foo bar baz"
title:(foo or bar)
3.- The modifications to the parser(s) are simpler :)

Remember that the DocMatch location list now includes the original
WordReference flags for each location, so flag filtering becomes
straightforward.
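
To illustrate the filtering-operator option, here is a simplified sketch.
The Location and DocMatch structs, the FLAG_TITLE value and FilterByFlag
are all hypothetical and much smaller than the real classes; the only
thing taken from the actual code is that each location carries its flags:

```cpp
#include <vector>

// Hypothetical, simplified types; the real DocMatch and flag values differ.
struct Location {
    int position;        // word position in the document
    unsigned int flags;  // WordReference flags recorded for this location
};

struct DocMatch {
    int docId;
    std::vector<Location> locations;
};

const unsigned int FLAG_TITLE = 1u << 0;  // assumed bit value

// Unary filter: keep only the locations carrying 'flag', and drop any
// match that is left without locations.
std::vector<DocMatch> FilterByFlag(const std::vector<DocMatch> &input,
                                   unsigned int flag)
{
    std::vector<DocMatch> output;
    for (const DocMatch &match : input) {
        DocMatch kept;
        kept.docId = match.docId;
        for (const Location &loc : match.locations) {
            if (loc.flags & flag)
                kept.locations.push_back(loc);
        }
        if (!kept.locations.empty())
            output.push_back(kept);
    }
    return output;
}
```

Because the filter only looks at the results of its operand, it works the
same way whether that operand is a word, a phrase or a boolean expression.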

Now the question is whether to include this feature in all parsers.
For simple parsers, a new syntax could be:

expr == factor { factor }
factor == [ <flagname>':' ] ( word | '"' phrase '"' )
phrase == word { word }
word == well, you know :)

For boolean parsers, the 'factor' level should be extended to support
flags, too.
...
factor == [ <flagname>':' ] ( word | '"' phrase '"' | '(' expr ')' )
...
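
As an illustration, a small recursive-descent sketch of the simple-parser
grammar given earlier. SimpleParser and Factor are hypothetical names, and
since 'word' is left undefined above, treating a word as a run of
alphanumeric characters is purely my assumption:

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical parse result for one factor.
struct Factor {
    std::string flag;                // empty when there is no '<flagname>:'
    std::vector<std::string> words;  // one word, or several for a phrase
};

class SimpleParser {
public:
    explicit SimpleParser(const std::string &text) : text_(text), pos_(0) {}

    // expr == factor { factor }
    std::vector<Factor> ParseExpr()
    {
        std::vector<Factor> factors;
        SkipSpaces();
        while (pos_ < text_.size()) {
            factors.push_back(ParseFactor());
            SkipSpaces();
        }
        if (factors.empty())
            throw std::runtime_error("empty query");
        return factors;
    }

private:
    // factor == [ <flagname>':' ] ( word | '"' phrase '"' )
    Factor ParseFactor()
    {
        Factor factor;
        if (AtQuote()) {
            ParsePhrase(factor);
            return factor;
        }
        std::string word = ParseWord();
        if (pos_ < text_.size() && text_[pos_] == ':') {
            ++pos_;                  // the word was a flag name
            factor.flag = word;
            if (AtQuote())
                ParsePhrase(factor);
            else
                factor.words.push_back(ParseWord());
        } else {
            factor.words.push_back(word);
        }
        return factor;
    }

    // phrase == word { word }, enclosed in double quotes
    void ParsePhrase(Factor &factor)
    {
        ++pos_;                      // opening quote
        SkipSpaces();
        while (pos_ < text_.size() && text_[pos_] != '"') {
            factor.words.push_back(ParseWord());
            SkipSpaces();
        }
        if (pos_ >= text_.size())
            throw std::runtime_error("unterminated phrase");
        ++pos_;                      // closing quote
        if (factor.words.empty())
            throw std::runtime_error("empty phrase");
    }

    // word == run of alphanumeric characters (assumption)
    std::string ParseWord()
    {
        std::size_t start = pos_;
        while (pos_ < text_.size() &&
               std::isalnum(static_cast<unsigned char>(text_[pos_])))
            ++pos_;
        if (pos_ == start)
            throw std::runtime_error("word expected");
        return text_.substr(start, pos_ - start);
    }

    bool AtQuote() const { return pos_ < text_.size() && text_[pos_] == '"'; }

    void SkipSpaces()
    {
        while (pos_ < text_.size() &&
               std::isspace(static_cast<unsigned char>(text_[pos_])))
            ++pos_;
    }

    std::string text_;
    std::size_t pos_;
};
```

The boolean variant would add a '(' expr ')' branch to ParseFactor and
leave the flag handling untouched.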

Concerning '*', isn't this a particular case of the Regex fuzzy? Sorry if 
the question is naive, I'm not well acquainted with Regex. We're using just 
'accents' and 'prefix' here at GTD. I'll review previous discussions in the 
mailing list, then I'll try to help you there.

Regards,

//  Joaquim Sanmarti
//    GTD Ingenieria de sistemas y software industrial, S.A.
//        c/Rosa Sensat 9-11
//        08005 Barcelona SPAIN
//        Tel. +34 93 225 77 00
//        Fax. +34 93 225 77 08
//    mailto:[EMAIL PROTECTED]
//    http://www.gtd.es


