RE: [htdig3-dev] htsearch rewrite

Geoff Hutchison Mon, 04 Sep 2000 12:45:41 -0700
At 12:43 PM +0200 9/4/00, Quim Sanmarti wrote:
>BTW, is there any project standard concerning comments, indentation,
>naming, etc that I must use?

Not really. Commenting often is something I've been pushing, esp. 
comments at the beginning of procedures. But so far most people have 
contributed fairly clear code. As you've probably worked out, we're 
pretty open about it. It's not like we have tons of code to turn away 
(vs. projects like gcc, apache, linux-kernel, etc.)

>Yessir. The question now is to define the cache policy and its parameters.
>Size? Expiration by age, LRU? What else?

Size should definitely be configurable (so that some can even turn it 
off). Expiration by age might not be a bad first approach, but 
eventually we may want to let this be configurable too. Certainly if 
we make a flexible cache architecture, someone else could come in and 
code their hearts out. :-)

(I think some larger sites may want to expire by something like 
hits/age to keep common queries around.)
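
To make the policy discussion concrete, here is a rough sketch of a cache with a configurable size (0 turns it off), expiration by age, and eviction by lowest hits/age. Everything in it is an assumption for illustration: QueryCache, CacheEntry, Lookup, Store and the int document IDs are hypothetical names, not existing ht://Dig code, and it assumes a C++11 compiler. The clock is passed in as a parameter so the policy is easy to test.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: none of these names exist in ht://Dig.
struct CacheEntry {
    std::vector<int> doc_ids;  // cached result list for one query
    std::time_t created;       // when the entry was stored
    unsigned long hits;        // how many times it has been served
};

class QueryCache {
public:
    // max_entries == 0 turns the cache off; max_age is in seconds.
    QueryCache(std::size_t max_entries, std::time_t max_age)
        : max_entries_(max_entries), max_age_(max_age) {}

    // Returns the cached results, or nullptr on a miss or an expired entry.
    const std::vector<int>* Lookup(const std::string& key, std::time_t now) {
        auto it = entries_.find(key);
        if (it == entries_.end()) return nullptr;
        if (now - it->second.created > max_age_) {  // expiration by age
            entries_.erase(it);
            return nullptr;
        }
        ++it->second.hits;
        return &it->second.doc_ids;
    }

    void Store(const std::string& key, const std::vector<int>& ids,
               std::time_t now) {
        if (max_entries_ == 0) return;  // cache disabled
        if (entries_.count(key) == 0 && entries_.size() >= max_entries_)
            EvictOne(now);
        entries_[key] = CacheEntry{ids, now, 0};
    }

    std::size_t Size() const { return entries_.size(); }

private:
    // hits/age score: frequently used entries score higher.
    static double Score(const CacheEntry& e, std::time_t now) {
        double age = static_cast<double>(now - e.created) + 1.0;  // never 0
        return static_cast<double>(e.hits) / age;
    }

    // Drops the entry with the lowest score. Only called when non-empty.
    void EvictOne(std::time_t now) {
        auto victim = entries_.begin();
        double worst = Score(victim->second, now);
        for (auto it = entries_.begin(); it != entries_.end(); ++it) {
            double s = Score(it->second, now);
            if (s < worst) { worst = s; victim = it; }
        }
        entries_.erase(victim);
    }

    std::size_t max_entries_;
    std::time_t max_age_;
    std::map<std::string, CacheEntry> entries_;
};
```

Keeping Score() separate is deliberate: swapping in plain LRU or pure age-based eviction later would only touch that one function.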

And, of course, everything should expire if the database modification 
time changes! Perhaps we keep a special record for the previous 
word_db modification time?
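
One possible shape for that "special record", as a sketch only: the cache remembers the last word database modification time it saw and flushes itself when a different one shows up. StampedCache and its methods are hypothetical names, and where the timestamp comes from is left to the caller (for instance st_mtime from a stat() call on the database file, which is an assumption about deployment, not something decided here).

```cpp
#include <cassert>
#include <ctime>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch; the class and member names are not ht://Dig's.
class StampedCache {
public:
    StampedCache() : db_mtime_(0) {}

    // Call before any lookup with the word database's current modification
    // time. Flushes every entry if it differs from the recorded one.
    // Returns true when a flush happened.
    bool SyncWithDatabase(std::time_t current_mtime) {
        if (current_mtime == db_mtime_) return false;
        entries_.clear();
        db_mtime_ = current_mtime;  // update the special record
        return true;
    }

    void Store(const std::string& key, const std::vector<int>& ids) {
        entries_[key] = ids;
    }

    bool Contains(const std::string& key) const {
        return entries_.count(key) != 0;
    }

private:
    std::time_t db_mtime_;  // previous word_db modification time
    std::map<std::string, std::vector<int>> entries_;
};
```

Comparing for inequality rather than "newer than" also catches a database that was restored from an older copy.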

>Well, the 'Near' operator implementation is symmetric right now, so 'foo
>near bar' yields the same results as 'bar near foo'. Isn't this OK?

Yes, but I think Near may be one of the few operators that really 
should be symmetric.

>My original issue is to find unique cache indexes anyway. I'm thinking that
>a possible solution is to implement OperatorQuery::Signature slightly
>different to OperatorQuery::GetLogicalWords, so that symmetric operands are
>lexicographically sorted. Thus,

Fair enough, though I think GetLogicalWords is still a good first 
approximation. It's years better than what we have now. :-)
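
For reference, the sorted-operand idea might look roughly like the sketch below. MakeSignature is a hypothetical free function, not the real OperatorQuery::Signature, and the exact key format (parentheses, spaces) is made up for illustration. The caller says whether the operator is symmetric; "not" is used in the examples only as an operator where order obviously matters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: builds a cache key for "a OP b OP c".
// For symmetric operators the operands are sorted lexicographically first,
// so every ordering of the same operands yields the same key.
std::string MakeSignature(const std::string& op,
                          std::vector<std::string> operands,  // copy, sorted
                          bool symmetric) {
    if (symmetric) std::sort(operands.begin(), operands.end());
    std::string sig = "(";
    for (std::size_t i = 0; i < operands.size(); ++i) {
        if (i > 0) sig += " " + op + " ";
        sig += operands[i];
    }
    sig += ")";
    return sig;
}
```

With this, MakeSignature("near", {"foo", "bar"}, true) and MakeSignature("near", {"bar", "foo"}, true) both give "(bar near foo)", while a non-symmetric operator keeps the operand order and so gets distinct keys.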

[on query optimizations]
>Never mind, this is a question of detail. I'll try to advance it iff I find
>some extra time.

Yeah, I'd like to push getting this integrated into htsearch first 
and then hacking on all the STATUS items. :-)

>2.- An operator (title:<expression>) is more flexible. Phrases or full
>boolean expressions can be filtered by flags. This way, you might write
>title:foo
>title:"foo bar baz"
>title:(foo or bar)
>3.- The modifications to the parser(s) are simpler :)

Good. Both of these are the direction I was heading. Certainly #2 is 
important since that's how I'd want to specify a title filter. I'll 
start with just the simple parsers and test that out. Yes, your 
modifications to DocMatch were exactly the direction I was thinking 
for the flags, etc. I can't say I had much time to code though. :-(
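
A very rough sketch of what filtering by flags could mean for something like title:(foo or bar): evaluate the inner expression as usual, then keep only the matches whose flags include the requested field. The Match struct, the FLAG_* constants and FilterByFlags are all hypothetical and much simpler than the real DocMatch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical flag bits; the real flag values may differ.
const unsigned FLAG_TEXT  = 1u << 0;
const unsigned FLAG_TITLE = 1u << 1;

// Hypothetical, stripped-down stand-in for a match record.
struct Match {
    int doc_id;
    unsigned flags;  // fields in which the expression matched
};

// Keeps only the matches that occurred in at least one required field.
std::vector<Match> FilterByFlags(const std::vector<Match>& matches,
                                 unsigned required) {
    std::vector<Match> kept;
    for (const Match& m : matches)
        if (m.flags & required) kept.push_back(m);
    return kept;
}
```

Because the filter works on a result list, it does not care whether that list came from a single word, a phrase or a full boolean expression.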

>Concerning '*', isn't this a particular case of the Regex fuzzy? Sorry if
>the question is naive, I'm not well acquainted with Regex. We're using just

Yes and no. It could be, but that's really slow. Since you know 
*everything* is going to match, why not just grab DocURLs() or 
DocIDs() up front? It's much faster than having to hit the word 
database at all.
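
As an illustration of that shortcut, here is a sketch using in-memory stand-ins for the two databases. Index, Search and word_db_reads are hypothetical; the real code would go through DocURLs() or DocIDs() rather than a vector. The counter is only there to show that the '*' path never touches the word database.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory stand-ins for the document and word databases.
struct Index {
    std::vector<int> all_doc_ids;                       // document database
    std::map<std::string, std::set<int>> word_to_docs;  // word database
    int word_db_reads = 0;                              // word-db accesses
};

std::vector<int> Search(Index& index, const std::string& query) {
    // '*' matches everything, so return all documents up front.
    if (query == "*") return index.all_doc_ids;

    // Ordinary single-word lookup goes through the word database.
    ++index.word_db_reads;
    auto it = index.word_to_docs.find(query);
    if (it == index.word_to_docs.end()) return {};
    return std::vector<int>(it->second.begin(), it->second.end());
}
```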

-Geoff

