Thanks for your explanations, Geoff :)  More questions follow.

On Saturday 01 March 2003 04:51, Geoff Hutchison wrote:
> > 1.  location must be in the range 0-1000?
> That's a 3.1-ism.
>
> > 2. Could we add "meta" information
> > at successive locations starting from, say, location 10,000?
>
> Actually, now that I think about it, a better idea is to use
> negative word locations for META information.
> As for some other arbitrary
> number--we might actually have documents that long (esp. with PDF
> indexing).

That could have its own problems.  If they are labelled -1, -2, ...
then phrase searching would have to match *backwards* for negative
numbers.  And if true positions ever overflowed into negative numbers,
those phrases would no longer match.  (If such overflow is impossible
with n-bit numbers, we could use *unsigned* locations, and count
forward from 2^(n-1) for meta information.)  If we count *forward*
from a very negative number, that is essentially the same as starting
from a very large (unsigned) location.  Thoughts?
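
To make the unsigned idea concrete, here is a rough sketch in Python.
It is not ht://Dig code: N_BITS, META_BASE and both function names are
made up for illustration, and 16 bits is an arbitrary choice.  Body
words count forward from 0, meta words count forward from 2^(n-1), and
the same forward-only phrase match works for both.

```python
# Hypothetical sketch, not ht://Dig code.  N_BITS and META_BASE are assumptions.
N_BITS = 16
META_BASE = 1 << (N_BITS - 1)  # first location reserved for meta words


def assign_locations(body_words, meta_words):
    """Map each word to its locations: body from 0, meta from META_BASE."""
    if len(body_words) > META_BASE:
        raise ValueError("body text would overflow into the meta range")
    if len(meta_words) > (1 << N_BITS) - META_BASE:
        raise ValueError("meta text would overflow the n-bit location")
    index = {}
    for loc, word in enumerate(body_words):
        index.setdefault(word, []).append(loc)
    for offset, word in enumerate(meta_words):
        index.setdefault(word, []).append(META_BASE + offset)
    return index


def phrase_matches(index, phrase):
    """Return start locations where the phrase words sit at consecutive,
    increasing locations (forward matching only)."""
    if not phrase:
        return []
    return [start for start in index.get(phrase[0], [])
            if all(start + i in index.get(word, ())
                   for i, word in enumerate(phrase))]
```

Because the body and meta ranges never touch, a phrase cannot match
across the boundary between the last body word and the first meta word.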

> > 3. With phrase searching, do we still need  valid_punctuation? 
> > For example, "post-doctoral"
>
> This is a strange example. What if I had a hyphenated word? I don't
> know that your "phrase conversion" is the best solution. What we do
> need is a flexible "word parser" that addresses some of these
> issues.

I suppose a key question is how often people do phrase searches vs
word searches.  Optionally-hyphenated words are trouble-prone since
the status quo gives oh-so-many false negatives for non-hyphenated
phrase-queries applied to over-hyphenated text...  (The suggestion
was based on what Google does.)

Regarding flexibility, we could make  htsearch  treat words separated
by "invalid" punctuation (but no spaces) as a phrase, and make the
default  valid_punctuation  empty.  That way people who want the
current functionality can have it (except for queries where words are
not separated by spaces but *should* match as separate words?), while
the default would be less buggy for phrase searches.
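
A rough sketch of that query handling, in Python rather than the real
htsearch code.  The function name parse_query is hypothetical, and I am
assuming that characters in  valid_punctuation  are simply stripped
from a word while any other punctuation splits it.

```python
import re


def parse_query(query, valid_punctuation=""):
    """Split a query into terms; each term is a list of words.

    A term with more than one word is meant to be searched as a phrase.
    Assumption: valid_punctuation characters are dropped from a word,
    and any other non-word character separates words.
    """
    terms = []
    for chunk in query.split():
        # "Valid" punctuation is removed, joining the parts into one word.
        for ch in valid_punctuation:
            chunk = chunk.replace(ch, "")
        # Anything else that is not a word character splits the chunk.
        words = [w for w in re.split(r"\W+", chunk) if w]
        if words:
            terms.append(words)
    return terms
```

With the proposed empty default, parse_query("post-doctoral fellow")
gives the phrase ["post", "doctoral"] plus the word ["fellow"]; with
valid_punctuation="-" it gives ["postdoctoral"] and ["fellow"], which
is the current behaviour as I understand it.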

> For some people, punctuation has meaning. Let's say we have part
> numbers or dates. "3/24/03" isn't really the same as "32403" and
> I'm not sure the phrase search works well either.

Ah, yes.  All three parts (3, 24 and 03) would be too short to be
indexed on their own...  But isn't that what  extra_word_characters
is for?
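
For illustration, a toy word splitter in Python (not the real parser;
index_words is a made-up name, and I am assuming a minimum word length
of 3) shows the difference  extra_word_characters  would make:

```python
def index_words(text, extra_word_characters="", minimum_word_length=3):
    """Split text into words and drop those shorter than the minimum.

    Assumption: a word is a run of alphanumeric characters plus any
    characters listed in extra_word_characters.
    """
    words = []
    current = []
    for ch in text:
        if ch.isalnum() or ch in extra_word_characters:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:  # flush the last word
        words.append("".join(current))
    return [w for w in words if len(w) >= minimum_word_length]
```

Here index_words("shipped 3/24/03") keeps only "shipped", while
index_words("shipped 3/24/03", extra_word_characters="/") keeps
"shipped" and "3/24/03" as a single word.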

> > 4. Does anybody know what the existing external parsers do about
> > words less than the minimum length?
> I don't think most external parsers bother with the config file.


_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
