Greetings all,

I'm checking through phrase searching, and have found several possible 
bugs.  First, some questions...

1. Why do the documentation for  external_parser  and the comments 
before  Retriever::got_word  both say that the word location must be 
in the range 0-1000?   The HTML parser doesn't stick to that.  If 
locations are just scaled down (rather than reduced modulo 1001), 
that will break the phrase searches.  Is there a maximum in practice?

2. Every "meta" data entry (<title>, <meta ...> etc.) gets added as if 
it starts at location 0.  This gives *heaps* of false-positives, 
because the second word of *any* entry is deemed adjacent to the 
first word of any *other* entry.  Could we add "meta" information at 
successive locations starting from, say, location 10,000?

3. With phrase searching, do we still need  valid_punctuation?  For 
example, "post-doctoral" currently gets entered as three words at the 
*same* location:  "post", "doctoral" and "postdoctoral".  Would it be 
better to convert  post-doctoral  into the phrase "post doctoral" in 
queries, and simply store the words  post  and  doctoral  at 
successive locations in the database?  As it stands, a search for 
"the non-smoker" will match "the smoker", since all the parts of a 
hyphenated word are given the same position in the database, but a 
search for "the non smoker" won't match "the non-smoker".  Storing 
only the separate words would also reduce the size of the database 
(marginally in most cases, but significantly for pathological 
documents).  Now that there is phrase searching, is there any benefit 
to the current approach?

4. Does anybody know what the existing external parsers do about words 
shorter than the minimum length?  Because they are passed the 
configuration file, they *could* omit them.  Currently the HTML 
parser omits them, but that introduces false-positives into phrase 
queries, and I want to fix that.
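
To make questions 1 to 3 concrete, here is a minimal sketch in Python. 
It is a toy model, not ht://Dig code: the helpers  build_index, 
phrase_matches  and  successive  are hypothetical names, the 3000-word 
document is an invented figure, and I am assuming that a phrase 
matches when its words sit at consecutive locations.

```python
from collections import defaultdict


def build_index(postings):
    """Map each word to the set of locations at which it was recorded."""
    index = defaultdict(set)
    for word, location in postings:
        index[word].add(location)
    return index


def phrase_matches(index, phrase):
    """Return True if the words of `phrase` occur at consecutive locations."""
    words = phrase.split()
    return any(
        all(start + i in index.get(w, ()) for i, w in enumerate(words))
        for start in index.get(words[0], ())
    )


def successive(entries, start):
    """Place the words of every entry at successive locations from `start`."""
    words = [w for entry in entries for w in entry.split()]
    return [(w, start + i) for i, w in enumerate(words)]


# Question 1: squeeze the locations of a (hypothetical) 3000-word
# document into 0-1000, either by scaling or by reducing modulo 1001.
scaled = [loc * 1000 // 3000 for loc in (3, 4)]   # neighbours collapse
wrapped = [loc % 1001 for loc in (3, 4)]          # neighbours stay adjacent

# Question 2: two "meta" entries, each added as if it starts at location 0,
# versus the same entries at successive locations from 10,000.
entries = ["phrase searching", "broken links"]
meta_at_zero = build_index(
    (w, i) for entry in entries for i, w in enumerate(entry.split())
)
meta_successive = build_index(successive(entries, 10000))

# Question 3: "the non-smoker" as currently stored versus as proposed.
same_location = build_index(
    [("the", 0), ("non", 1), ("smoker", 1), ("nonsmoker", 1)]
)
split_words = build_index([("the", 0), ("non", 1), ("smoker", 2)])

print(scaled, wrapped)
print(phrase_matches(meta_at_zero, "broken searching"),
      phrase_matches(meta_successive, "broken searching"))
print(phrase_matches(same_location, "the smoker"),
      phrase_matches(same_location, "the non smoker"))
print(phrase_matches(split_words, "the smoker"),
      phrase_matches(split_words, "the non smoker"))
```

In this model, scaling puts two adjacent words at the same location, 
"broken searching" matches only while both meta entries start at 0, 
and "the smoker" stops matching "the non-smoker" once the parts are 
stored at successive locations.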

Thanks!
Lachlan


_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
