Re: [htdig] What is a word?

Geoff Hutchison Mon, 6 Sep 1999 11:26:40 -0700

At 2:37 PM +0100 9/3/99, David Adams wrote:
>1)     Documentation
>
>The ht://dig documentation is excellent, but could I suggest the
>following text to replace the "description" of valid_punctuation in the
>online documentation:

Suggestions for documentation are *always* welcome.

>prefix_match_character and explicitly placing in it valid_punctuation
>stops a "prefix" search from working.

This is already fixed in the 3.2 development code. We changed the way 
those characters were stripped from the query, in part because we 
added a regex fuzzy algorithm.

>The number of such words is relatively few: out of over 2 million
>entries in the wordlist file only 127 contain '(' and less than five
>hundred contain ',' or '.'.  All the entries for these words have a w:
>(weighting ?) of 49950 or larger.  I've searched for a few of these
>words and all occur in either META keywords or META contents, which are
>scored highly.  Could there be a bug specific to the processing of the
>text between <HEAD> and </HEAD>?

I think you're probably right. I don't think anything strips the 
valid_punctuation characters from words in that code path. Grr. 
Thanks for the heads up.
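
For illustration only, here is a minimal Python sketch of the step that 
seems to be missing. It assumes that valid_punctuation characters are 
meant to be removed from a word before it is stored. The function names 
and the punctuation set are hypothetical, and this is not the actual 
ht://Dig code.

```python
# Hypothetical punctuation set for illustration; the real value would come
# from the valid_punctuation attribute in the configuration file.
VALID_PUNCTUATION = ".-_/!#$%^&'"


def strip_valid_punctuation(word, punctuation=VALID_PUNCTUATION):
    """Return word with every valid_punctuation character removed."""
    return "".join(ch for ch in word if ch not in punctuation)


def index_meta_keywords(content, punctuation=VALID_PUNCTUATION):
    """Split a META keywords string and normalise each keyword the same
    way body text is assumed to be normalised."""
    words = []
    # Treat commas as separators, then split on whitespace.
    for raw in content.replace(",", " ").split():
        word = strip_valid_punctuation(raw.lower(), punctuation)
        if word:  # skip entries that were punctuation only
            words.append(word)
    return words
```

With this, a keywords string such as "Post-Doctoral, e.g., search" would 
be stored as postdoctoral, eg and search, with no '.' or ',' left in the 
wordlist entries.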

>is indexed as "'hello" and "there'".  Is there a way around this?

Hmm. This is a bit of a problem. On the one hand, the indexing in 
your example looks wrong. On the other hand, let's say you were 
indexing some mailing list archives for the GCC developers list. You 
want to index words like '__builtin', '#include' and company.

So we're stuck! The extra_word_chars attribute was added exactly for 
that purpose--to index the GCC mailing lists. So there's nothing 
stopping words from having these characters at the front.
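
To make the trade-off concrete, here is a rough Python sketch. It is 
not how ht://Dig splits words internally. split_words and trim_edges 
are hypothetical helpers, and the character sets passed in are example 
values only.

```python
import re


def split_words(text, extra_word_chars=""):
    """Split text into runs of alphanumerics plus any extra_word_chars."""
    pattern = "[A-Za-z0-9" + re.escape(extra_word_chars) + "]+"
    return re.findall(pattern, text)


def trim_edges(word, extra_word_chars):
    """Naive fix: drop extra_word_chars from both ends of a word."""
    return word.strip(extra_word_chars)


# With the apostrophe allowed in words, the quotes stick to the words.
quoted = split_words("'hello there'", "'")

# The same mechanism is what lets GCC-style tokens survive intact.
gcc = split_words("#include __builtin", "#_'")

# Trimming the edges repairs the first case but damages the second.
trimmed = [trim_edges(w, "#_'") for w in quoted + gcc]
```

Here quoted comes out as "'hello" and "there'", and gcc as "#include" 
and "__builtin". After trimming, the quotes are gone, but the GCC tokens 
are reduced to "include" and "builtin", which is exactly the conflict.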

Any suggestions?

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

------------------------------------
To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED] containing the single word unsubscribe in
the SUBJECT of the message.