>
> According to David Adams:
> > I am using htdig 3.1.2, and my config file includes:
> >
> > extra_word_characters: _
> > valid_punctuation: !@#$%^&*()-+|~=`{}[]:";'<>?,./
> >
> > I find that the word database built by htdig includes many words that
> > contain or end in a comma or other punctuation. For example:
> >
> > arts, i:2514 l:1 w:49950
> > assessed, i:2523 l:1 w:49950
> > atmospheric, i:2529 l:1 w:49950
> > b.sc, i:120 l:1 w:49950
> > b.sc, i:16406 l:1 w:49950
> > b.sc, i:16409 l:1 w:49950
> > b.sc, i:3039 l:1 w:49950
> > b.sc, i:3040 l:1 w:49950
> > b.sc, i:3041 l:1 w:49950
> > ba, i:17 l:1 w:49950
>
> I believe part of the problem may be the left quote (`) character
> in the list above, which is taken as the start of a file expansion
> (e.g. `filename`). As there's no file called "{}[]:";'<>?,./", the left
> quote and everything after it is lost from the valid_punctuation list.
> You'd need to escape the left quote with a backslash (\). The same
> thing goes for the dollar sign ($), only in this case it's just that
> one character that's lost.
>
> Still, that wouldn't explain why the comma and period get entered into
> the database. This would suggest that those characters were in the
> extra_word_characters list, or were erroneously treated as alphanumeric
> by your locale's LC_CTYPE tables.
>
> > Am I misunderstanding the documentation on "valid_punctuation"?
> >
> > I can't figure out how the configuration file attributes
> >
> > extra_word_characters
> > and
> > valid_punctuation
> >
> > work together. What happens when the same character is in both?
>
> The lists should not overlap, but if they do, I believe valid_punctuation
> overrides, so the overlapping characters do get stripped out of the word.
>
> Essentially, both lists indicate which punctuation marks or other
> characters can be used within a word, but the valid_punctuation characters
> get stripped out before the word is put in the database. E.g. words like
> post-doctoral and nuts&bolts go into the database as postdoctoral and
> nutsbolts, unless you move the hyphen or ampersand from valid_punctuation
> to extra_word_characters, in which case the characters stay in the word.
>
> Additionally, with the compound word patch I posted last week, and which
> will be in future releases, the word will be split up at places that
> have a non-alphanumeric character that's in valid_punctuation, but not
> in extra_word_characters. Thus, a word like post-doctoral will go into
> the database as postdoctoral, post and doctoral.
>
> > Why doesn't the documented list of default characters for
> > valid_punctuation include the question mark (?) and the doublequote (")?
>
> This is because these characters aren't commonly used within words,
> unlike apostrophes, ampersands, hyphens and slashes. Also, when you set
> allow_numbers to index numbers as words, these numbers may contain some of
> these characters: .-/#$% , and that's why they're in the default list.
> I don't know why _!^ are in the default list, but I suspect they may be
> used for indexing source code. If a given punctuation mark should ALWAYS
> separate words, it should not be added to this list.
>
> > What separates words, is it whitespace only?
>
> White space or any punctuation character (actually, any non-alphanumeric
> character) not listed in extra_word_characters or valid_punctuation.
>
> --
> Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
>
Thanks for your swift and full response, Gilles. I was certainly mistaken
as to the use of valid_punctuation.
I am left with four points:
1) Documentation
The ht://dig documentation is excellent, but could I suggest the
following text to replace the "description" of valid_punctuation in the
online documentation:
Any punctuation character (that is, any non-alphanumeric character,
see allow_numbers) not either in extra_word_characters or
valid_punctuation is treated the same as a space - it merely
acts as a word separator.
However, when a valid_punctuation character occurs within a word
it is removed leaving a single word.
For example, if the minus sign is in valid_punctuation, then the
word "post-war" will be indexed as "postwar", and a search for
either "post-war" or "postwar" will find it. However, if the minus
sign is not in valid_punctuation then "post-war" will result in
"post" and "war" being indexed instead.
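The rules in the proposed text can be modelled with a short Python sketch. This is only an illustration of the behaviour as described in this thread, not htdig's actual code; the function name index_words and its parameters are made up for the example, and lowercasing is assumed from the "andrew's" example in point 4.

```python
def index_words(text, extra_word_characters="", valid_punctuation=""):
    """Rough model of the word splitting described in this thread (not htdig code)."""
    extra = set(extra_word_characters)
    punct = set(valid_punctuation)
    words = []
    current = []

    def flush():
        # valid_punctuation characters are stripped before the word is stored;
        # if a character is in both lists, stripping wins, as Gilles suggests.
        word = "".join(ch for ch in current if ch not in punct).lower()
        if word:
            words.append(word)
        current.clear()

    for ch in text:
        if ch.isalnum() or ch in extra or ch in punct:
            current.append(ch)   # character may appear inside a word
        else:
            flush()              # anything else acts like a space
    flush()
    return words


print(index_words("post-war era", valid_punctuation="-"))      # hyphen stripped
print(index_words("post-war era"))                             # hyphen separates
print(index_words("post-war era", extra_word_characters="-"))  # hyphen kept
```

Under this model the three calls give "postwar", then "post" and "war", then "post-war", each followed by "era". The compound word patch mentioned above is not modelled.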
2) Characters in valid_punctuation
Not only should I have had \` and \$ in valid_punctuation, but I should
not have included the star (*) at all. This is the default
prefix_match_character, and explicitly placing it in valid_punctuation
stops a "prefix" search from working.
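As a small illustration of the escaping Gilles describes, the following Python sketch puts a backslash in front of the left quote and the dollar sign. The helper name escape_for_htdig_config is hypothetical, and it assumes the input value has not been escaped already.

```python
def escape_for_htdig_config(value, special="`$"):
    """Prefix each character in `special` with a backslash.

    Assumption: `value` is a raw, unescaped attribute value. The set of
    characters needing escape (` and $) is taken from this thread only.
    """
    return "".join("\\" + ch if ch in special else ch for ch in value)


# Build a valid_punctuation line from a raw character list.
print("valid_punctuation: " + escape_for_htdig_config("-^'`~"))
```

The printed line matches the valid_punctuation setting used in point 3 below.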
3) Non-alphanumeric characters in index
I am using an SGI system running IRIX 6.5, and the locale is:
LANG=POSIX
LC_COLLATE="C"
LC_CTYPE="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_MESSAGES="C"
LC_ALL=
and LC_CTYPE is charmap="ISO8859-1"
I have rebuilt our index with a revised config file using:
extra_word_characters: _/
valid_punctuation: -^'\`~
and I get just as many entries for words containing '.' or ',',
plus some words containing '(', for example.
I believe this points to a bug in htdig 3.1.2.
The number of such words is relatively few: out of over 2 million
entries in the wordlist file, only 127 contain '(' and fewer than five
hundred contain ',' or '.'. All the entries for these words have a w:
(weighting?) of 49950 or larger. I've searched for a few of these
words and all occur in either META keywords or META contents, which are
scored highly. Could there be a bug specific to the processing of the
text between <HEAD> and </HEAD>?
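A check of this kind can be scripted. The Python sketch below assumes each wordlist line has the layout of the sample entries quoted at the top of this message (a word followed by key:value fields such as i:, l: and w:); the function names are made up for the example, and other htdig versions may use a different layout.

```python
def parse_entry(line):
    """Split one wordlist line into (word, {key: int}); layout is assumed."""
    fields = line.split()
    if not fields:
        return None
    attrs = {}
    for field in fields[1:]:
        key, sep, value = field.partition(":")
        if sep and value.isdigit():
            attrs[key] = int(value)
    return fields[0], attrs


def entries_with_chars(lines, chars=".,("):
    """Return (word, weight) for every entry whose word contains one of `chars`."""
    hits = []
    for line in lines:
        parsed = parse_entry(line)
        if parsed is None:
            continue
        word, attrs = parsed
        if any(ch in chars for ch in word):
            hits.append((word, attrs.get("w")))
    return hits


sample = [
    "arts, i:2514 l:1 w:49950",
    "b.sc, i:120 l:1 w:49950",
    "postwar i:5 l:1 w:10",   # invented clean entry for contrast
]
for word, weight in entries_with_chars(sample):
    print(word, weight)
```

Run against the real wordlist file (for example by passing an open file object as lines), this lists the affected words with their weights, which makes it easy to see whether they all carry the high META weighting.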
4) extra_word_characters
If one wishes to index "Andrew's" as "andrew's" by including ' in
extra_word_characters then the text
'Hello there'
is indexed as "'hello" and "there'". Is there a way around this?
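The effect can be reproduced with a small Python model. The sketch below is not htdig code: split_words only imitates the described behaviour for ASCII text, and trim_edges is a purely hypothetical post-processing step, not an existing htdig option, shown only to make the question concrete.

```python
import re


def split_words(text, extra_word_characters="'"):
    """Model: a word is a run of ASCII alphanumerics plus the extra characters."""
    pattern = "[0-9A-Za-z" + re.escape(extra_word_characters) + "]+"
    return [w.lower() for w in re.findall(pattern, text)]


def trim_edges(words, chars="'"):
    """Hypothetical fix: drop the extra characters from the ends of each word."""
    trimmed = (w.strip(chars) for w in words)
    return [w for w in trimmed if w]


words = split_words("'Hello there' said Andrew's friend")
print(words)              # quotes stay attached to "hello" and "there"
print(trim_edges(words))  # inner apostrophe in "andrew's" survives
```

In this model, trimming only the leading and trailing apostrophes keeps "andrew's" intact while turning "'hello" and "there'" into plain words.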
--
David Adams
Computing Services
University of Southampton