Re: [htdig] What is a word?

David Adams Fri, 3 Sep 1999 05:42:44 -0700

> 
> According to David Adams:
> > I am using htdig 3.1.2, and my config file includes:
> > 
> > extra_word_characters:  _
> > valid_punctuation:      !@#$%^&*()-+|~=`{}[]:";'<>?,./
> > 
> > I find that the word database built by htdig includes many words that
> > contain or end in a comma or other punctuation. For example:
> > 
> > arts,   i:2514  l:1     w:49950
> > assessed,       i:2523  l:1     w:49950
> > atmospheric,    i:2529  l:1     w:49950
> > b.sc,   i:120   l:1     w:49950
> > b.sc,   i:16406 l:1     w:49950
> > b.sc,   i:16409 l:1     w:49950
> > b.sc,   i:3039  l:1     w:49950
> > b.sc,   i:3040  l:1     w:49950
> > b.sc,   i:3041  l:1     w:49950
> > ba,     i:17    l:1     w:49950
> 
> I believe part of the problem may be the left quote (`) character
> in the list above, which is taken as the start of a file expansion
> (e.g. `filename`).  As there's no file called "{}[]:";'<>?,./", the left
> quote and everything after it is lost from the valid_punctuation list.
> You'd need to escape the left quote with a backslash (\).  The same
> thing goes for the dollar sign ($), only in this case it's just that
> one character that's lost.
> 
> Still, that wouldn't explain why the comma and period get entered into
> the database.  This would suggest that those characters were in the
> extra_word_characters list, or were erroneously treated as alphanumeric
> by your locale's LC_CTYPE tables.
> 
> > Am I misunderstanding the documentation on "valid_punctuation"?
> > 
> > I can't figure out how the configuration file attributes 
> > 
> >     extra_word_characters 
> > and
> >     valid_punctuation 
> > 
> > work together.  What happens when the same character is in both?
> 
> The lists should not overlap, but if they do, I believe valid_punctuation
> overrides, so the overlapping characters do get stripped out of the word.
> 
> Essentially, both lists indicate which punctuation marks or other
> characters can be used within a word, but the valid_punctuation characters
> get stripped out before the word is put in the database.  E.g. words like
> post-doctoral and nuts&bolts go into the database as postdoctoral and
> nutsbolts, unless you move the hyphen or ampersand from valid_punctuation
> to extra_word_characters, in which case the characters stay in the word.
> 
> Additionally, with the compound word patch I posted last week, and which
> will be in future releases, the word will be split up at places that
> have a non-alphanumeric character that's in valid_punctuation, but not
> in extra_word_characters.  Thus, a word like post-doctoral will go into
> the database as postdoctoral, post and doctoral.
> 
> > Why doesn't the documented list of default characters for
> > valid_punctuation include the question mark (?) and the doublequote (")?
> 
> This is because these characters aren't commonly used within words,
> unlike apostrophes, ampersands, hyphens and slashes.  Also, when you set
> allow_numbers to index numbers as words, these numbers may contain some of
> these characters:  .-/#$% , and that's why they're in the default list.
> I don't know why _!^ are in the default list, but I suspect they may be
> used for indexing source code.  If a given punctuation mark should ALWAYS
> separate words, it should not be added to this list.
> 
> > What separates words, is it whitespace only?
> 
> White space or any punctuation character (actually, any non-alphanumeric
> character) not listed in extra_word_characters or valid_punctuation.
> 
> -- 
> Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
> Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
> Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
> 

Thanks for your swift and full response, Gilles.  I was certainly mistaken
as to the use of valid_punctuation.

I am left with four points:


1)      Documentation

The ht://dig documentation is excellent, but could I suggest the
following text to replace the "description" of valid_punctuation in the
online documentation:

        Any punctuation character (that is, any non-alphanumeric character,
        see allow_numbers) that is in neither extra_word_characters nor
        valid_punctuation is treated the same as a space: it merely
        acts as a word separator.

        However, when a valid_punctuation character occurs within a word,
        it is removed, leaving a single word.

        For example, if the minus sign is in valid_punctuation, then the
        word "post-war" will be indexed as "postwar", and a search for
        either "post-war" or "postwar" will find it.  However, if the minus
        sign is not in valid_punctuation then "post-war" will result in
        "post" and "war" being indexed instead.


2)      Characters in valid_punctuation

Not only should I have had \` and \$ in valid_punctuation, but I should
not have included the star (*) at all.  This is the default
prefix_match_character, and explicitly placing it in valid_punctuation
stops a "prefix" search from working.


3)      Non-alphanumeric characters in index

I am using an SGI system running IRIX 6.5, and the locale is:
LANG=POSIX
LC_COLLATE="C"
LC_CTYPE="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_MESSAGES="C"
LC_ALL=

and LC_CTYPE is charmap="ISO8859-1"

I have rebuilt our index with a revised config file using:

extra_word_characters:  _/
valid_punctuation:      -^'\`~

and I get just as many entries for words containing '.' or ',',
plus some words containing '(', for example.

I believe this points to a bug in htdig 3.1.2.

The number of such words is relatively small: out of over 2 million
entries in the wordlist file only 127 contain '(' and fewer than five
hundred contain ',' or '.'.  All the entries for these words have a w:
(weighting?) of 49950 or larger.  I've searched for a few of these
words and all occur in either META keywords or META contents, which are
scored highly.  Could there be a bug specific to the processing of the
text between <HEAD> and </HEAD>?
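
For anyone wanting to repeat this kind of count, here is a minimal Python
sketch.  It assumes the wordlist is a text file with one entry per line in
the form shown earlier in this thread (word first, then whitespace-separated
i:, l: and w: fields).  The function name and the sample lines are only
illustrative, and this is not part of htdig itself.

```python
def count_punctuated_words(lines, chars=",.("):
    """Count entries whose word contains each character in `chars`.

    Returns (counts, min_weight), where min_weight is the smallest w:
    value seen among the matching entries (None if nothing matched).
    Assumes the word is the first whitespace-separated field on a line.
    """
    counts = {c: 0 for c in chars}
    min_weight = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word = fields[0]
        matched = False
        for c in chars:
            if c in word:
                counts[c] += 1
                matched = True
        if not matched:
            continue
        # Track the lowest weight among entries that contain punctuation.
        for field in fields[1:]:
            if field.startswith("w:") and field[2:].isdigit():
                weight = int(field[2:])
                if min_weight is None or weight < min_weight:
                    min_weight = weight
    return counts, min_weight


# A few entries copied from the example above, plus one made-up clean entry.
sample = [
    "arts,\ti:2514\tl:1\tw:49950",
    "b.sc,\ti:120\tl:1\tw:49950",
    "ba,\ti:17\tl:1\tw:49950",
    "arts\ti:1\tl:1\tw:100",
]
counts, min_weight = count_punctuated_words(sample)
print(counts, min_weight)
```

To run it over a real file, pass an open file object as `lines`.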
        

4)      extra_word_characters

If one wishes to index "Andrew's" as "andrew's" by including ' in
extra_word_characters then the text

        'Hello there'

is indexed as "'hello" and "there'".  Is there a way around this?
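
To make the behaviour in points 1 and 4 concrete, here is a small Python
model of the rule as I now understand it from Gilles's explanation.  It is
only a sketch of the described behaviour, not htdig's actual code: the
function name is hypothetical, the compound word patch is not modelled,
and a character that appears in both lists is assumed to be stripped, as
Gilles believes.

```python
def index_words(text, extra_word_characters="_", valid_punctuation="-"):
    """Split text into index words following the rule described above.

    - Alphanumerics, extra_word_characters and valid_punctuation may
      all appear inside a word.
    - Any other character acts as a word separator.
    - valid_punctuation characters are removed before the word is stored.
    - Words are stored in lower case.
    """
    raw_words = []
    current = []
    for ch in text:
        if ch.isalnum() or ch in extra_word_characters or ch in valid_punctuation:
            current.append(ch)
        elif current:
            # Separator: close off the word collected so far.
            raw_words.append("".join(current))
            current = []
    if current:
        raw_words.append("".join(current))

    words = []
    for word in raw_words:
        stripped = "".join(c for c in word if c not in valid_punctuation)
        if stripped:
            words.append(stripped.lower())
    return words


# Hyphen in valid_punctuation: one word with the hyphen removed.
print(index_words("post-war", valid_punctuation="-"))
# Hyphen in neither list: it separates two words.
print(index_words("post-war", valid_punctuation=""))
# Apostrophe in extra_word_characters: the quotes stay attached.
print(index_words("'Hello there'", extra_word_characters="'", valid_punctuation=""))
```

The last call shows the problem in point 4: the model cannot tell an
apostrophe inside a word from a quotation mark around a phrase.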

-- 
 
David Adams
Computing Services
University of Southampton
