On Thu, 23 Sep 1999, J. op den Brouw ([EMAIL PROTECTED]) wrote:
> Some questions
>
> On Wed, 22 Sep 1999, Geoff Hutchison wrote:
>
> > * When indexing, htdig should now attempt to index compound words as
> >   separate words in addition to a compound word. For example,
> >   "pdf_parser" would also be indexed as "pdf" and "parser."
> > * Once again, thanks to everyone who reported bugs and bug fixes.
>
> How does htdig know that the _ is a word splitter, and is a . (dot) also
> a word splitter.....
>
> The valid_punctuation removes these characters from a word, is it not?

My compound word fix uses any character in valid_punctuation as a word
separator. It does this before the punctuation is stripped out of the
words. When it encounters a compound word, with the parts separated by
valid punctuation characters, it puts the entire word in the database, as
it did before, but now it also adds all combinations of adjacent parts. Of
course, it strips off all punctuation before adding any word or part to
the database.

Here's the write-up I had when I first posted the patch to the list:

This patch improves htdig's handling of compound words, like post-doctoral
and such, to add each individual part, as well as the whole, into the word
database. This allows searches for individual parts, like "doctoral", to
find those parts in hyphenated (or otherwise punctuated) compound words.
It should also fix the problem with "d'" in French text. The code seems
quite convoluted because it's designed to handle all the combinations of
parts in multi-hyphen-compound-words.

To expand on that last example, here are the words it'll add to the
database (which will get truncated to maximum_word_length characters):

    multihyphencompoundwords
    multi hyphen compound words
    multihyphen hyphencompound compoundwords
    multihyphencompound hyphencompoundwords

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
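
For anyone who wants to experiment with the combination scheme described
above, here is a minimal sketch in Python. It is not the htdig patch itself
(htdig is written in C++), only an illustration of "the whole word plus
every run of adjacent parts". The function name compound_word_entries and
the two constants are made up for this example, and their values are
assumptions, not htdig's defaults; the real values come from the
valid_punctuation and maximum_word_length configuration attributes.

```python
import re

# Assumed, illustrative settings; in htdig these come from the
# valid_punctuation and maximum_word_length configuration attributes.
VALID_PUNCTUATION = "-_.'"
MAXIMUM_WORD_LENGTH = 32


def compound_word_entries(word, punctuation=VALID_PUNCTUATION,
                          max_len=MAXIMUM_WORD_LENGTH):
    """Return the whole word plus every run of adjacent parts,
    with all punctuation removed."""
    # Split on the punctuation first, before anything is stripped.
    splitter = "[" + re.escape(punctuation) + "]+"
    parts = [p for p in re.split(splitter, word) if p]

    entries = []
    if parts:
        # The entire word with punctuation stripped, as before the fix.
        entries.append("".join(parts))
    # Runs of 1, 2, ..., len(parts) - 1 adjacent parts, shortest first.
    for run in range(1, len(parts)):
        for start in range(len(parts) - run + 1):
            entries.append("".join(parts[start:start + run]))

    # Truncate to the maximum word length and drop duplicates, keeping order.
    result = []
    for entry in entries:
        entry = entry[:max_len]
        if entry not in result:
            result.append(entry)
    return result


if __name__ == "__main__":
    for entry in compound_word_entries("multi-hyphen-compound-words"):
        print(entry)
```

With these assumed settings, "multi-hyphen-compound-words" yields the same
ten entries as the list above, and "pdf_parser" yields pdfparser, pdf and
parser. A word of n parts produces n(n+1)/2 entries before truncation, which
is why long compounds add noticeably more words to the database.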