[htdig] how to handle searches for specific strings like "c++" that are important as a whole, but individually meaningless noise words

Jeff Skubick Tue, 20 Jul 1999 10:52:44 -0700

I'm currently porting the back-end software for our online career boards from NT/IIS/Index Server to Linux/Apache/ht://dig, and I have just run into a potentially huge problem. Because it's a career board, users often search for the specific string "c++". However, ht://dig seems to interpret it as "C and and", which results in an error. Even without the error it wouldn't find anything, since we set the minimum word length to 2 (primarily to allow searches for "NT") and "c" on its own is only one character.

The "best" solution would probably be a way in ht://dig to list specific strings that should be indexed even though they would ordinarily fall through the cracks ("c++" and "AT&T" immediately come to mind as good examples). Maybe an analogue to the bad_word_list parameter, called good_word_list?

I thought of a possible work-around (listed below) that I could implement without too much trouble, but I'd prefer to avoid doing it if it's not necessary because it would be a REALLY ugly kludge that would probably double or triple the time it takes to run htdig:

1. Modify parsedoc to replace strings that htdig can't index with ones it can (for example, cplusplus for c++ or ATampT for AT&T) before passing the ASCII text along to htdig for indexing.

2. Write another program to receive the form data that would normally be sent to htsearch, perform the same substitutions on the user-entered search terms, call htsearch itself using LWP, and pass htsearch's response back to the user. (The excerpts would show the messy approximations, but the "real" document would be intact.)
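
To make the substitution idea concrete, here is a minimal sketch of the shared rewriting step that both halves of the work-around would need. It is written in Python purely for illustration (the post itself has parsedoc and LWP, so Perl, in mind). The table contents come from the examples above; the names SUBSTITUTIONS and substitute are hypothetical, and nothing here is part of ht://dig.

```python
import re

# Hypothetical table of strings the indexer mangles, mapped to
# index-safe stand-ins (the two examples from the post).
SUBSTITUTIONS = {
    "c++": "cplusplus",
    "at&t": "ATampT",
}

# One case-insensitive pattern, longest keys first so that a longer
# key is never shadowed by a shorter one that is its prefix.
_PATTERN = re.compile(
    "|".join(
        re.escape(key)
        for key in sorted(SUBSTITUTIONS, key=len, reverse=True)
    ),
    re.IGNORECASE,
)


def substitute(text: str) -> str:
    """Replace every known problem string with its stand-in."""
    return _PATTERN.sub(
        lambda match: SUBSTITUTIONS[match.group(0).lower()], text
    )


if __name__ == "__main__":
    # The same function would run on document text before indexing
    # and on the user's query before it is handed to the search CGI.
    print(substitute("Senior C++ developer, formerly at AT&T"))
    print(substitute("c++"))
```

Applying the identical function on both the indexing side and the query side is what keeps the two in agreement. Note that this sketch does no word-boundary checking, so a string such as "abc++" would also be rewritten.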

Thanks for your help!

Jeff Skubick

Senior Developer, CNI Career Networks

[EMAIL PROTECTED]

(www.careernet.com, www.hrsmart.com, www.cnijoblink.com)
