I'm currently porting the back-end software for our online career boards from
NT/IIS/Index Server to Linux/Apache/ht://dig and have just run into a
potentially huge problem. Since this is a career board, users often search for
the specific string "c++". However, ht://dig seems to interpret it as
"C and and", which results in an error. It wouldn't find anything anyway, since
we set the minimum word length to 2 (primarily to allow searches for "NT") and
a bare "c" falls below that.
Probably the "best" solution would be a way in ht://dig to list specific
strings that should be indexed even though they would ordinarily fall through
the cracks ("c++" and "AT&T" immediately come to mind as good examples). Maybe
an analogue to the bad_word_list parameter, called good_word_list?
I thought of a possible work-around (listed below) that I could implement
without too much trouble, but I'd prefer to avoid it if it's not necessary,
because it would be a REALLY ugly kludge that would probably double or triple
the time it takes to run htdig:
1. Modify parsedoc to substitute strings that htdig CAN index for those that it
doesn't like (such as cplusplus for c++ or ATampT for AT&T) before passing the
ASCII text along to htdig for indexing.
2. Write another program to receive the form data that would normally be sent
to htsearch, perform the same substitutions on the user-entered search terms,
call htsearch itself using LWP, and pass htsearch's response back to the user.
(The excerpts would show the messy approximations, but the "real" document
would be intact.)
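
To make the idea concrete, here is a rough sketch of the substitution that both steps would share. It is written in Python purely for illustration (parsedoc is a Perl script, and the real wrapper would use LWP), and the names SUBSTITUTIONS and substitute_terms are hypothetical, not part of ht://dig. It assumes matching should be case-insensitive and that a term counts only when it is not glued to other letters or digits:

```python
import re

# Hypothetical table: strings htdig mangles -> placeholder tokens it can index.
# Keys are lower-case because matching below is case-insensitive.
SUBSTITUTIONS = {
    "c++": "cplusplus",
    "at&t": "ATampT",
}

# Try longer keys first so a longer term wins over a shorter one that
# shares its prefix. The lookarounds keep a term from matching inside a
# larger word (assumption: "abc++" should NOT become "abcplusplus").
_PATTERN = re.compile(
    r"(?<!\w)(?:"
    + "|".join(
        re.escape(key)
        for key in sorted(SUBSTITUTIONS, key=len, reverse=True)
    )
    + r")(?!\w)",
    re.IGNORECASE,
)


def substitute_terms(text):
    """Replace every listed term in text with its placeholder token."""
    return _PATTERN.sub(
        lambda match: SUBSTITUTIONS[match.group(0).lower()], text
    )


if __name__ == "__main__":
    # The same call would run on document text before indexing (step 1)
    # and on the user's query before it is handed to htsearch (step 2).
    print(substitute_terms("Senior C++ developer at AT&T"))
```

Applying one function to both the indexed text and the query is what keeps the two sides consistent; if the tables ever diverged, searches for the substituted terms would silently stop matching.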
Thanks for your help!
Jeff Skubick
Senior Developer, CNI Career Networks