According to Charles Dill:
> I'm creating a new site for a client who wishes to implement a search
> engine. The client's current site is running on an Apache server with
> htdig already installed. The new site will replace the old one on that
> server.
>
> Since I've yet to use htdig, or implement a search engine for that
> matter, I was hoping I could get some help. Basically I'm wondering what
> I'll need to do, if anything, to the individual web pages in preparation
> for the search implementation and configuration.

In many, or even most, cases you don't need to do anything to the web
pages to prep them for indexing. htdig parses standard HTML to get the
words into the index, and follows HTML links to other documents to get
at the whole linked collection of documents on your site. Things to
look out for are:

1) htdig doesn't parse JavaScript code, so if your navigation is done
   that way, you may need to find another way to get htdig to know what
   documents to parse. See http://www.htdig.org/FAQ.html#q5.18

2) htdig uses the text between <title> and </title> tags in displaying
   search results, and by default gives words in there a higher score
   in the results. Make sure your documents use these appropriately.

3) htdig can make use of meta keywords and meta description tags, so
   you may want to use these too.

4) htdig can parse PDFs and other document types with external
   converters or parsers. See http://www.htdig.org/FAQ.html#q4.8 and
   4.9 to see how to set these up.

> Also will I need access to
> any files that may not reside on root in order to configure the
> searcher? The web hosting company wasn't much help and simply stated
> they want me to use htdig since they are already using it.

I'm not quite sure what you mean by this. If you have to use htdig and
htsearch as the hosting company set them up, then you should ask for
any documentation they have that deals with the specifics of
configuring it on their site. If they're simply recommending htdig, but
you're going to set up your own copy of it, then you can set it up so
that nothing is root-dependent.

The main configuration issue that makes the package dependent on a
specific directory or set of directories is where htsearch's CONFIG_DIR
is located, i.e. in which directory it looks for configuration files.
Apart from that, everything else can pretty easily be defined or
overridden on the command line or in your own configuration file. See
http://www.htdig.org/FAQ.html#q5.30, 5.32 and 4.2 for more information.
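
As a rough aid for points 2 and 3 in the list above, here is a small
Python sketch that reports pages lacking a <title> or the keywords and
description meta tags. It is not part of htdig and says nothing about
how htdig itself parses or scores pages; the names IndexHintParser and
check_page are made up for this illustration, and it only uses Python's
standard html.parser module.

```python
from html.parser import HTMLParser


class IndexHintParser(HTMLParser):
    """Collect the <title> text and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # The parser hands us tag and attribute names in lower case.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text seen between <title> and </title> is kept.
        if self._in_title:
            self.title += data


def check_page(html_text):
    """Return a list of things that look missing from one HTML page."""
    parser = IndexHintParser()
    parser.feed(html_text)
    parser.close()
    problems = []
    if not parser.title.strip():
        problems.append("missing or empty <title>")
    for name in ("keywords", "description"):
        if not parser.meta.get(name, "").strip():
            problems.append("no meta " + name)
    return problems
```

To use it, read each page of the new site into a string and pass it to
check_page(); an empty list means the title and both meta tags were
found.
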
FAQ question 4.18 may also provide useful pointers.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html

