On Fri, 27 Aug 2004 [EMAIL PROTECTED] wrote:
Just did my first run of rundig and noticed that the search results were bringing up duplicates like:
http://www.digitalhit.com/cr/reneezellweger
http://www.digitalhit.com/cr/reneezellweger/
and
http://www.digitalhit.com/academy/73/index.shtml
http://www.digitalhit.com/academy/73/
Any way to eliminate or weed those out?
For the second case, take a look at the following.
http://www.htdig.org/attrs.html#remove_default_doc
This attribute allows you to specify that index.shtml is to be treated as a default document. Once you do that (and reindex), the index.shtml should be stripped from each URL before the request is made, so both forms collapse to the same URL. That should eliminate the duplication.
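
To make the effect concrete, here is a rough Python sketch of the kind of normalization described above. It is not htdig's code: the function name strip_default_doc and the DEFAULT_DOCS tuple are made up for illustration and stand in for whatever names you list in remove_default_doc.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical stand-in for the file names listed in remove_default_doc.
DEFAULT_DOCS = ("index.html", "index.shtml")


def strip_default_doc(url, default_docs=DEFAULT_DOCS):
    """Return url with a trailing default document name removed."""
    parts = urlsplit(url)
    # Split the path into everything before the last "/" and the final segment.
    head, sep, tail = parts.path.rpartition("/")
    if sep and tail in default_docs:
        # Keep the directory part, including its trailing slash.
        parts = parts._replace(path=head + "/")
    return urlunsplit(parts)


urls = [
    "http://www.digitalhit.com/academy/73/index.shtml",
    "http://www.digitalhit.com/academy/73/",
]

# After normalization both forms are the same string, so a set keeps one.
unique = sorted({strip_default_doc(u) for u in urls})
print(unique)
```

With the two URLs from the original message, both normalize to the directory form, so only one entry remains.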
For the first case, I am not certain what is happening. I suspect there is an issue with the way the web server is configured. Typically a web server will respond with some sort of "moved" status code (e.g. 301) and a pointer to a new location when a URL ending with a directory name is provided without a trailing slash. For example, a request for
http://www.digitalhit.com/cr/reneezellweger
should result in a moved status code and a new location of
http://www.digitalhit.com/cr/reneezellweger/
htdig will drop the first due to the returned status code and then try to request the second. If in your case both are being indexed, the most likely cause is that the web server is configured in a non-standard way (e.g. special rewrite rules) and is returning the same document for both cases.
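
If you want to check what the server actually returns for the slashless URL, something along these lines might help. It is only a sketch: probe and classify are names invented here, it assumes plain HTTP, and it says nothing about how htdig itself handles the response. Python's http.client does not follow redirects, so the raw status code and Location header are visible.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def classify(status, location, requested_url):
    """Describe the response to a request for a directory URL without a slash."""
    if status in (301, 302) and location is not None:
        # Location may be relative, so resolve it against the requested URL.
        if urljoin(requested_url, location) == requested_url + "/":
            return "redirect to trailing slash"
    if status == 200:
        return "served directly (possible duplicate)"
    return "other"


def probe(url):
    """Send a HEAD request without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


# Example use (performs a real network request, so it is left commented out):
# url = "http://www.digitalhit.com/cr/reneezellweger"
# status, location = probe(url)
# print(status, location, classify(status, location, url))
```

A result of "served directly" for the slashless URL would be consistent with the non-standard server configuration suggested above.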
Jim

