On Fri, 27 Aug 2004 [EMAIL PROTECTED] wrote:

I just did my first run of rundig and noticed that the search results were
bringing up duplicates like:

http://www.digitalhit.com/cr/reneezellweger
http://www.digitalhit.com/cr/reneezellweger/

and

http://www.digitalhit.com/academy/73/index.shtml
http://www.digitalhit.com/academy/73/

Any way to eliminate or weed those out?

For the second case, take a look at the following.

  http://www.htdig.org/attrs.html#remove_default_doc

This attribute allows you to specify that index.shtml is to be treated as a default document. Once you do that (and reindex), htdig should strip a trailing index.shtml from each URL before making the request, so both forms resolve to the same URL. That should eliminate the duplication.
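
A minimal sketch of the entry in the htdig configuration file, assuming the stock htdig.conf that rundig reads. Keeping index.html in the list alongside index.shtml is my assumption about which default documents the site serves, so adjust the names as needed:

  remove_default_doc: index.html index.shtml

The attribute takes a whitespace-separated list of file names, and the index has to be rebuilt before the change shows up in search results.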

For the first case, I am not certain what is happening. I suspect there is an issue with the way the web server is configured. Typically a web server will respond with some sort of "moved" status code (e.g. 301) and a pointer to a new location when a URL ending with a directory name is provided without a trailing slash. For example, a request for

  http://www.digitalhit.com/cr/reneezellweger

should result in a moved status code and a new location of

  http://www.digitalhit.com/cr/reneezellweger/

htdig will drop the first due to the returned status code and then try
to request the second. If in your case both are being indexed, the most
likely cause is that the web server is configured in a non-standard way
(e.g. special rewrite rules) and is returning the same document for both
cases.

Jim

