I would have thought that the example that you give below should have been handled by the http://www.htdig.org/attrs.html#remove_default_doc setting. Have you looked into that?
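
For reference, a minimal sketch of what that might look like in the config file. The file names listed here are an assumption about what your web server treats as directory index documents, so substitute whatever it actually uses. As far as I know the attribute strips a trailing default document name from URLs as htdig indexes them, so you would need to re-dig before seeing any effect.

```
# htdig.conf fragment (sketch only; the file names are assumed,
# match them to the server's own directory index setting)
remove_default_doc: index.html index.htm
```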
As for the other part, if you know what the aliases are on the server (can you copy them from a config file?) then you can probably use the http://www.htdig.org/attrs.html#server_aliases setting.

Mike

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf
> Of Dennis Watson
> Sent: 28 June 2005 22:58
> To: '[email protected]'
> Subject: [htdig] Eliminating Duplicate Search Results
>
> Hello All,
>
> I am using HTDig 3.1.6 on a large web site that has many aliases for
> pages, so different URLs point to the same content. This is causing
> duplicate search results since HTDig is using the URL as the unique id.
> People are also not consistent with how they write URLs, so
> http://www.military.com/spouse and http://www.military.com/spouse/
> (note trailing slash) are coming up as different results as well.
>
> I have tried a few different things like search_rewrite_rules
> ( search_rewrite_rules: http://(.*)/$ http://\\1 ), but the regex was
> too greedy and htsearch displayed duplicate results anyway. My next
> guess is url_rewrite_rules, but I am unsure how to write the regexes
> and if htsearch will dedupe results with the same URL after rewriting.
>
> How can I get htsearch to rewrite these URLs and dedupe the ones that
> end up being the same? Some of the URLs are very ugly and would require
> complex regexes. If I cannot do it within the HTDIG framework, I may
> have to htdump indexes created by htdig, post-process the dump files
> with a perl script that munges the URLs as needed, and then load and
> merge the new indexes. If that is not possible I may have to munge the
> search results on the fly and not display the dupes (ugh!)
>
> Dennis Watson [EMAIL PROTECTED]
> UNIX System Administrator Military.com
>
> _______________________________________________
> ht://Dig general mailing list: <[email protected]>
> ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-general
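
To make the server_aliases suggestion in the reply a bit more concrete, here is a rough sketch. The alias host names are purely hypothetical (the thread does not say what the real aliases are), and only www.military.com comes from the original message. Note that, as far as I understand it, this attribute maps one host:port onto another, so it should help where the same pages are reachable under several host names, but it would not by itself merge different paths on the same host.

```
# htdig.conf fragment (sketch only)
# Format is alias:port=canonical:port, entries separated by whitespace.
# "military.com" and "web1.military.com" are made-up alias names used
# for illustration; replace them with the aliases your server answers to.
server_aliases: military.com:80=www.military.com:80 web1.military.com:80=www.military.com:80
```

The ports are written out explicitly here to be on the safe side. As with other attributes that affect how URLs are stored, a fresh dig would be needed for the change to show up in search results.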

