Hi, I'm hoping to switch from another search engine (which doesn't scale very well) to htdig. There are a couple of things I could do in my old search engine that I'm not sure I can do in htdig:
1. The site to index has many URLs that are distinguished mainly by CGI parameters (e.g. archive.html?category=lessons&article=135). One article can appear in several categories. Is it possible to index only the first page the crawler finds for article 135, then fool it into thinking that all other URLs with the same article id (e.g. archive.html?category=teaching&article=135) have the same URL, so that the same article doesn't get indexed repeatedly?

2. The pages have a lot of boilerplate, which shouldn't be indexed. The content that should be indexed begins after the comment <!-- content start --> and ends with the comment <!-- content end -->. Is it possible to get htdig to recognise these comments? This one looks like it shouldn't be too hard to do by tweaking HTML.cc, but maybe it can be done in the configuration files?

Thanks for any advice,
Graham

_______________________________________________
htdig-general mailing list
FAQ: http://htdig.sourceforge.net/FAQ.html
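
The two behaviours asked about above can be sketched in Python as follows. This is only an illustration of the desired result, not htdig code or configuration. The function names (`article_key`, `extract_content`, `pages_to_index`) are hypothetical, and the sketch assumes the article id is always carried in the `article` query parameter, as in the example URLs.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Comment markers quoted from the message above.
CONTENT_START = "<!-- content start -->"
CONTENT_END = "<!-- content end -->"


def article_key(url):
    """Return a key that is identical for every URL showing the same article.

    Assumption: the article id lives in the `article` query parameter.
    """
    parts = urlsplit(url)
    article = parse_qs(parts.query).get("article")
    if not article:
        return url  # no article id: treat the URL as unique
    return f"{parts.path}?article={article[0]}"


def extract_content(html):
    """Return only the text between the start and end comments."""
    pattern = re.escape(CONTENT_START) + r"(.*?)" + re.escape(CONTENT_END)
    return "".join(re.findall(pattern, html, flags=re.DOTALL))


def pages_to_index(pages):
    """Yield (url, content) for the first page seen per article.

    `pages` is an iterable of (url, html) pairs in crawl order.
    """
    seen = set()
    for url, html in pages:
        key = article_key(url)
        if key in seen:
            continue  # same article reached through another category
        seen.add(key)
        yield url, extract_content(html)
```

With this sketch, archive.html?category=lessons&article=135 and archive.html?category=teaching&article=135 map to the same key, so only whichever is crawled first would be indexed, and only the text between the two comments would be kept.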

