According to Graham Seaman:
> I'm hoping to change from another search engine (which doesn't scale
> so well) to htdig.
> There are a couple of things I could do in my old search engine that
> I'm not sure if I can do or not in htdig:
>
> 1. The site to index has many urls mainly distinguished by cgi parameters
> (eg. archive.html?category=lessons&article=135)
> One article can appear in several categories; is it possible simply to
> index the first page the crawler finds for article135, then fool it
> into thinking that all other urls with the same article id
> (eg. archive.html?category=teaching&article=135) have the same url, so the
> same article doesn't get indexed repeatedly?

Yes, you can rewrite parts of URLs to strip out or standardize certain
URL parameters. See http://www.htdig.org/attrs.html#url_rewrite_rules

> 2. The pages have a lot of boiler plate, which shouldn't be indexed.
> Content which should be indexed begins after a comment <!-- content start
> --> and ends with the comment <!-- content end --> Is it possible to get
> htdig to recognise the comments (this one looks like it shouldn't be too
> hard to do by tweaking HTML.cc, but maybe it can be done in the
> configuration files?)

There are some tricks described in http://www.htdig.org/FAQ.html#q4.15
which may help. The problem is you can only have one set of values for
noindex_start and noindex_end, so it may be a bit tricky to configure
it for the specific strings you're using.

You can also run all HTML through an external converter before parsing
it, which would allow you to use perl or sed to strip out all sorts of
things from the HTML. See the examples in
http://www.htdig.org/files/contrib/scripts/, specifically
README.geoupdate-ungeoify and ungeoify.sh, as well as
http://www.htdig.org/files/contrib/parsers/unhypermail.sh.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
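
As a rough illustration of the two ideas in the reply above (collapsing URLs that differ only in the category parameter, and keeping only the text between the content markers), here is a minimal Python sketch. It is not htdig code or htdig configuration: the function names canonical_url and extract_content are hypothetical, the parameter name "category" and the marker strings are taken from the examples in the question, and the fallback of returning the page unchanged when no markers are found is just one possible choice.

```python
import re

# Markers quoted in the question. Whitespace inside the comments is
# matched loosely because the comment may be wrapped across lines.
CONTENT_RE = re.compile(
    r"<!--\s*content start\s*-->(.*?)<!--\s*content end\s*-->",
    re.DOTALL | re.IGNORECASE,
)

# Matches a "category=..." parameter together with the "?" or "&" before it.
CATEGORY_RE = re.compile(r"([?&])category=[^&]*&?")


def canonical_url(url: str) -> str:
    """Drop the category parameter so every view of an article shares one URL.

    Illustration only: this mimics the kind of regex substitution a URL
    rewrite rule would perform; it is not htdig syntax.
    """
    return CATEGORY_RE.sub(r"\1", url).rstrip("?&")


def extract_content(html: str) -> str:
    """Return only the text between the content markers.

    Assumption: if a page has no markers, it is passed through unchanged.
    """
    parts = CONTENT_RE.findall(html)
    return "\n".join(parts) if parts else html
```

With these helpers, canonical_url("archive.html?category=teaching&article=135") and canonical_url("archive.html?category=lessons&article=135") both give "archive.html?article=135". A converter script built around extract_content would read the page it is handed and write the result to standard output; note that this sketch keeps only the marked regions, so anything outside them (including the page title) is dropped.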

