According to Graham Seaman:
> I'm hoping to change from another search engine (which doesn't scale
> so well) to htdig.
> There are a couple of things I could do in my old search engine that
> I'm not sure if I can do or not in htdig:
> 
> 1. The site to index has many urls mainly distinguished by cgi parameters
> (eg. archive.html?category=lessons&article=135)
> One article can appear in several categories; is it possible simply to
> index the first page the crawler finds for article135, then fool it
> into thinking that all other urls with the same article id
> (eg. archive.html?category=teaching&article=135) have the same url, so the
> same article doesn't get indexed repeatedly?

Yes, you can rewrite parts of URLs to strip out or standardize certain
URL parameters.  See http://www.htdig.org/attrs.html#url_rewrite_rules
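
To make the idea concrete, here is a rough Python sketch of the kind of
regex substitution such a rule would express.  It is not htdig
configuration syntax (see the url_rewrite_rules documentation for that);
the pattern and the canonical_url name are hypothetical, and it assumes
the article can be identified with the category parameter removed.

```python
import re

# Hypothetical pattern: match a "category=..." parameter (and the "&" after
# it, if any) only when it directly follows "?" or "&".
CATEGORY_PARAM = re.compile(r"(?<=[?&])category=[^&]*&?")


def canonical_url(url: str) -> str:
    """Drop the category parameter so all URLs for one article collapse."""
    rewritten = CATEGORY_PARAM.sub("", url)
    # Tidy up a dangling "?" or "&" left when category was the last parameter.
    return rewritten.rstrip("?&")


# Both category variants of article 135 map to the same URL.
print(canonical_url("archive.html?category=lessons&article=135"))
print(canonical_url("archive.html?category=teaching&article=135"))
```

One thing to check before relying on this approach: the rewritten URL
presumably still has to resolve on your server, so archive.html would
need to serve the article when no category is given.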

> 2. The pages have a lot of boiler plate, which shouldn't be indexed.
> Content which should be indexed begins after a comment <!-- content start
> --> and ends with the comment <!-- content end --> Is it possible to get
> htdig to recognise the comments (this one looks like it shouldn't be too
> hard to do by tweaking HTML.cc, but maybe it can be done in the
> configuration files?)

There are some tricks described in http://www.htdig.org/FAQ.html#q4.15
which may help.  The catch is that you can only have one set of values
for noindex_start and noindex_end, so it may be a bit tricky to configure
them for the specific strings you're using.  Alternatively, you can run
all HTML through an external converter before htdig parses it, which
lets you use Perl or sed to strip out all sorts of things from the HTML.
See the examples in http://www.htdig.org/files/contrib/scripts/,
specifically README.geoupdate-ungeoify and ungeoify.sh, as well as
http://www.htdig.org/files/contrib/parsers/unhypermail.sh.
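
As an illustration of the converter approach, here is a minimal Python
sketch of the filtering step only.  The function name and the sample
page are made up for the example, and the sketch assumes the markers
appear exactly as the two comments you describe.  Hooking it up to htdig
(reading the input file and writing the result to stdout, as the contrib
scripts do) is left out.

```python
import re

# Match everything between the two marker comments.  Whitespace inside the
# comments is flexible, so a marker wrapped across lines still matches.
CONTENT_BLOCK = re.compile(
    r"<!--\s*content start\s*-->(.*?)<!--\s*content end\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_boilerplate(html: str) -> str:
    """Keep only the text between the content markers."""
    blocks = CONTENT_BLOCK.findall(html)
    if not blocks:
        # No markers found: pass the page through unchanged.
        return html
    body = "\n".join(block.strip() for block in blocks)
    return f"<html><body>\n{body}\n</body></html>"


page = """<html><body>
<div>site navigation</div>
<!-- content start -->
<p>Article 135 text.</p>
<!-- content end -->
<div>footer</div>
</body></html>"""

print(strip_boilerplate(page))
```

Note that this throws away everything outside the markers, including
the <title>, so a real converter would probably want to copy the head
section through as well.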

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
