Hi, folks. There have been two or three requests in the past for a
way to use an external converter or external parser for HTML files, and
also use the internal parser. The reason one might want to do this is
to preprocess all HTML files before parsing.
This weekend, it dawned on me that there's a simple way to do this in the
existing code. It takes advantage of the fact that htdig will only use
the external parser or converter for a given Content-Type if the type of
the document matches fully the type for a specified parser or converter
(i.e. it uses a Dictionary class Find or Exists operator, which uses
strcmp()), whereas for internal parsers only a partial match is needed
(htdig uses mystrncasecmp() for this comparison). What this means is
if you use a type of text/html-internal or text/plain-internal, htdig
will use the internal parser for text/html or text/plain without batting
an eye. So, for example, you can use a definition like:

external_parsers: text/html->text/html-internal /usr/local/bin/conv_html.sh

where conv_html.sh is this script:

#!/bin/sh
# Rename <title>/</title> to <noindex>/</noindex>, then turn <h1>/</h1>
# into <title>/</title>. "$1" is the path of the input file.
sed -e 's|\(</*\)[Tt][Ii][Tt][Ll][Ee]>|\1noindex>|g' \
    -e 's|\(</*\)[Hh]1>|\1title>|g' "$1"

to preprocess HTML files to ignore the existing <title>, and change the
<h1> heading into a <title>. I tried this with the stock 3.1.5 code and
with the ftp://ftp.ccsf.org/htdig-patches/3.1.5/ExternalParser.1 patch
of January 15th (a backport of the 3.2.0b3 code), so I'm sure it'll work
with the upcoming 3.2.0b3 release as well.
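
As a rough way to check the conversion by hand before pointing htdig at
it, something like the sketch below could be used. It wraps the same two
sed expressions in a function and runs them on a small sample file. The
function name conv_html and the sample HTML are made up for illustration.

```shell
#!/bin/sh
# Same two substitutions as conv_html.sh, wrapped in a function
# (hypothetical name) so they can be tried outside of htdig.
conv_html() {
    sed -e 's|\(</*\)[Tt][Ii][Tt][Ll][Ee]>|\1noindex>|g' \
        -e 's|\(</*\)[Hh]1>|\1title>|g' "$1"
}

# Made-up sample document, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html><head><TITLE>Old title</TITLE></head>
<body><h1>Real heading</h1><p>Body text.</p></body></html>
EOF

# Run the conversion, show the result, and clean up.
result=$(conv_html "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

The old <TITLE> element should come out as <noindex>, and the <h1>
heading as <title>. Note that the patterns only match bare tags, so an
<h1> carrying attributes would pass through unchanged.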
I just thought I'd share this simple little kludge with all of you.
Of course, using this approach would make htdig deal with HTML much more
slowly than if you hacked the htdig/HTML.cc parser code to do what you
want, but some users may prefer this approach anyway.
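
The matching rule this trick relies on can be modelled in a few lines of
shell. This is only an illustration of the exact-versus-prefix behaviour
described earlier, not htdig's actual code. The function names
has_external, internal_parser and dispatch are invented here, and the
single external key is assumed to be text/html, as in the
external_parsers line.

```shell
#!/bin/sh
# Illustrative model only; this is not how htdig is implemented.

# External converters: the type must equal a configured key exactly.
has_external() {
    [ "$1" = "text/html" ]   # assumed: the only key in external_parsers
}

# Internal parsers: a case-insensitive prefix match is enough.
internal_parser() {
    lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$lower" in
        text/html*)  echo "internal HTML parser" ;;
        text/plain*) echo "internal plain-text parser" ;;
        *)           echo "no internal parser" ;;
    esac
}

# Try the exact external match first, then fall back to the prefix match.
dispatch() {
    if has_external "$1"; then
        echo "external converter"
    else
        internal_parser "$1"
    fi
}

dispatch "text/html"
dispatch "text/html-internal"
dispatch "text/plain-internal"
```

In this model the first call goes to the external converter, while the
two -internal types fall through to the internal parsers, which is why
the converter's output is not handed back to the converter a second time.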
--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930