According to Bobby Mitchell:
> I want ht://Dig to allow searches on pdf documents only. I have tried to 
> use exclude_urls to exclude .html and .jsp files, but I have some urls 
> that point to a directory and the index.html file is served. How can I 
> do this?

If you exclude all HTML files, then how is htdig supposed to find all
the links to the PDF files?  If you already have a complete list of
URLs for all PDF files, then you can feed that into htdig by setting
start_url to that list, setting max_hop_count to 0 (not that it really
matters), and then htdig will limit itself to just those URLs.

See http://www.htdig.org/FAQ.html#q5.25 for a technique for generating
start_url lists fairly automatically using the find command.
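
As a rough sketch of that idea (not the FAQ's exact recipe), something
like the following would turn the PDF files under a document root into a
list of URLs. The function name, the document root and the base URL are
all made up for illustration, and it assumes your file paths map directly
onto URLs and contain nothing that needs URL-encoding:

```shell
#!/bin/sh
# list_pdf_urls DOCROOT BASEURL
# Hypothetical helper: print one URL per PDF file found under DOCROOT,
# assuming DOCROOT is served as BASEURL with the same directory layout.
list_pdf_urls() {
    docroot=${1%/}      # strip a trailing slash, if any
    baseurl=${2%/}
    find "$docroot" -type f -name '*.pdf' -print |
        while IFS= read -r path; do
            # Replace the leading document root with the base URL.
            printf '%s%s\n' "$baseurl" "${path#"$docroot"}"
        done | sort
}

# Example call (both arguments are placeholders):
#   list_pdf_urls /var/www/html http://www.example.com > /path/to/pdf_urls.txt
```

You would then point start_url at the resulting file. If I recall
correctly, the config file lets you pull in a file's contents by putting
its name in backquotes, e.g. start_url: `/path/to/pdf_urls.txt`, but
check the documentation for your version. Note that -name '*.pdf' is
case-sensitive, so files ending in .PDF would be missed.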

If you want htdig to spider through the HTML looking for links, but not
index the HTML files, you could add an external converter for HTML files
that would add in a <meta name="robots" content="noindex,follow"> tag.

E.g.:

external_parsers: application/pdf->text/html-internal /path/to/doc2html.pl \
        text/html->text/html-internal /path/to/addnoindex.sh

where addnoindex.sh would be this simple shell script:

#!/bin/sh
echo '<meta name="robots" content="noindex,follow">'
cat "$1"
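
Before pointing htdig at the script, it may be worth running it by hand
on a sample file. Here is a small sketch for doing that; check_converter
is a name I made up, and the content type and URL arguments are only
placeholders standing in for what I understand htdig passes after the
file name (the script above ignores them anyway):

```shell
#!/bin/sh
# check_converter SCRIPT SAMPLE
# Hypothetical helper: run a converter script on a sample HTML file and
# print the first line of its output, which should be the robots meta tag.
check_converter() {
    script=$1   # path to the converter script, e.g. /path/to/addnoindex.sh
    sample=$2   # path to any sample HTML file
    # The file name goes first; the remaining arguments are placeholders.
    sh "$script" "$sample" text/html http://www.example.com/sample.html |
        head -n 1
}

# Example call (paths are placeholders):
#   check_converter /path/to/addnoindex.sh /tmp/sample.html
```

If the first line printed isn't the meta tag, htdig won't see it ahead of
the document either.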

I think for this trick to work reliably, you'd need to upgrade to the
3.1.6 release of htdig, because older versions had a problem with the
HTML parser turning indexing back on at inappropriate spots.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
