According to Bobby Mitchell:
> I want ht://Dig to allow searches on pdf documents only. I have tried to
> use exclude_urls to exclude .html and .jsp files, but I have some urls
> that point to a directory and the index.html file is served. How can I
> do this?

If you exclude all HTML files, then how is htdig supposed to find all
the links to the PDF files?

If you already have a complete list of URLs for all PDF files, then you
can feed that into htdig by setting start_url to that list and setting
max_hop_count to 0 (not that it really matters); htdig will then limit
itself to just those URLs. See http://www.htdig.org/FAQ.html#q5.25 for
a technique for generating start_url lists fairly automatically using
the find command.

If you want htdig to spider through the HTML looking for links, but not
index the HTML files, you could add an external converter for HTML files
that would add in a <meta name="robots" content="noindex,follow"> tag.
E.g.:

    external_parsers: application/pdf->text/html-internal /path/to/doc2html.pl \
                      text/html->text/html-internal /path/to/addnoindex.sh

where addnoindex.sh would be this simple shell script:

    #!/bin/sh
    # Emit the robots meta tag first, then the original HTML.
    # $1 is the file holding the fetched document.
    echo '<meta name="robots" content="noindex,follow">'
    cat "$1"

I think for this trick to work reliably, you'd need to upgrade to the
3.1.6 release of htdig, because older versions had a problem with the
HTML parser turning indexing back on at inappropriate spots.

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
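
For illustration only, the start_url approach from the reply above could be sketched roughly as follows. This is not necessarily the recipe given in the FAQ. The function name pdf_urls, the document root, and the base URL are all hypothetical placeholders. It assumes the PDF files are served straight from a local directory tree, that the document root contains no characters special to sed (such as "|"), and that file names need no URL escaping.

```shell
#!/bin/sh
# Hypothetical helper (not part of ht://Dig): print one URL per PDF file
# found under a local document root.
#   $1 = document root on disk   (placeholder, e.g. /var/www/html)
#   $2 = base URL it is served as (placeholder, e.g. http://www.example.com)
pdf_urls() {
    docroot=${1%/}    # drop a single trailing slash, if any
    baseurl=${2%/}
    # List regular files ending in .pdf (lowercase only), then rewrite
    # the leading filesystem path into the base URL.
    find "$docroot" -type f -name '*.pdf' -print |
        sed -e "s|^$docroot/|$baseurl/|" |
        sort
}

# Example (paths are assumptions):
#   pdf_urls /var/www/html http://www.example.com > /path/to/pdf_urls.txt
```

The resulting file would then be what start_url points at. As far as I understand the ht://Dig configuration syntax, a file name enclosed in backquotes is replaced by that file's contents, but check the documentation for your version before relying on it.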

