Hi. I'm looking for some help regarding PDF files with htdig on an intranet server. I first set up htdig 3.1.6 on a Mac OS X Server running Apache 1.3.26, without any PDF support. Rundig indexed all files without problem. I then tried to add PDF file support:

1. Installed xpdf-2.00 to add the pdfinfo and pdftotext utilities to /usr/local/bin.

2. Added doc2html.pl and pdf2html.pl to /usr/local/bin (and made them executable).

3. Modified the following line in doc2html.pl:

    my $PDF2HTML = '/usr/local/bin/pdf2html.pl';

4. Modified the following in pdf2html.pl:

    my $PDFTOTEXT = "/usr/local/bin/pdftotext";
    my $PDFINFO = "/usr/local/bin/pdfinfo";

5. Added the following to htdig.conf:

    external_parsers: application/pdf->text/html /usr/local/bin/doc2html.pl

I've not had any luck getting PDF files indexed with this setup. So far in troubleshooting, I've found the following:

a. Both pdfinfo and pdftotext seem to work when executed on a PDF file.

b. Executing '/usr/local/bin/pdf2html.pl' on a PDF file generates appropriate output.

c. Executing '/usr/local/bin/doc2html.pl' gives the following error, which I don't understand:

    ! UNABLE to convert

d. After setting 'start_url' in htdig.conf to a directory containing only PDF files (10 of them), the following is the output of 'rundig -vvv':

----------------
New server: 1.0.20.78, 80
Retrieval command for http://1.0.20.78/robots.txt: GET /robots.txt HTTP/1.0
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
Host: 1.0.20.78

Header line: HTTP/1.1 404 Not Found
Header line: Date: Fri, 17 Jan 2003 19:15:31 GMT
Header line: Server: Apache/1.3.27 (Darwin) PHP/4.1.2
Header line: Connection: close
Header line: Content-Type: text/html; charset=iso-8859-1
Header line:
returnStatus = 1
pushed
pick: 1.0.20.78, # servers = 1
0:0:0:http://1.0.20.78/test_pdfs/: Retrieval command for http://1.0.20.78/test_pdfs/: GET /test_pdfs/ HTTP/1.0
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
Host: 1.0.20.78

Header line: HTTP/1.1 403 Forbidden
Header line: Date: Fri, 17 Jan 2003 19:15:31 GMT
Header line: Server: Apache/1.3.27 (Darwin) PHP/4.1.2
Header line: Connection: close
Header line: Content-Type: text/html; charset=iso-8859-1
Header line:
returnStatus = 1
 not found
pick: 1.0.20.78, # servers = 1
htmerge: Sorting...
htmerge: Removing doc #0
DB2 problem...: missing or empty key value specified
Deleted, no excerpt: 0/http://1.0.20.78/test_pdfs/
----------------

Thanks for any help!

-Jason Morse
[EMAIL PROTECTED]
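
A note for anyone trying to reproduce the doc2html.pl test by hand: as far as the ht://Dig documentation for external_parsers describes it, htdig calls an external parser with four arguments (the path of a temporary file holding the document, the MIME type, the URL, and the config file path), so running the script with no arguments may not exercise the same code path. The following is a minimal Python sketch under that assumption, not a verified diagnosis. The function names parser_command and run_parser are made up for illustration, and the file, URL, and config paths must be supplied by whoever runs it.

```python
import subprocess
import sys

# Path taken from the setup steps in the message above.
PARSER = "/usr/local/bin/doc2html.pl"


def parser_command(path, mime_type, url, config, parser=PARSER):
    """Build the argument list in the order htdig is assumed to use
    for external parsers: file, MIME type, URL, config file."""
    return [parser, path, mime_type, url, config]


def run_parser(path, mime_type, url, config):
    """Run the converter once and return (exit code, stdout, stderr)."""
    result = subprocess.run(
        parser_command(path, mime_type, url, config),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__" and len(sys.argv) == 5:
    # Usage: script.py <pdf file> <mime type> <url> <htdig.conf path>
    code, out, err = run_parser(*sys.argv[1:5])
    print("exit code:", code)
    print(out[:500])  # first part of the converted HTML, if any
    print(err, file=sys.stderr)
```

If the converter produces HTML when called this way but still fails under rundig, the problem is more likely in how htdig reaches the documents than in the parser scripts themselves.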

