Hi.  I'm looking for help getting htdig to index PDF files on an
intranet server.  I first set up htdig 3.1.6 on Mac OS X Server running
Apache 1.3.26, without any PDF support, and rundig indexed all files
without problems.  I then tried to add PDF support:

1. Installed xpdf-2.00 to add the pdfinfo and pdftotext utilities to
/usr/local/bin

2. Added doc2html.pl and pdf2html.pl to /usr/local/bin (and made them
executable).

3. Modified the following line in doc2html.pl:
   my $PDF2HTML = '/usr/local/bin/pdf2html.pl';

4. Modified the following in pdf2html.pl:
   my $PDFTOTEXT = "/usr/local/bin/pdftotext";
   my $PDFINFO = "/usr/local/bin/pdfinfo";

5. Added the following to htdig.conf:
   external_parsers:  application/pdf->text/html /usr/local/bin/doc2html.pl
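
The paths configured in steps 1 to 5 can be sanity-checked in one pass.
The following is a minimal shell sketch, not part of the original setup:
check_exec is a hypothetical helper name, and the paths are simply the
ones listed in the steps above.

```sh
#!/bin/sh
# Hypothetical helper: report whether a path exists and is executable.
check_exec() {
    if [ -x "$1" ]; then
        echo "ok: $1"
    else
        echo "NOT executable: $1"
    fi
}

# Paths taken from steps 1-4 above.
for tool in /usr/local/bin/pdftotext /usr/local/bin/pdfinfo \
            /usr/local/bin/pdf2html.pl /usr/local/bin/doc2html.pl; do
    check_exec "$tool"
done
```

Any line reported as "NOT executable" points at a path or permission
problem rather than at htdig itself.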

With this setup, I've had no luck getting any PDF files indexed.

So far in troubleshooting, I've found the following:
a. Both pdfinfo and pdftotext seem to work when executed on a PDF file.
b. Executing '/usr/local/bin/pdf2html.pl' on a PDF file generates
appropriate output.
c. Executing '/usr/local/bin/doc2html.pl' gives the following error, which
I don't understand:
   !       UNABLE to convert
d. After setting 'start_url' in htdig.conf to a directory containing only
PDF files (10 of them), the following is the output of 'rundig -vvv':
   ----------------
   New server: 1.0.20.78, 80
   Retrieval command for http://1.0.20.78/robots.txt: GET /robots.txt HTTP/1.0
   User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
   Host: 1.0.20.78
   
   Header line: HTTP/1.1 404 Not Found
   Header line: Date: Fri, 17 Jan 2003 19:15:31 GMT
   Header line: Server: Apache/1.3.27 (Darwin) PHP/4.1.2
   Header line: Connection: close
   Header line: Content-Type: text/html; charset=iso-8859-1
   Header line: 
   returnStatus = 1
    pushed
   pick: 1.0.20.78, # servers = 1
   0:0:0:http://1.0.20.78/test_pdfs/: Retrieval command for http://1.0.20.78/test_pdfs/: GET /test_pdfs/ HTTP/1.0
   User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
   Host: 1.0.20.78
   
   Header line: HTTP/1.1 403 Forbidden
   Header line: Date: Fri, 17 Jan 2003 19:15:31 GMT
   Header line: Server: Apache/1.3.27 (Darwin) PHP/4.1.2
   Header line: Connection: close
   Header line: Content-Type: text/html; charset=iso-8859-1
   Header line: 
   returnStatus = 1
    not found
   pick: 1.0.20.78, # servers = 1
   htmerge: Sorting...
   htmerge: Removing doc #0
   DB2 problem...: missing or empty key value specified
   
   Deleted, no excerpt: 0/http://1.0.20.78/test_pdfs/
   ----------------
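
The verbose log in item d can be condensed to the HTTP status of each
request, and the failing step in item c can be re-run with the argument
order htdig uses for external parsers.  A minimal shell sketch follows.
Assumptions: the helper names http_statuses and parser_args are
hypothetical; the rundig output is assumed to have been saved to a file;
and the argument order (input file, content type, URL, configuration
file) follows the description of the external_parsers attribute in the
htdig documentation, so it should be checked against the installed
version.  The sample file name and config path in the comments are
placeholders.

```sh
#!/bin/sh
# Hypothetical helper: print the status code of every HTTP response
# found in 'rundig -vvv' output read from standard input.
http_statuses() {
    awk '/Header line: HTTP\// { print $4 }'
}

# Hypothetical helper: print the argument list htdig is assumed to pass
# to an external parser: input file, content type, URL, config file.
parser_args() {
    printf '%s %s %s %s\n' "$1" "application/pdf" "$2" "$3"
}

# Example usage (placeholders, not run here):
#   rundig -vvv > rundig.log 2>&1
#   http_statuses < rundig.log
#   /usr/local/bin/doc2html.pl $(parser_args sample.pdf \
#       http://1.0.20.78/test_pdfs/sample.pdf /path/to/htdig.conf)
```

For the log shown in item d, http_statuses prints 404 (the robots.txt
request) and 403 (the request for /test_pdfs/).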
   
Thanks for any help!
-Jason Morse
[EMAIL PROTECTED]




_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
