The trick to getting the PDF parsers to work properly is to test them from the command line (in other words, take htdig out of the equation). If they output text, then the problem is not with the parsers but with some aspect of htdig.

That said, here is how our intranet search engine server is configured:

external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
                  application/postscript->text/html /usr/local/bin/conv_doc.pl \
                  application/pdf->text/html /usr/local/bin/conv_doc.pl

and then in conv_doc.pl I have the following (abbreviating):
$CATDOC = "/usr/local/bin/catdoc";
$CATPS = "/usr/bin/ps2ascii";
$CATPDF = "/usr/local/bin/pdftotext";
$PDFINFO = "/usr/local/bin/pdfinfo";

and then test just the conv_doc.pl part like this:

./conv_doc.pl /path/to/some/document.pdf

and the text version should be printed out. If not, you know the problem is either in conv_doc.pl or in the conversion utility itself (/usr/local/bin/pdftotext), which can also be tested directly.
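
If it helps, the two checks can be wrapped in a small shell helper. This is only a rough sketch: check_converter is a name I made up, it is not part of htdig or xpdf, and it assumes the converter writes its result to standard output.

```shell
#!/bin/sh
# check_converter is a hypothetical helper, not part of htdig or xpdf.
# Usage: check_converter <label> <path-to-command> [args...]
# It runs the command and reports whether anything arrived on stdout.
check_converter() {
    label=$1
    shift
    # Expect a path (absolute or relative) to an executable file.
    if [ ! -x "$1" ]; then
        echo "MISSING: $label ($1 is not an executable file)"
        return 2
    fi
    # Count the bytes written to stdout; stderr stays visible on the terminal.
    bytes=$("$@" | wc -c | tr -d ' ')
    if [ "$bytes" -gt 0 ]; then
        echo "OK: $label ($bytes bytes)"
        return 0
    fi
    echo "FAIL: $label (no output)"
    return 1
}
```

Assuming the paths from my setup above, you would call it like this (pdftotext sends the text to standard output when the output file is given as "-"):

    check_converter pdftotext /usr/local/bin/pdftotext /path/to/some/document.pdf -
    check_converter conv_doc ./conv_doc.pl /path/to/some/document.pdf

If the first line says OK and the second says FAIL, look at conv_doc.pl; if both say OK, look at htdig.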

Good luck.

Ted Stresen-Reuter

On Jul 31, 2005, at 8:19 PM, Robert Isaac wrote:

I am setting up a new ProLiant DL360 G4 server with Red Hat ES Linux 4 and Apache 2.0.x.

I had copied over htdig 3.1.6 from the old server, but decided to install 3.2.0b6 with the view of using it when the server goes live in a few days. What a nightmare.

The htdig web site (http://www.htdig.org/dev/htdig-3.2/) is ambiguous about 3.2.0b6 and PDF indexing: FAQ 1.13 just refers to FAQ 4.9. I have the xpdf package installed and used it with 3.1.6. When I indexed our web site (3200 pages, half of them PDFs) it took over 13 hours - yes, thirteen hours!! And then it deleted every one of the PDFs. That was using:

external_parsers: application/pdf->text/html /var/www/cgi-bin/doc2html.pl

in htdig.conf.

I also tried acroconv.pl but it didn't work at all.

I would appreciate some help with this.

Thanks

Bob

_______________________________________________
ht://Dig general mailing list: <[email protected]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general
