Thanks for the tips.  By supplying these input parameters to doc2html.pl,
I verified that it is indeed working correctly when manually executed on a
PDF file.  
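
For anyone repeating that manual test, here is a minimal sketch of it, assuming a POSIX shell. The helper name check_converter is made up for illustration, and the file path and URL in the usage comment are placeholders, not real files. It passes the three arguments described in the quoted reply below (file name, MIME type, URL) and only reports whether the converter wrote anything to standard output.

```shell
#!/bin/sh
# check_converter (hypothetical helper): call an external converter with the
# three arguments file name, MIME type, and URL, then report whether it
# wrote anything at all to standard output.
check_converter() {
    converter=$1
    file=$2
    url=$3
    if "$converter" "$file" "application/pdf" "$url" | grep -q .; then
        echo "converter produced output"
    else
        echo "converter produced no output"
    fi
}

# Usage (placeholder path and URL):
#   check_converter /usr/local/bin/doc2html.pl /tmp/sample.pdf \
#       http://1.0.20.78/test_pdfs/sample.pdf
```

This only shows that the converter runs and emits something; it does not show that htdig ever hands it a PDF.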

However, the 'rundig -vvv' log is still mystifying.  I get the exact same
rundig results whether the target directory is empty or contains several
PDF files.  If I add an HTML file to the directory along with the PDF
files, rundig indicates that the file is indexed, but does not mention the
PDFs.  It seems that the PDFs are being completely ignored.  Here is the
log after running on the directory with 1 HTML and 10 PDFs:
---------
[JaquarTest:WebServer/htdig/bin] admin% sudo ./rundig -vvv
        1:1:http://1.0.20.78/test_pdfs/
New server: 1.0.20.78, 80
Retrieval command for http://1.0.20.78/robots.txt: GET /robots.txt HTTP/1.0
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
Host: 1.0.20.78

Header line: HTTP/1.1 404 Not Found
Header line: Date: Mon, 20 Jan 2003 15:51:32 GMT
Header line: Server: Apache/1.3.27 (Darwin) PHP/4.1.2
Header line: Connection: close
Header line: Content-Type: text/html; charset=iso-8859-1
Header line: 
returnStatus = 1
 pushed
pick: 1.0.20.78, # servers = 1
0:0:0:http://1.0.20.78/test_pdfs/: Retrieval command for http://1.0.20.78/test_pdfs/: GET /test_pdfs/ HTTP/1.0
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
Host: 1.0.20.78

Header line: HTTP/1.1 200 OK
Header line: Date: Mon, 20 Jan 2003 15:51:32 GMT
Header line: Server: Apache/1.3.27 (Darwin) PHP/4.1.2
Header line: Cache-Control: max-age=60
Header line: Expires: Mon, 20 Jan 2003 15:52:32 GMT
Header line: Last-Modified: Mon, 20 Jan 2003 15:08:04 GMT
Converted Mon, 20 Jan 2003 15:08:04 GMT to Mon, 20 Jan 2003 15:08:04
Header line: ETag: "20039-105-3e2c10d4"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 261
Header line: Connection: close
Header line: Content-Type: text/html
Header line: 
returnStatus = 0
Read 261 from document
Read a total of 261 bytes

title: Test page
 size = 261
pick: 1.0.20.78, # servers = 1
htmerge: Sorting...
htmerge: Merging...

0/http://1.0.20.78/test_pdfs/
---------

I've verified the permissions on the target directory and the files --
they're all read-write for ugo.  I tried adding a "valid_extensions"
attribute to the htdig.conf to explicitly identify ".pdf".  I've used PDF
files from a different source, in case my test PDFs were somehow bad.  Any
other ideas?  I'm happy with htdig so far, but this PDF issue is really
going to limit its usefulness for our intranet as so much of our content
is in this format.
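
One further check that may be worth making: htdig finds documents by following links from the start_url page, and the log above shows only a single 261-byte page titled "Test page" being fetched for /test_pdfs/. If that page has no links to the PDFs, htdig would have nothing to follow. A rough sketch, assuming a POSIX shell with grep and sed, and curl for the fetch (list_pdf_links is a made-up helper name, and the filter is a crude pattern match, not a real HTML parser):

```shell
#!/bin/sh
# list_pdf_links (hypothetical helper): read HTML on standard input and print
# the value of every double-quoted href attribute that ends in .pdf,
# matching case-insensitively.
list_pdf_links() {
    grep -i -o 'href="[^"]*\.pdf"' | sed -e 's/^[^"]*"//' -e 's/"$//'
}

# Usage (needs network access to the server being indexed):
#   curl -s http://1.0.20.78/test_pdfs/ | list_pdf_links
```

If this prints nothing for the start URL, the page htdig retrieved does not link to any PDF. That would be consistent with the PDFs never appearing in the log.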

Thanks,
Jason
[EMAIL PROTECTED]


[EMAIL PROTECTED] writes:
>It looks as though you have done everything right in configuring
>doc2html.
>
>The "!       UNABLE to convert!" message comes from doc2html.pl; it means
>that doc2html.pl went through its list of converters but failed to find one
>for .PDF files.  If you execute doc2html.pl on the command line you need to
>specify three arguments:
>
>     /usr/local/bin/doc2html.pl    filename    "application/pdf"    url
>
>
>Your  -vvv log includes:
>
>    Header line: HTTP/1.1 403 Forbidden
>
>This would seem to be the problem.
>
>--
>David Adams
>Information Systems Services
>Southampton University
>
>
>----- Original Message -----
>From: "Jason Morse" <[EMAIL PROTECTED]>
>To: <[EMAIL PROTECTED]>
>Sent: Friday, January 17, 2003 7:29 PM
>Subject: [htdig] Trouble with PDF setup
>
>
>> Hi.  I'm looking for some help regarding PDF files with htdig on an
>> intranet server.  I first set up htdig 3.1.6 on a Mac OS X Server running
>> Apache 1.3.26, without any PDF support.  Rundig indexed all files without
>> problem. I then tried to add PDF file support:
>>
>> 1. Installed xpdf-2.00 to add the pdfinfo and pdftotext utilities to
>> /usr/local/bin
>>
>> 2. Added doc2html.pl and pdf2html.pl to /usr/local/bin (and made them
>> executable).
>>
>> 3. Modified the following line in doc2html.pl:
>>    my $PDF2HTML = '/usr/local/bin/pdf2html.pl';
>>
>> 4. Modified the following in pdf2html.pl:
>>    my $PDFTOTEXT = "/usr/local/bin/pdftotext";
>>    my $PDFINFO = "/usr/local/bin/pdfinfo";
>>
>> 5. Added the following to htdig.conf:
>>    external_parsers:  application/pdf->text/html /usr/local/bin/doc2html.pl
>>
>> I've not had any luck getting PDF files indexed with this setup.
>>
>> So far in troubleshooting, I've found the following:
>> a. Both pdfinfo and pdftotext seem to work when executed on a PDF file.
>> b. Executing '/usr/local/bin/pdf2html.pl' on a PDF file generates
>> appropriate output.
>> c. Executing '/usr/local/bin/doc2html.pl' gives the following error, which
>> I don't understand:
>>    !       UNABLE to convert
>> d. After setting 'start_url' in htdig.conf to a directory containing only
>> PDF files (10 of them), the following is the output of 'rundig -vvv':
>>    ----------------
>>    New server: 1.0.20.78, 80
>>    Retrieval command for http://1.0.20.78/robots.txt: GET /robots.txt HTTP/1.0
>>    User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
>>    Host: 1.0.20.78
>>
>>    Header line: HTTP/1.1 404 Not Found
>>    Header line: Date: Fri, 17 Jan 2003 19:15:31 GMT
>>    Header line: Server: Apache/1.3.27 (Darwin) PHP/4.1.2
>>    Header line: Connection: close
>>    Header line: Content-Type: text/html; charset=iso-8859-1
>>    Header line:
>>    returnStatus = 1
>>     pushed
>>    pick: 1.0.20.78, # servers = 1
>>    0:0:0:http://1.0.20.78/test_pdfs/: Retrieval command for http://1.0.20.78/test_pdfs/: GET /test_pdfs/ HTTP/1.0
>>    User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
>>    Host: 1.0.20.78
>>
>>    Header line: HTTP/1.1 403 Forbidden
>>    Header line: Date: Fri, 17 Jan 2003 19:15:31 GMT
>>    Header line: Server: Apache/1.3.27 (Darwin) PHP/4.1.2
>>    Header line: Connection: close
>>    Header line: Content-Type: text/html; charset=iso-8859-1
>>    Header line:
>>    returnStatus = 1
>>     not found
>>    pick: 1.0.20.78, # servers = 1
>>    htmerge: Sorting...
>>    htmerge: Removing doc #0
>>    DB2 problem...: missing or empty key value specified
>>
>>    Deleted, no excerpt: 0/http://1.0.20.78/test_pdfs/
>>    ----------------
>>
>> Thanks for any help!
>> -Jason Morse
>> [EMAIL PROTECTED]
>>
>>
>>
>>
>> -------------------------------------------------------
>> This SF.NET email is sponsored by: Thawte.com - A 128-bit supercerts
>will
>> allow you to extend the highest allowed 128 bit encryption to all your
>> clients even if they use browsers that are limited to 40 bit encryption.
>> Get a guide
>here:http://ads.sourceforge.net/cgi-bin/redirect.pl?thaw0030en
>> _______________________________________________
>> htdig-general mailing list <[EMAIL PROTECTED]>
>> To unsubscribe, send a message to
><[EMAIL PROTECTED]> with a subject of
>unsubscribe
>> FAQ: http://htdig.sourceforge.net/FAQ.html
>>
>
>
>



