You may want to double-check the permissions and ownership of pdftotext and pdfinfo.
I had the same problem and realized those files were owned by root (I had to install them as root), so the web server could not use them until I changed the ownership. Just a suggestion.
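
In case it helps, here is a rough sketch of how that check could be scripted. The function name check_exec is made up for illustration, and the paths are simply the ones quoted further down in this thread. It should be run as the account that runs htdig, which is an assumption about your setup, because root can often execute files that other users cannot.

```shell
#!/bin/sh
# check_exec is a hypothetical helper, not part of ht://Dig or doc2html.
# For each path it reports whether the file exists and whether the
# current user is allowed to execute it.
check_exec() {
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "MISSING $f"
        elif [ -x "$f" ]; then
            echo "OK $f"
        else
            echo "NOEXEC $f"
        fi
    done
}

# Paths taken from the configuration quoted in this thread.
check_exec /usr/bin/pdftotext /usr/bin/pdfinfo \
    /www/htdig/bin/doc2html/doc2html.pl \
    /www/htdig/bin/doc2html/pdf2html.pl
```

Anything reported as NOEXEC or MISSING is worth a closer look with ls -l to see the owner and mode before changing anything.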


On Wed, 15 Dec 2004 08:41:21 -0600, Jon Sorensen <[EMAIL PROTECTED]> wrote:

? [application/pdf] Plain Text 190604
!! Unable to execute /www/htdig/bin/doc2html/pdf2html.pl for PDF (pdf2html) document


That is what I see in the log file, which would lead me to believe that either the permissions are wrong or that
my $PDF2HTML = '/www/htdig/bin/doc2html/pdf2html.pl';
is wrong in doc2html.pl


but as far as I know that's not the case. Is there anything else that could be causing this?
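
One other thing that can produce an "unable to execute" failure even when the permissions look right is the interpreter named on the script's first line. The sketch below is only a guess at what might be worth checking, not a diagnosis. check_shebang is a made-up helper name, and the script path is the one from the log message above.

```shell
#!/bin/sh
# check_shebang is a hypothetical helper, not part of ht://Dig or doc2html.
# It reads the first line of a script and reports whether the interpreter
# named after "#!" exists and is executable by the current user.
check_shebang() {
    script=$1
    if [ ! -r "$script" ]; then
        echo "UNREADABLE $script"
        return
    fi
    first=$(head -n 1 "$script")
    case "$first" in
        '#!'*)
            # Drop the "#!" prefix, then keep only the interpreter path
            # (the first word), ignoring any arguments such as -w.
            rest=${first#??}
            set -- $rest
            interp=$1
            if [ -x "$interp" ]; then
                echo "OK $interp"
            else
                echo "BADINTERP $interp"
            fi
            ;;
        *)
            echo "NOSHEBANG $script"
            ;;
    esac
}

# Script path taken from the log message quoted in this thread.
check_shebang /www/htdig/bin/doc2html/pdf2html.pl
```

A script saved with DOS line endings would also show up as BADINTERP here, since the carriage return becomes part of the interpreter name.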

thanks
  ----- Original Message -----
  From: David Adams
  To: Jon Sorensen ; [EMAIL PROTECTED]
  Sent: Wednesday, December 15, 2004 3:59 AM
  Subject: Re: [htdig] pdf indexing problems


What do you see in the /www/htdig/bin/doc2html/DOC2HTML_LOG file?

  David Adams
    ----- Original Message -----
    From: Jon Sorensen
    To: [EMAIL PROTECTED]
    Sent: Tuesday, December 14, 2004 5:19 PM
    Subject: [htdig] pdf indexing problems


I posted a question recently about indexing PDFs with doc2html,
but I still can't figure out what the problem is. I believe that the config is correct,
but there may be a problem there. When I dig a number of PDFs, the files
are read but the words indexed are not correct:
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Does anyone know what this indicates?
From looking at the message archives, it seems that others have had this problem,
but no solutions were posted in those messages.


My config and output follow. Thanks in advance for any help, I appreciate it.

    in doc2html.pl:

    $ENV{DOC2HTML_LOG} = '/www/htdig/bin/doc2html/DOC2HTML_LOG';
    my $PDF2HTML = '/www/htdig/bin/doc2html/pdf2html.pl';

    in pdf2html.pl:

    my $PDFTOTEXT = "/usr/bin/pdftotext";
    my $PDFINFO = "/usr/bin/pdfinfo";

    rundig output:

Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 907 from document
Read a total of 361355 bytes
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
size = 361355
pick: www.flexco.com, # servers = 1
80:358:0:http://www.flexco.com/prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf: Retrieval command for http://www.flexco.com/prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf: GET /prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf HTTP/1.0
Cookie: authorized=true
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
Host: www.flexco.com



config file:

database_dir: /www/htdig/db_flexco_new
start_url: http://www.flexco.com/index.cfm
limit_urls_to: http://www.flexco.com/
exclude_urls: /cgi-bin/ .cgi /prod_info/safety.cfm /landing.cfm
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
.jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css #.pdf
maintainer: [EMAIL PROTECTED]
max_head_length: 10000
max_doc_size: 5000000
no_excerpt_show_top: true
search_algorithm: exact:1 synonyms:0.5 endings:0.1
template_map: Long long ${common_dir}/flexco/long.html \
Short short ${common_dir}/flexco/short.html
template_name: long
search_results_header: ${common_dir}/flexco/header.html
search_results_footer: ${common_dir}/flexco/footer.html
#search_results_wrapper: ${common_dir}/flexco/wrapper.html
nothing_found_file: ${common_dir}/flexco/nomatch.html
syntax_error_file: ${common_dir}/flexco/syntax.html
cookie: authorized=true
maximum_pages: 20
external_parsers: application/pdf->text/html /www/htdig/bin/doc2html/doc2html.pl
wordlist_compress: false
wordlist_compress_zlib: false
minimum_word_length: 2
bad_word_list: ${common_dir}/badwords.txt




