Thanks for the help. It turned out to be permissions related. It should have worked
with a specific group as the owner, but it only worked with nobody as the owner. Does pdf2html.pl write to itself?
That's the only reason I know of why this would happen. So I got pdf2html working and it indexes everything now,
except that it adds extensions such as .pmd, .p65 and .qxd to the end of the document title in the search results
for PDF files. I need to get this working fast, so I added:
 
    $title =~  s/(\.pmd|\.p65|\.qxd)//g;
 
to sub pdf_head in pdf2html.pl,
but obviously this isn't a real solution. Does anyone know what might cause this?
I'd like to fix this correctly.
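
A slightly tighter form of the same workaround, as a sketch only: it removes one of these extensions only when it is the very last thing in the title, and ignores case, so a title that merely mentions ".qxd" in the middle is left alone. The helper name strip_source_extension is made up for illustration and is not part of pdf2html.pl. One possible cause, which is an assumption and not confirmed anywhere in this thread, is that the Title field stored in each PDF still holds the name of the PageMaker (.pmd, .p65) or QuarkXPress (.qxd) source file it was generated from.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of pdf2html.pl): strip a single trailing
# .pmd/.p65/.qxd extension from a title and leave everything else as-is.
sub strip_source_extension {
    my ($title) = @_;
    # Anchored at the end of the string, case-insensitive, and tolerant
    # of trailing whitespace after the extension.
    $title =~ s/\.(?:pmd|p65|qxd)\s*$//i;
    return $title;
}

print strip_source_extension('AR301 Rivet Gauge.qxd'), "\n";
```

If the assumption about the Title field holds, running pdfinfo by hand on one of the affected files would show the extension there, and the lasting fix would be to correct the titles in the PDFs themselves.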
 
thanks
----- Original Message -----
Sent: Wednesday, December 15, 2004 9:26 AM
Subject: Re: [htdig] pdf indexing problems

It looks as though the problem is with /www/htdig/bin/doc2html/pdf2html.pl; certainly doc2html.pl cannot execute it.  It could be permissions, or it could be the first line of pdf2html.pl - check that it gives the correct path to the Perl executable.
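
The two checks suggested above can be sketched in Perl roughly as follows. This is a minimal sketch, not part of doc2html: the function name check_script is made up for illustration, and it only tests the execute permission for the current user and whether the interpreter named on the #! line exists and is executable. It should be run as the same user that runs htdig, since the result depends on who is asking.

```perl
use strict;
use warnings;

# Hypothetical helper: list reasons why a script could not be executed
# directly. An empty list means none of these particular checks failed.
sub check_script {
    my ($path) = @_;
    return ("$path does not exist") unless -e $path;

    my @problems;
    push @problems, "$path is not executable by this user" unless -x $path;

    if (open my $fh, '<', $path) {
        my $first = <$fh>;
        close $fh;
        $first = '' unless defined $first;
        if ($first =~ /^#!\s*(\S+)/) {
            # $1 is the interpreter path given on the first line.
            push @problems, "interpreter $1 is missing or not executable"
                unless -x $1;
        } else {
            push @problems, "$path has no #! line";
        }
    } else {
        push @problems, "cannot read $path: $!";
    }
    return @problems;
}

# Default path taken from the log message in this thread.
my $script = shift(@ARGV) || '/www/htdig/bin/doc2html/pdf2html.pl';
my @problems = check_script($script);
print @problems
    ? join("\n", @problems) . "\n"
    : "no problems found with $script\n";
```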
 
Because doc2html.pl is unable to run pdf2html.pl it is falling back on reading the PDF file as though it were plain text, hence those strange words you were seeing. 
 
David Adams
Corporate Information Services
Information Systems Services
University of Southampton
----- Original Message -----
Sent: Wednesday, December 15, 2004 2:41 PM
Subject: Re: [htdig] pdf indexing problems

? [application/pdf] Plain Text 190604
!! Unable to execute /www/htdig/bin/doc2html/pdf2html.pl for PDF (pdf2html) document
 
in the log file, which would lead me to believe that the permissions are wrong or that
my $PDF2HTML = '/www/htdig/bin/doc2html/pdf2html.pl';
is wrong in doc2html.pl
 
but as far as I know that's not the case. Is there anything else that could be causing this?
 
thanks
----- Original Message -----
Sent: Wednesday, December 15, 2004 3:59 AM
Subject: Re: [htdig] pdf indexing problems

What do you see in the /www/htdig/bin/doc2html/DOC2HTML_LOG file?
 
David Adams
----- Original Message -----
Sent: Tuesday, December 14, 2004 5:19 PM
Subject: [htdig] pdf indexing problems

I posted a question recently about indexing PDFs with doc2html,
but I can't figure out what the problem is. I believe that the config is correct,
but there may be a problem there. When I dig a number of PDFs, the files
are read but the words indexed are not correct.
Does anyone know what this indicates?
From looking at the message archives, it seems that others have had this problem,
but no solutions were posted in those messages.
 
My config and output follow. Thanks in advance for any help, I appreciate it.
 
in doc2html.pl:
 
$ENV{DOC2HTML_LOG} = '/www/htdig/bin/doc2html/DOC2HTML_LOG';
my $PDF2HTML = '/www/htdig/bin/doc2html/pdf2html.pl';
 
in pdf2html.pl:
 
my $PDFTOTEXT = "/usr/bin/pdftotext";
my $PDFINFO = "/usr/bin/pdfinfo";
 
rundig output:
 
Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 907 from document
Read a total of 361355 bytes
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
 size = 361355
pick: www.flexco.com, # servers = 1
80:358:0:http://www.flexco.com/prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf: Retrieval command for http://www.flexco.com/prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf: GET /prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf HTTP/1.0
Cookie: authorized=true
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
Host: www.flexco.com
 
config file:
 
database_dir:  /www/htdig/db_flexco_new
limit_urls_to:  http://www.flexco.com/
exclude_urls:  /cgi-bin/ .cgi /prod_info/safety.cfm /landing.cfm
bad_extensions:  .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
 .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css #.pdf
maintainer:  [EMAIL PROTECTED]
max_head_length: 10000
max_doc_size:  5000000
no_excerpt_show_top: true
search_algorithm: exact:1 synonyms:0.5 endings:0.1
template_map: Long long ${common_dir}/flexco/long.html \
  Short short ${common_dir}/flexco/short.html
template_name: long
search_results_header: ${common_dir}/flexco/header.html
search_results_footer: ${common_dir}/flexco/footer.html
#search_results_wrapper: ${common_dir}/flexco/wrapper.html
nothing_found_file: ${common_dir}/flexco/nomatch.html
syntax_error_file: ${common_dir}/flexco/syntax.html
cookie: authorized=true
maximum_pages: 20
external_parsers: application/pdf->text/html /www/htdig/bin/doc2html/doc2html.pl
wordlist_compress: false
wordlist_compress_zlib: false
minimum_word_length: 2
bad_word_list: ${common_dir}/badwords.txt
