----- Original Message -----
Sent: Tuesday, December 14, 2004 5:19 PM
Subject: [htdig] pdf indexing problems
I posted a question recently about indexing PDFs with doc2html, but I still can't figure out what the problem is. I believe the config is correct, but there may be a problem there. When I dig a number of PDFs, the files are read but the words indexed are not correct.
Does anyone know what this indicates? From looking at the message archives, it seems that others have had this problem, but no solutions were posted in those messages.
My config and output follow. Thanks in advance for any help; I appreciate it.
in doc2html.pl:
$ENV{DOC2HTML_LOG} = '/www/htdig/bin/doc2html/DOC2HTML_LOG';
my $PDF2HTML = '/www/htdig/bin/doc2html/pdf2html.pl';
in pdf2html.pl:
my $PDFTOTEXT = "/usr/bin/pdftotext";
my $PDFINFO = "/usr/bin/pdfinfo";
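For reference, a quick way to confirm that the scripts and helper programs named above exist and are executable is sketched below. This is only a sanity check under the assumption that a wrong path could be involved, not a known fix; the check_tool function is a hypothetical helper made up for illustration, and the paths are simply the ones quoted in this message.

```shell
#!/bin/sh
# Report whether each helper named in doc2html.pl / pdf2html.pl exists
# and is executable. check_tool is an illustrative helper, not part of
# ht://Dig or doc2html.
check_tool() {
    if [ -x "$1" ] && [ ! -d "$1" ]; then
        echo "ok: $1"
    else
        echo "missing or not executable: $1"
    fi
}

# Paths copied from the settings quoted in this message; adjust as needed.
for tool in /www/htdig/bin/doc2html/doc2html.pl \
            /www/htdig/bin/doc2html/pdf2html.pl \
            /usr/bin/pdftotext \
            /usr/bin/pdfinfo; do
    check_tool "$tool"
done
```

Any line reported as missing would point at a path problem rather than at the htdig config itself.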
rundig output:
config file:
database_dir: /www/htdig/db_flexco_new
exclude_urls: /cgi-bin/ .cgi /prod_info/safety.cfm /landing.cfm
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
    .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css #.pdf
max_head_length: 10000
max_doc_size: 5000000
no_excerpt_show_top: true
search_algorithm: exact:1 synonyms:0.5 endings:0.1
template_map: Long long ${common_dir}/flexco/long.html \
    Short short ${common_dir}/flexco/short.html
template_name: long
search_results_header: ${common_dir}/flexco/header.html
search_results_footer: ${common_dir}/flexco/footer.html
#search_results_wrapper: ${common_dir}/flexco/wrapper.html
nothing_found_file: ${common_dir}/flexco/nomatch.html
syntax_error_file: ${common_dir}/flexco/syntax.html
cookie: authorized=true
maximum_pages: 20
external_parsers: application/pdf->text/html /www/htdig/bin/doc2html/doc2html.pl
wordlist_compress: false
wordlist_compress_zlib: false
minimum_word_length: 2
bad_word_list: ${common_dir}/badwords.txt