Hello,
I am using HTDIG 3.1.5 on Red Hat 7.0 and am having problems indexing PDF
files. I have included my config and rundig -vv output below. I have no
robots.txt file, my max_doc_size is now 10M (one test .pdf file is only 27K
and it also fails), and .pdf is not listed in bad_extensions.
I am using the latest xpdf with pdftotext, as well as the latest parse_doc
and conv_doc scripts.
I can manually run pdftotext on the PDF files, and they do contain real text,
not just images. I can also run parse_doc.pl and conv_doc.pl by hand, and
they produce proper text. When I do a rundig, I get a 'url rejected' message
and I do not know why. This (I presume) leads to the 'Deleted, no excerpt'
message, and the file (or any PDF file) is not indexed. Any suggestions?
Regards,
Tony
___________BELOW is my CONFIG ________
external_parsers: application/msword /usr/bin/parse_doc.pl \
application/postscript /usr/bin/parse_doc.pl \
application/pdf /usr/bin/parse_doc.pl
database_dir: /data/software/htdigdb
local_urls: http://80.1.1.4/=/var/www/html/
start_url: http://80.1.1.4/htdig/
limit_urls_to: ${start_url}
exclude_urls: /cgi-bin/ .cgi
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .iso \
.jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi
maintainer: [EMAIL PROTECTED]
max_head_length: 50000
max_doc_size: 10000000
no_excerpt_show_top: true
search_algorithm: exact:1 synonyms:0.5 endings:0.1
no_next_page_text:
no_prev_page_text:
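
As a sanity check on the exclusion lists in the config above, here is a rough Python sketch that tests whether a URL would be caught by bad_extensions or exclude_urls. It is only an approximation under my own assumptions (a suffix match on the URL path for bad_extensions, a plain substring match for exclude_urls); the function name would_be_excluded is hypothetical and this is not htdig's actual code.

```python
from urllib.parse import urlparse

# Values copied from the config above.
BAD_EXTENSIONS = [
    ".wav", ".gz", ".z", ".sit", ".au", ".zip", ".tar", ".hqx", ".exe",
    ".com", ".gif", ".iso", ".jpg", ".jpeg", ".aiff", ".class", ".map",
    ".ram", ".tgz", ".bin", ".rpm", ".mpg", ".mov", ".avi",
]
EXCLUDE_URLS = ["/cgi-bin/", ".cgi"]


def would_be_excluded(url):
    """Approximate check (assumed semantics, not htdig's real matching)."""
    # Assumption: bad_extensions is compared against the end of the URL path.
    path = urlparse(url).path.lower()
    if any(path.endswith(ext) for ext in BAD_EXTENSIONS):
        return True
    # Assumption: exclude_urls rejects any URL containing one of the patterns.
    return any(pattern in url for pattern in EXCLUDE_URLS)


if __name__ == "__main__":
    for name in ("content.pdf", "content.txt", "sonic.pdf"):
        url = "http://80.1.1.4/htdig/mx59pro/manual/english/" + name
        print(url, would_be_excluded(url))
```

Under those assumptions, neither list matches the three test documents, which is consistent with .pdf not being rejected as an extension.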
____________Below is output of rundig -vv using 2 pdf files and 1 txt
file ______
New server: 80.1.1.4, 80
Trying local files
tried local file /var/www/html/robots.txt
Local retrieval failed, trying HTTP
pick: 80.1.1.4, # servers = 1
0:0:0:http://80.1.1.4/htdig/mx59pro/manual/english/: Trying local files
tried local file /var/www/html/htdig/mx59pro/manual/english/index.html
Local retrieval failed, trying HTTP
title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=D">
pushing http://80.1.1.4/htdig/mx59pro/manual/english/?N=D
+A tag: pos = 2, position = ="?M=A">
pushing http://80.1.1.4/htdig/mx59pro/manual/english/?M=A
+A tag: pos = 2, position = ="?S=A">
pushing http://80.1.1.4/htdig/mx59pro/manual/english/?S=A
+A tag: pos = 2, position = ="?D=A">
pushing http://80.1.1.4/htdig/mx59pro/manual/english/?D=A
+A tag: pos = 2, position = ="/htdig/mx59pro/manual/">
url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.pdf">
pushing http://80.1.1.4/htdig/mx59pro/manual/english/content.pdf
+A tag: pos = 2, position = ="content.txt">
pushing http://80.1.1.4/htdig/mx59pro/manual/english/content.txt
+A tag: pos = 2, position = ="sonic.pdf">
pushing http://80.1.1.4/htdig/mx59pro/manual/english/sonic.pdf
+ size = 954
pick: 80.1.1.4, # servers = 1
1:1:1:http://80.1.1.4/htdig/mx59pro/manual/english/?N=D: Trying local files
tried local file /var/www/html/htdig/mx59pro/manual/english/?N=D
Local retrieval failed, trying HTTP
title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A">
pushing http://80.1.1.4/htdig/mx59pro/manual/english/?N=A
+A tag: pos = 2, position = ="?M=A">
*A tag: pos = 2, position = ="?S=A">
*A tag: pos = 2, position = ="?D=A">
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/">
url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="sonic.pdf">
*A tag: pos = 2, position = ="content.txt">
*A tag: pos = 2, position = ="content.pdf">
* size = 954
pick: 80.1.1.4, # servers = 1
2:2:1:http://80.1.1.4/htdig/mx59pro/manual/english/?M=A: Trying local files
tried local file /var/www/html/htdig/mx59pro/manual/english/?M=A
Local retrieval failed, trying HTTP
title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A">
*A tag: pos = 2, position = ="?M=D">
pushing http://80.1.1.4/htdig/mx59pro/manual/english/?M=D
+A tag: pos = 2, position = ="?S=A">
*A tag: pos = 2, position = ="?D=A">
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/">
url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.pdf">
*A tag: pos = 2, position = ="sonic.pdf">
*A tag: pos = 2, position = ="content.txt">
* size = 954
pick: 80.1.1.4, # servers = 1
3:3:1:http://80.1.1.4/htdig/mx59pro/manual/english/?S=A: Trying local files
tried local file /var/www/html/htdig/mx59pro/manual/english/?S=A
Local retrieval failed, trying HTTP
title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A">
*A tag: pos = 2, position = ="?M=A">
*A tag: pos = 2, position = ="?S=D">
pushing http://80.1.1.4/htdig/mx59pro/manual/english/?S=D
+A tag: pos = 2, position = ="?D=A">
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/">
url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.txt">
*A tag: pos = 2, position = ="content.pdf">
*A tag: pos = 2, position = ="sonic.pdf">
* size = 954
pick: 80.1.1.4, # servers = 1
4:4:1:http://80.1.1.4/htdig/mx59pro/manual/english/?D=A: Trying local files
tried local file /var/www/html/htdig/mx59pro/manual/english/?D=A
Local retrieval failed, trying HTTP
title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A">
*A tag: pos = 2, position = ="?M=A">
*A tag: pos = 2, position = ="?S=A">
*A tag: pos = 2, position = ="?D=D">
pushing http://80.1.1.4/htdig/mx59pro/manual/english/?D=D
+A tag: pos = 2, position = ="/htdig/mx59pro/manual/">
url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.pdf">
*A tag: pos = 2, position = ="content.txt">
*A tag: pos = 2, position = ="sonic.pdf">
* size = 954
pick: 80.1.1.4, # servers = 1
5:5:1:http://80.1.1.4/htdig/mx59pro/manual/english/content.pdf: Trying local files
found existing file /var/www/html/htdig/mx59pro/manual/english/content.pdf
size = 6705
pick: 80.1.1.4, # servers = 1
6:6:1:http://80.1.1.4/htdig/mx59pro/manual/english/content.txt: Trying local files
found existing file /var/www/html/htdig/mx59pro/manual/english/content.txt
size = 115
pick: 80.1.1.4, # servers = 1
7:7:1:http://80.1.1.4/htdig/mx59pro/manual/english/sonic.pdf: Trying local files
found existing file /var/www/html/htdig/mx59pro/manual/english/sonic.pdf
size = 377264
pick: 80.1.1.4, # servers = 1
8:8:2:http://80.1.1.4/htdig/mx59pro/manual/english/?N=A: Trying local files
tried local file /var/www/html/htdig/mx59pro/manual/english/?N=A
Local retrieval failed, trying HTTP
title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=D">
*A tag: pos = 2, position = ="?M=A">
*A tag: pos = 2, position = ="?S=A">
*A tag: pos = 2, position = ="?D=A">
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/">
url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.pdf">
*A tag: pos = 2, position = ="content.txt">
*A tag: pos = 2, position = ="sonic.pdf">
* size = 954
pick: 80.1.1.4, # servers = 1
9:9:2:http://80.1.1.4/htdig/mx59pro/manual/english/?M=D: Trying local files
tried local file /var/www/html/htdig/mx59pro/manual/english/?M=D
Local retrieval failed, trying HTTP
title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A">
*A tag: pos = 2, position = ="?M=A">
*A tag: pos = 2, position = ="?S=A">
*A tag: pos = 2, position = ="?D=A">
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/">
url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.txt">
*A tag: pos = 2, position = ="sonic.pdf">
*A tag: pos = 2, position = ="content.pdf">
* size = 954
pick: 80.1.1.4, # servers = 1
10:10:2:http://80.1.1.4/htdig/mx59pro/manual/english/?S=D: Trying local files
tried local file /var/www/html/htdig/mx59pro/manual/english/?S=D
Local retrieval failed, trying HTTP
title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A">
*A tag: pos = 2, position = ="?M=A">
*A tag: pos = 2, position = ="?S=A">
*A tag: pos = 2, position = ="?D=A">
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/">
url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="sonic.pdf">
*A tag: pos = 2, position = ="content.pdf">
*A tag: pos = 2, position = ="content.txt">
* size = 954
pick: 80.1.1.4, # servers = 1
11:11:2:http://80.1.1.4/htdig/mx59pro/manual/english/?D=D: Trying local files
tried local file /var/www/html/htdig/mx59pro/manual/english/?D=D
Local retrieval failed, trying HTTP
title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A">
*A tag: pos = 2, position = ="?M=A">
*A tag: pos = 2, position = ="?S=A">
*A tag: pos = 2, position = ="?D=A">
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/">
url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="sonic.pdf">
*A tag: pos = 2, position = ="content.txt">
*A tag: pos = 2, position = ="content.pdf">
* size = 954
pick: 80.1.1.4, # servers = 1
htmerge: Sorting...
htmerge: Merging...
0/http://80.1.1.4/htdig/mx59pro/manual/english/
4/http://80.1.1.4/htdig/mx59pro/manual/english/?D=A
11/http://80.1.1.4/htdig/mx59pro/manual/english/?D=D
2/http://80.1.1.4/htdig/mx59pro/manual/english/?M=A
9/http://80.1.1.4/htdig/mx59pro/manual/english/?M=D
8/http://80.1.1.4/htdig/mx59pro/manual/english/?N=A
1/http://80.1.1.4/htdig/mx59pro/manual/english/?N=D
3/http://80.1.1.4/htdig/mx59pro/manual/english/?S=A
10/http://80.1.1.4/htdig/mx59pro/manual/english/?S=D
Deleted, no excerpt: 5/http://80.1.1.4/htdig/mx59pro/manual/english/content.pdf
6/http://80.1.1.4/htdig/mx59pro/manual/english/content.txt
htmerge: 10
Deleted, no excerpt: 7/http://80.1.1.4/htdig/mx59pro/manual/english/sonic.pdf
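
To see which URLs the 'url rejected' lines refer to, and which documents end up under 'Deleted, no excerpt', output like the above can be summarized with a small Python sketch. The function name summarize_rundig and both regular expressions are my own assumptions, based only on the lines shown in this message; they are not part of htdig.

```python
import re

# Both patterns are guesses derived from the rundig -vv output in this post.
REJECTED_RE = re.compile(r"url rejected: \(level \d+\)\s*(\S+)")
# \s* also spans a line break, in case the mailer wrapped the entry.
DELETED_RE = re.compile(r"Deleted, no excerpt:\s*\d+/(\S+)")


def summarize_rundig(log_text):
    """Return (unique rejected URLs, deleted document URLs) from the log."""
    rejected = sorted(set(REJECTED_RE.findall(log_text)))
    deleted = DELETED_RE.findall(log_text)
    return rejected, deleted


if __name__ == "__main__":
    import sys

    # Usage (file name is hypothetical): python summarize.py < rundig.log
    rejected, deleted = summarize_rundig(sys.stdin.read())
    print("rejected:", rejected)
    print("deleted: ", deleted)
```

Run against the output above, this would make it easy to check whether the rejected URL and the deleted PDF documents are actually the same URLs.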