I am trying to use the htdig 3.1.5 binaries (from the UCLA mirror) on an AIX
4.3.3 machine (I cannot compile on any of our AIX machines).
htdig worked as advertised on html files right out of the gate.
We have many pdf files on our site, so I attempted to add options to the
config file to index these pdfs. At present it does not appear that the
contents of pdf files are being indexed (unique words from the pdf files do
not return any hits when I run search.html).
I installed Acrobat Reader 5.0 (free) on the AIX webserver, and tried a
pdf_parser option I found in one of the threads:
pdf_parser:/usr/lpp/Acrobat5/bin/acroread -toPostScript
I also tried using the -pairs option with no change in results. I could
list the contents of my ../db directory and I could see each of the pdf
files being converted to ps format (the ps file would grow larger and larger
and then start on a new file), but the contents were not indexed.
I read another thread that referenced using acroconv.pl with acroread, and I
changed my config file to include an external_parsers line:
external_parsers: application/pdf->text/html /usr/local/bin/acroconv.pl -f
I added the -f option when the system complained about finding work copies
from the conversion in the /tmp directory. The acroconv.pl script has a
remove command for the temp files, but it does not always remove them.
Three temp files are created in the /tmp directory (htdext.#.pdf,
htdext.#.ps and htdext.#) for each pdf file that htdig processes.
I am still not seeing the contents of the pdf in the index, but
this looks like I am getting closer to a solution.
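To see which work copies are left behind after a run, something like the following could be used. It is only a sketch: the pattern is taken from the three file names described above, and it assumes the "#" is a run of digits, which may not hold on every system. The function name is made up for illustration.

```python
import re
from pathlib import Path

# Names as described above: htdext.#.pdf, htdext.#.ps and htdext.#
# Assumption: "#" is a run of digits (for example a process ID).
TEMP_NAME = re.compile(r"^htdext\.(\d+)(\.pdf|\.ps)?$")


def leftover_temp_files(tmp_dir="/tmp"):
    """Return the sorted paths in tmp_dir whose names match htdext.#[.pdf|.ps]."""
    return sorted(
        p for p in Path(tmp_dir).iterdir()
        if p.is_file() and TEMP_NAME.match(p.name)
    )


if __name__ == "__main__":
    # Print size and path of every leftover work copy.
    for path in leftover_temp_files():
        print(path.stat().st_size, path)
```

A zero-length htdext.# file next to a large htdext.#.ps file would suggest that the PostScript step ran but the text step produced nothing.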
I have run the command: htdig -a -i -s -vvv > htdig.log. A piece of the log
is included below. I have changed the config file to index smaller and
smaller portions of our site after I noticed that the acroread conversion
failed on certain pdf documents on our site (I think we have some old
corrupted pdf documents that need some attention). Error messages about
"unexpected characters" appeared and the conversion seemed to halt. I
changed the start_url several times until a run completed without error
messages appearing on the screen.
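Rather than narrowing start_url by trial and error, the suspect pdf files could be picked out ahead of time with a rough structural check. The sketch below only looks for the "%PDF-" header and the "%%EOF" end marker that a normal pdf file carries. It assumes the damaged files are truncated or not really pdf at all, which may not be what triggers the "unexpected characters" messages. The function names and the document root passed to check_tree are hypothetical.

```python
from pathlib import Path


def pdf_structure_problems(data: bytes):
    """Return a list of structural problems found in raw pdf bytes."""
    problems = []
    # A pdf normally starts with a "%PDF-" header. Some files carry a
    # little leading junk, so search the first 1024 bytes.
    if b"%PDF-" not in data[:1024]:
        problems.append("missing %PDF- header")
    # A complete pdf normally has an "%%EOF" marker near its end.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF trailer")
    return problems


def check_tree(root):
    """Yield (path, problems) for every *.pdf under root that looks damaged."""
    # Assumption: root is the local directory the web server serves from.
    for path in sorted(Path(root).rglob("*.pdf")):
        problems = pdf_structure_problems(path.read_bytes())
        if problems:
            yield path, problems
```

A file that passes this check can still fail in acroread; the check only rules out the most obvious damage.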
============================ beginning of htdig.log ===================================
1:0:http://www.sde.state.nm.us/div/ais/
New server: www.sde.state.nm.us, 80
Retrieval command for http://www.sde.state.nm.us/robots.txt: GET /robots.txt
HTTP/1.0User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])Host:
www.sde.state.nm.usHeader line: HTTP/1.0 200 OK
Header line: Date: Fri, 20 Dec 2002 15:17:38 GMT
Header line: Server: Apache/1.1.0
Header line: Content-type: text/plain
Header line: Content-length: 86
Header line: Last-modified: Wed, 13 Nov 2002 14:53:54 GMT
Translated Wed, 13 Nov 2002 14:53:54 GMT to 2002-11-13 14:53:54 (102)
And converted to Wed, 13 Nov 2002 14:53:54
Header line:
returnStatus = 0
Read 86 from document
Read a total of 86 bytes
Parsing robots.txt file using myname = htdig
Robots.txt line: User-agent: *
Found 'user-agent' line: *
Robots.txt line: Disallow: /bd.of.ed
Found 'disallow' line: /bd.of.ed
Robots.txt line: User-agent: pingalink
Found 'user-agent' line: pingalink
Robots.txt line: Disallow: /
Robots.txt line: Disallow: /reta98
Pattern: /bd.of.ed
pushed
pick: www.sde.state.nm.us, # servers = 1
0:0:0:http://www.sde.state.nm.us/div/ais/: Retrieval command for
http://www.sde.state.nm.us/div/ais/: GET /div/ais/ HTTP/1.0User-Agent:
htdig/3.1.5 ([EMAIL PROTECTED])Host: www.sde.state.nm.usHeader line:
HTTP/1.0 200 OK
Header line: Date: Fri, 20 Dec 2002 15:17:38 GMT
Header line: Server: Apache/1.1.0
Header line: Content-type: text/html
Header line:
returnStatus = 0
Read 8171 from document
Read a total of 8171 bytes
pick: www.sde.state.nm.us, # servers = 1
1:1:1:http://www.sde.state.nm.us/div/ais/accred/: Retrieval command for
http://www.sde.state.nm.us/div/ais/accred/: GET /div/ais/accred/
HTTP/1.0User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])Referer:
http://www.sde.state.nm.us/div/ais/Host: www.sde.state.nm.usHeader line:
HTTP/1.0 200 OK
Header line: Date: Fri, 20 Dec 2002 15:17:38 GMT
Header line: Server: Apache/1.1.0
Header line: Content-type: text/html
Header line:
returnStatus = 0
Read 5130 from document
Read a total of 5130 bytes
pushing http://www.sde.state.nm.us/div/ais/accred/hsomanual.html
+A tag: pos = 2, position = ="grievance.html">
href: http://www.sde.state.nm.us/div/ais/accred/grievance.html (Grievance
Procedure for New Mexico School Districts)
resolving 'http://www.sde.state.nm.us/div/ais/accred/grievance.html'
...(many more html lines here)
pushing http://www.sde.state.nm.us/div/ais/assess/dl/SocialStudies.pdf
+A tag: pos = 2, position = ="dl/Writing.pdf" target="_blank">
href: http://www.sde.state.nm.us/div/ais/assess/dl/Writing.pdf (Writing)
resolving 'http://www.sde.state.nm.us/div/ais/assess/dl/Writing.pdf'
...(many more pdf lines here)
pick: www.sde.state.nm.us, # servers = 1
htdig: Run complete
htdig: 1 server seen:
htdig: www.sde.state.nm.us:80 160 documents
htdig: Errors to take note of:
Not found: http://www.sde.state.nm.us/div/ais/lic/leveloflicenses.pdf Ref: http://www.sde.state.nm.us/div/ais/lic/
...(several broken links listed here)
Not found: http://www.sde.state.nm.us/div/ais/assessment/dl\checklist.accom.english.lang.doc Ref: http://www.sde.state.nm.us/div/ais/assessment/ell.accomm.assess.10.31.02.htm
============================== end of htdig.log =====================================
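One thing the full log could answer is which Content-type the server reports for the pdf files, since the external_parsers line is keyed on application/pdf. The piece pasted above only shows text/plain and text/html headers, so this is something to check rather than a known cause. The sketch below counts the "Header line: Content-type:" entries in the form they appear above; the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches entries such as "Header line: Content-type: text/html".
# Any parameters after ";" (e.g. a charset) are ignored.
CONTENT_TYPE = re.compile(
    r"Header line:\s*Content-type:\s*([^\s;]+)", re.IGNORECASE
)


def content_type_counts(log_text):
    """Count the Content-type values reported in 'Header line:' entries."""
    return Counter(m.group(1).lower() for m in CONTENT_TYPE.finditer(log_text))


if __name__ == "__main__":
    with open("htdig.log", errors="replace") as fh:
        for ctype, n in content_type_counts(fh.read()).most_common():
            print(n, ctype)
```

If application/pdf never shows up in the tally, the converter would presumably never be called for those documents.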
I then ran the htmerge command: htmerge -s -a -vvv > htmerge.log
================================ beginning of htmerge.log =============================
htmerge: Sorting...
htmerge: Removing doc #112
... (13 documents listed here)
htmerge: Removing doc #71
htmerge: Merging...
htmerge: Discarding accommodatio in doc #159
...(many discarding messages)
htmerge: Discarding school in doc #152
htmerge: 3000:version
htmerge: 3100:work
htmerge: Total word count: 3130
0/http://www.sde.state.nm.us/div/ais/
...(mixture of html and pdf files here)
24/http://www.sde.state.nm.us/div/ais/assess/dl/AE.Memo.Electronic.Reporting.6.pdf
Deleted, no excerpt:
25/http://www.sde.state.nm.us/div/ais/assess/dl/AE.Report.Status.of.Student.Partic.in.Asmt.Prog.6-02.doc
...(mixture of pdf and html files here)
26/http://www.sde.state.nm.us/div/ais/assess/ell.accomm.assess.10.31.02.htm
Deleted, no excerpt:
70/http://www.sde.state.nm.us/div/ais/assess/ell.accomm.assess.10.31.02_files/filelist.xml
...(many pdf files here)
Deleted, no excerpt:
159/http://www.sde.state.nm.us/div/ais/assessment/dl\checklist.accom.english.lang.doc
153/http://www.sde.state.nm.us/div/ais/assessment/ell.accomm.assess.10.31.02.htm
Deleted, no excerpt:
158/http://www.sde.state.nm.us/div/ais/assessment/ell.accomm.assess.10.31.02_files/filelist.xml
...(many files listed here)
Deleted, no excerpt:
42/http://www.sde.state.nm.us/div/ais/data/ads/dl/Field%2092%20Layout.doc
...(many documents listed here)
32/http://www.sde.state.nm.us/div/ais/data/dl/calendarofreports02-3.pdf
...(many pdf files appear here)
htmerge: Total documents: 146
htmerge: Total doc db size (in K): 17857
================================ end of htmerge.log =============================
From my reading it makes sense that a pdf file would be listed by the htdig
process, and then removed by htmerge if it finds no excerpt. Why there is no
excerpt is the burning question.
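To check whether it really is the pdf files that htmerge drops, the "Deleted, no excerpt" entries in htmerge.log could be tallied by file extension. The sketch below is a rough one: it assumes each marker is followed by an "id/URL" entry either on the same line or on the next line, as in the piece pasted above, and that URLs are not wrapped in the log file itself. The function names are made up for illustration.

```python
import re
from collections import Counter

MARKER = "Deleted, no excerpt:"
ENTRY = re.compile(r"^(\d+)/(\S+)")  # e.g. "25/http://host/path/file.doc"


def deleted_urls(log_text):
    """Return the URLs that follow a 'Deleted, no excerpt:' marker."""
    urls = []
    pending = False  # marker seen, entry expected on the next non-empty line
    for line in log_text.splitlines():
        line = line.strip()
        if MARKER in line:
            match = ENTRY.match(line.split(MARKER, 1)[1].strip())
            if match:
                urls.append(match.group(2))
            pending = match is None
            continue
        if pending and line:
            match = ENTRY.match(line)
            if match:
                urls.append(match.group(2))
            pending = False
    return urls


def extension_counts(urls):
    """Tally URLs by lowercased file extension ('' when there is none)."""
    counts = Counter()
    for url in urls:
        name = url.rsplit("/", 1)[-1]
        counts[name.rsplit(".", 1)[-1].lower() if "." in name else ""] += 1
    return counts


if __name__ == "__main__":
    with open("htmerge.log", errors="replace") as fh:
        print(extension_counts(deleted_urls(fh.read())))
```

In the piece of the log pasted above, the marker sits next to .doc and .xml entries, so the tally would show whether any pdf entries carry it at all.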
I would appreciate guidance of any type as this is my first exposure to
htdig.
Paul Economides
Webmaster
NM Dept of Ed.
[EMAIL PROTECTED]