[htdig] Problems with searching PDF

Tom Wooldridge Fri, 25 Jun 1999 07:20:42 -0700

Hi,

        We are having a problem with PDF files on our webserver.  We have
a great deal of PDF content available, so we raised the max_document_size
variable to 500k.  I also added the external parsers line and configured
the parse_doc.pl script to parse PDF files.  Here is the output I get
after running htdig:

intranet01:/opt/www/htdig/bin # ./htdig -vvvv -i
<output trimmed>

---- notice that the URL is rejected here ---
url rejected: (level 1)file://tc/vol1/ol_prod/mainmenu.pdf
word: requires@676
word: acrobat@681
word: reader@684
Tag: /p>, matched -1
Tag: p>, matched -1
Tag: img src="image15.gif" width="16" height="17">, matched 18
image: http://intranet01/departments/mortgageloan/image15.gif
Tag: font
face="Arial">, matched -1
Tag: /font>, matched -1
Tag: a href="branch/tclist.htm">, matched 2
A tag: pos = 2, position = ="branch/tclist.htm">
word: list@732
word: branches@736
word: thin@740
word: client@742
Tag: /a>, matched 3
href: http://intranet01/departments/mortgageloan/branch/tclist.htm (List
of Branches on Thin Client)
resolving 'http://intranet01/departments/mortgageloan/branch/tclist.htm'


This pattern continues for every PDF file that the search engine
encounters.  I cannot get any more detailed debugging output, so I am
unable to investigate further.
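
Since the full -vvvv output is long, a small filter along the following
lines could pull out just the rejected URLs so they can be compared side
by side.  This is only a sketch: it assumes every rejection line has the
same shape as the single "url rejected" line shown above, and the names
rejected_urls and count_by_scheme are made up for illustration (they are
not part of htdig).

```python
import re
from collections import Counter

# Assumed line shape, copied from the one sample in the output above:
#   url rejected: (level 1)file://tc/vol1/ol_prod/mainmenu.pdf
REJECTED = re.compile(r"^url rejected: \(level (\d+)\)\s*(\S+)")


def rejected_urls(lines):
    """Return (level, url) pairs for every rejection line in the log."""
    found = []
    for line in lines:
        match = REJECTED.match(line.strip())
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found


def count_by_scheme(pairs):
    """Count rejected URLs per scheme (the text before '://')."""
    return Counter(url.split("://", 1)[0] for _, url in pairs)
```

With the htdig output saved to a file (say htdig.log, a hypothetical
name), calling rejected_urls(open("htdig.log")) would list each rejected
URL with its level, and count_by_scheme would show whether the rejections
share a URL scheme.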

Any help is greatly appreciated.
Tom Wooldridge
