Hi,
I am trying to index PDF documents, but htdig doesn't parse their contents. Searches return only the file name.

doc2html parses these files properly when run from the command line, but not when it is run through htdig. Can someone tell me what the problem is?

My htdig.conf file is
----------------------------

database_dir:           /var/lib/htdig
start_url:      http://MySite/PostNuke/html/Downloads/
limit_urls_to:          ${start_url}
exclude_urls:           /cgi-bin/ .cgi  C=D C=M C=N C=S O=A O=D
bad_extensions:         .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
        .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css

maintainer:     [EMAIL PROTECTED]
max_head_length:        10000
max_doc_size:           1000000
no_excerpt_show_top:    true
search_algorithm:       exact:1 synonyms:0.5 endings:0.1
external_parsers:
application/rtf->text/html /var/www/html/doc2html/doc2html.pl \
text/rtf->text/html /var/www/html/doc2html/doc2html.pl \
application/pdf->text/html /var/www/html/doc2html/doc2html.pl \
application/postscript->text/html /var/www/html/doc2html/doc2html.pl \
application/msword->text/html /var/www/html/doc2html/doc2html.pl \
application/msexcel->text/html /var/www/html/doc2html/doc2html.pl \
application/vnd.ms-excel->text/html /var/www/html/doc2html/doc2html.pl \
application/vnd.ms-powerpoint->text/html /var/www/html/doc2html/doc2html.pl \

----------------------------
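
One thing that may be worth checking in the config above: the "external_parsers:" line has no trailing backslash, while the lines after it do. The sketch below is not htdig's own parser. It is a small Python helper (read_attributes is a made-up name) that assumes htdig joins a line ending in a backslash with the line that follows it, and shows what value the attribute would get under that assumption. The two sample strings are shortened copies of the entries above. The line breaks may also have been changed by the mail client, so this is only a hint.

```python
def read_attributes(text):
    """Return {name: value} for a config text.

    Assumption (not taken from htdig's source): a line ending in a
    backslash is continued on the next line, and each logical line
    has the form "name: value".
    """
    attrs = {}
    logical = ""
    # The extra empty string flushes a trailing continuation at end of text.
    for raw in text.splitlines() + [""]:
        line = raw.rstrip()
        if line.endswith("\\"):
            # Drop the backslash and keep collecting the logical line.
            logical += line[:-1] + " "
            continue
        logical += line
        name, sep, value = logical.partition(":")
        if sep and name.strip() and not name.lstrip().startswith("#"):
            # Collapse runs of whitespace in the value.
            attrs[name.strip()] = " ".join(value.split())
        logical = ""
    return attrs


# As posted: nothing continues the "external_parsers:" line.
as_posted = """external_parsers:
application/pdf->text/html /var/www/html/doc2html/doc2html.pl \\
application/msword->text/html /var/www/html/doc2html/doc2html.pl \\

"""

# Hypothetical variant: backslash after the attribute name, none on the last entry.
with_backslash = """external_parsers: \\
application/pdf->text/html /var/www/html/doc2html/doc2html.pl \\
application/msword->text/html /var/www/html/doc2html/doc2html.pl
"""

print(repr(read_attributes(as_posted).get("external_parsers")))
print(repr(read_attributes(with_backslash).get("external_parsers")))
```

If the attribute does come out empty, that would fit the "PDF::parse" lines in the rundig output further down, which do not look like they come from doc2html.pl.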

Output of $htdig -vvvv
        0:1:http://mysite/PostNuke/html/Downloads/
New server: mysite, 80
Retrieval command for http://mysite/robots.txt: GET /robots.txt HTTP/1.0^M
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])^M
Host: mysite^M
^M
Header line: HTTP/1.1 404 Not Found
Header line: Date: Thu, 15 Apr 2004 10:19:51 GMT
Header line: Server: Apache/2.0.40 (Red Hat Linux)
Header line: Vary: accept-language
Header line: Accept-Ranges: bytes
Header line: Content-Length: 1066
Header line: Connection: close
Header line: Content-Type: text/html; charset=ISO-8859-1
Header line: Expires: Thu, 15 Apr 2004 10:19:51 GMT
Header line:
returnStatus = 1
 pushed
        0:1:http://mysite/PostNuke/html/Downloads/Test.pdf pushed
        1:1:http://mysite/PostNuke/html/Downloads/ skipped
pick: mysite, # servers = 1
0:2:0:http://mysite/PostNuke/html/Downloads/: Retrieval command for http://mysite/PostNuke/html/Downloads/: GET /PostNuke/html/Downloads/ HTTP/1.0^M
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])^M
If-Modified-Since: Thu, 15 Apr 2004 10:19:34 GMT^M
Host: mysite^M
^M
Header line: HTTP/1.1 200 OK
Header line: Date: Thu, 15 Apr 2004 10:19:51 GMT
Header line: Server: Apache/2.0.40 (Red Hat Linux)
Header line: Content-Length: 736
Header line: Connection: close
Header line: Content-Type: text/html; charset=ISO-8859-1
Header line:
returnStatus = 0
Read 736 from document
Read a total of 736 bytes
 (changed) Tag: <html>, matched -1
Tag: <head>, matched -1
Tag: <title>, matched 0
word: [EMAIL PROTECTED]
word: PostNuke/html/[EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: PostNuke/[EMAIL PROTECTED]
word part: html/[EMAIL PROTECTED]
Tag: </title>, matched 1

title: Index of /PostNuke/html/Downloads
Tag: </head>, matched -1
Tag: <body>, matched -1
Tag: <h1>, matched 4
word: [EMAIL PROTECTED]
word: PostNuke/html/[EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: PostNuke/[EMAIL PROTECTED]
word part: html/[EMAIL PROTECTED]
Tag: </h1>, matched 10
Tag: <pre>, matched -1
Tag: <img src="" alt="Icon " />, matched 18
word: [EMAIL PROTECTED]
image: http://mysite/icons/blank.gif
Tag: <a href="" matched 2
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/Downloads/?C=N&O=D (Name)

  Rejected: Item in the exclude list: item # 5 length: 3

url rejected: (level 1)http://mysite/PostNuke/html/Downloads/?C=N&O=D
Tag: <a href="" matched 2
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/Downloads/?C=M&O=A (Last modified)

  Rejected: Item in the exclude list: item # 4 length: 3

url rejected: (level 1)http://mysite/PostNuke/html/Downloads/?C=M&O=A
Tag: <a href="" matched 2
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/Downloads/?C=S&O=A (Size)

  Rejected: Item in the exclude list: item # 6 length: 3

url rejected: (level 1)http://mysite/PostNuke/html/Downloads/?C=S&O=A
Tag: <a href="" matched 2
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/Downloads/?C=D&O=A (Description)

  Rejected: Item in the exclude list: item # 3 length: 3

url rejected: (level 1)http://mysite/PostNuke/html/Downloads/?C=D&O=A
Tag: <hr />, matched -1
Tag: <img src="" alt="[DIR]" />, matched 18
word: [EMAIL PROTECTED]
image: http://mysite/icons/back.gif
Tag: <a href="" matched 2
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/ (Parent Directory)

   Rejected: URL not in the limits!
url rejected: (level 1)http://mysite/PostNuke/html/
Tag: <img src="" alt="[   ]" />, matched 18
image: http://mysite/icons/layout.gif
Tag: <a href="" matched 2
word: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://mysite/PostNuke/html/Downloads/Test.pdf (Test.pdf)
resolving 'http://mysite/PostNuke/html/Downloads/Test.pdf'
*word: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Tag: <hr />, matched -1
Tag: </pre>, matched -1
Tag: <address>, matched -1
word: Apache/[EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: Apache/[EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: Apache/[EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Tag: </address>, matched -1
Tag: </body>, matched -1
Tag: </html>, matched -1
 size = 736
pick: mysite, # servers = 1
1:3:1:http://mysite/PostNuke/html/Downloads/Test.pdf: Retrieval command for http://mysite/PostNuke/html/Downloads/Test.pdf: GET /PostNuke/html/Downloads/Test.pdf HTTP/1.0^M
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])^M
If-Modified-Since: Thu, 15 Apr 2004 08:14:43 GMT^M
Host: mysite^M
^M
Header line: HTTP/1.1 304 Not Modified
Header line: Date: Thu, 15 Apr 2004 10:19:51 GMT
Header line: Server: Apache/2.0.40 (Red Hat Linux)
Header line: Connection: close
Header line: ETag: "4e244-25b54-aff3c2c0"
Header line:
returnStatus = 2
 not changed
pick: mysite, # servers = 1
-----------------------------------
Part of the output of $rundig -vvvv
--------------------
Read 8192 from document
Read 6996 from document
Read a total of 154452 bytes                  // The file size is correct
PDF::setContents(154452 bytes)
PDF::parse(http://172.17.127.60/PostNuke/html/Downloads/Test.pdf)
PDF::parseNonTextLine: title is "Capability_4_1_June2002.PDF"
.
.
.
title: Capability_4_1_June2002.PDF
PDF::parseNonTextLine: total pages is 14
PDF::parseNonTextLine: start page 1
PDF::parseNonTextLine: begin text block
PDF::parseTextLine("297.59999 732.23999 TD") cmd=TD
PDF::parseTextLine("0 0 0 rg") cmd=rg
PDF::parseTextLine("/N6 28.07998 Tf") cmd=Tf
.
.
.
PDF::parseTextLine("0 Tc") cmd=Tc
PDF::parseTextLine("0.13198 Tw") cmd=Tw
PDF::parseTextLine("( )Tj ") cmd=
PDF::parseTextLine("(EXECUTIVE SUMMARY)Tj ") cmd=
PDF::parseTextLine("114.95999 0 TD") cmd=TD
PDF::parseTextLine("-0.00479 Tc") cmd=Tc
PDF::parseTextLine("0 Tw") cmd=Tw
PDF::parseTextLine("(................................)Tj ") cmd=
PDF::parseTextLine("99.83999 0 TD") cmd=TD
PDF::parseTextLine("(................................)Tj ") cmd=
PDF::parseTextLine("99.83999 0 TD") cmd=TD
PDF::parseTextLine("(................................)Tj ") cmd=
PDF::parseTextLine("99.83999 0 TD") cmd=TD
PDF::parseTextLine("(..)Tj ") cmd=
.
.
.
.
PDF::parseTextLine("ET") cmd=ET
PDF::parse: head = ""
PDF::parse: 83919 lines parsed
PDF::parse ends normally
 size = 154452
pick: 172.17.127.60, # servers = 1
htmerge: Sorting...
htmerge: Merging...

0/http://mysite/PostNuke/html/Downloads/
Deleted, no excerpt: 1/http://mysite/PostNuke/html/Downloads/Test.pdf

Thanks,
Neha Verma
