I sent the attached message about a week ago (it's included only in case you want to look at it); I was having trouble parsing PDF files. I tracked the problem down to bad downloads: for some reason all the tar.gz files were downloaded in text mode, so their contents were jumbled together into one file. I re-downloaded in binary mode (which I should have done in the first place), and now everything works great.

I wanted to ask another question that I couldn't find an answer to in the archives or the FAQ: is it possible to have htdig search the keywords line in a PDF file's document-info section, and if so, what format does that line have to be in (comma-delimited, space-delimited, etc.)?

Thanks for your help. I appreciate it.

Wayne

<<pdf_parse.txt>>
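P.S. For reference, here is the kind of document-info entry I mean, mocked up by hand (the title is from one of my files; the keyword values are made up, and a real PDF's Info dictionary is written by the authoring tool):

```shell
# Hand-made fragment showing where keywords sit in a PDF's Info dictionary
# (demo values only - not a complete or real PDF file)
cat > /tmp/info_demo.txt <<'EOF'
1 0 obj
<< /Title (P3480 Eclipse Flush Mount)
   /Keywords (eclipse, flush mount, lighting) >>
endobj
EOF
grep '/Keywords' /tmp/info_demo.txt
```

My question is whether htdig can pick up that /Keywords string, and whether the delimiter (commas vs. spaces) matters.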
I have been working on getting PDF files to index, and so far the going is slow. I have 400 PDF files in the 20-40 KB size range. My hardware is as follows: Pentium 75, 32 MB RAM, 1.6 GB hard disk (500 MB free). It is set up as an intranet web server accessible only to people in our office. I have htdig version 3.1.5, with max_doc_size set to 5000000.

I have tried to use parse_doc.pl, conv_doc.pl, and doc2html.pl. All of these give me 14 consecutive ": command not found" error messages, then a "syntax error near unexpected token '( )'" error, and finally a message about "line 83: 'parts = ( );". That is what I get from all of the above scripts when I run them manually. I have checked the locations of the ps2ascii and pdftotext binaries referenced in the scripts, and they are correct. When run via rundig -vvv, the script just shuts down.

I have also tried acroread. It parses the PDFs and says that it reads them, but htmerge discards them. I know there is text in the title, which is what I need it to index; I can see it in the PostScript file after acroread is finished (when run manually). Following is an excerpt from rundig -vvv using acroread:

    pick: labweb1, # servers = 1
    37:37:3:http://labweb1/pdf/2000001.pdf: Trying local files
    found existing file /home/httpd/html/pdf/2000001.pdf
    Read 8192 from document
    Read 8192 from document
    Read 2218 from document
    Read a total of 18602 bytes
    PDF::setContents(18602 bytes)
    PDF::parse(http://labweb1/pdf/2000001.pdf)
    title: P3480 Eclipse Flush Mount
    PDF::parse: 5095 lines parsed
    PDF::parse ends normally
    size = 18602

It looks like it is reading the title; is there a way to index those words along with the 5095 lines of text? Right now I don't get the file returned when I search on any of the words in it.

This is the applicable part of the htdig.conf file:

    # These attributes allow indexing the server via the local filesystem rather than HTTP.
    local_urls: http://labweb1/=/home/httpd/html/
    local_user_urls: http://labweb1/=/home/,/public_html/
    pdf_parser: /bin/acroread -toPostScript -pairs
    #external_parsers: application/msword /usr/local/bin/parse_doc.pl \
        application/postscript /usr/local/bin/parse_doc.pl \
        application/pdf /usr/local/bin/parse_doc.pl

I would appreciate it if you could point me in the right direction; this is driving me nuts. If I need to provide any further information, I would be glad to.

TIA, I appreciate it.

Wayne
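P.S. In case it helps anyone reproduce this: repeated ": command not found" errors like the ones above can also come from DOS line endings (stray carriage returns) in a script. A quick way to check the first line of a script is to pipe it through od -c; this demo uses a throwaway file, not the real parse_doc.pl:

```shell
# Demo of the check: a file saved with DOS line endings shows a "\r"
# before the "\n" in od -c output (throwaway demo file, not the real script)
printf '#!/usr/local/bin/perl\r\n' > /tmp/demo_script
head -n 1 /tmp/demo_script | od -c
```

A clean script would show the line ending in "\n" alone, with no "\r" before it.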
