I sent the attached message about a week ago (it's included only in case you want to look at it); I was having trouble parsing PDF files. I tracked the problem down to bad downloads: for some reason all the tar.gz files were downloaded in text mode, so their contents were jumbled together into one file. I re-downloaded in binary mode (which I should have done in the first place), and now everything works great.

I wanted to ask another question that I couldn't find an answer to in the archives or the FAQ: is it possible to have htdig search the keywords line in a PDF file's document-info section, and if so, what format does that line have to be in (comma-delimited, space-delimited, etc.)?

Thanks for your help. I appreciate it.

Wayne

<<pdf_parse.txt>>
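P.S. For reference, here is the kind of document-info entry I mean, mocked up by hand (the title is from one of my files; the keyword values are made up, and a real PDF's Info dictionary is written by the authoring tool):

```shell
# Hand-made fragment showing where keywords sit in a PDF's Info dictionary
# (demo values only - not a complete or real PDF file)
cat > /tmp/info_demo.txt <<'EOF'
1 0 obj
<< /Title (P3480 Eclipse Flush Mount)
   /Keywords (eclipse, flush mount, lighting) >>
endobj
EOF
grep '/Keywords' /tmp/info_demo.txt
```

My question is whether htdig can pick up that /Keywords string, and whether the delimiter (commas vs. spaces) matters.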
I have been working on getting PDF files to index, and so far the going is slow. I have 400 PDF files in the 20-40 KB size range. My hardware is as follows: Pentium 75, 32 MB RAM, 1.6 GB hard disk (500 MB free). It is set up as an intranet web server accessible only to people in our office. I have htdig version 3.1.5, with max_doc_size set to 5000000.

I have tried to use parse_doc.pl, conv_doc.pl, and doc2html.pl. All of these give me 14 consecutive ": command not found" error messages, then a "syntax error near unexpected token '( )'" error, and finally a message about "line 83: 'parts = ( );". That is what I get from all of the above scripts when I run them manually. I have checked the locations of the ps2ascii and pdftotext binaries referenced in the scripts, and they are correct. When run via rundig -vvv, the script just shuts down.

I have also tried acroread. It parses the PDFs and says that it reads them, but htmerge discards them. I know there is text in the title, which is what I need it to index; I can see it in the PostScript file after acroread is finished (when run manually). Following is an excerpt from rundig -vvv using acroread:

    pick: labweb1, # servers = 1
    37:37:3:http://labweb1/pdf/2000001.pdf: Trying local files
    found existing file /home/httpd/html/pdf/2000001.pdf
    Read 8192 from document
    Read 8192 from document
    Read 2218 from document
    Read a total of 18602 bytes
    PDF::setContents(18602 bytes)
    PDF::parse(http://labweb1/pdf/2000001.pdf)
    title: P3480 Eclipse Flush Mount
    PDF::parse: 5095 lines parsed
    PDF::parse ends normally
    size = 18602

It looks like it is reading the title; is there a way to index those words along with the 5095 lines of text? Right now I don't get the file returned when I search on any of the words in it.

This is the applicable part of the htdig.conf file:

    # These attributes allow indexing the server via the local filesystem rather than HTTP.
    local_urls: http://labweb1/=/home/httpd/html/
    local_user_urls: http://labweb1/=/home/,/public_html/
    pdf_parser: /bin/acroread -toPostScript -pairs
    #external_parsers: application/msword /usr/local/bin/parse_doc.pl \
        application/postscript /usr/local/bin/parse_doc.pl \
        application/pdf /usr/local/bin/parse_doc.pl

I would appreciate it if you could point me in the right direction; this is driving me nuts. If I need to provide any further information, I would be glad to.

TIA, I appreciate it.

Wayne
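P.S. In case it helps anyone reproduce this: repeated ": command not found" errors like the ones above can also come from DOS line endings (stray carriage returns) in a script. A quick way to check the first line of a script is to pipe it through od -c; this demo uses a throwaway file, not the real parse_doc.pl:

```shell
# Demo of the check: a file saved with DOS line endings shows a "\r"
# before the "\n" in od -c output (throwaway demo file, not the real script)
printf '#!/usr/local/bin/perl\r\n' > /tmp/demo_script
head -n 1 /tmp/demo_script | od -c
```

A clean script would show the line ending in "\n" alone, with no "\r" before it.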
