Re: [htdig] PDF indexing

Gilles Detillieux Fri, 13 Aug 1999 13:31:17 -0700

According to [EMAIL PROTECTED]:
> It says my PDF files are encrypted and xpdf doesn't do encrypted files.
> 
> Do I have any alternatives to get it to include the PDF files in the
> search?

There are a few alternatives.  There are decryption patches available
for xpdf; there should be a link to them on the foolabs.com web site.
That may do the trick for you.

Another option would be to install Acrobat 3, and use it to index your
PDFs.

A third option - and this is the one I'd like you to try first -
is to apply the following patch to htdig 3.1.2, to get it to work
with Acrobat 4.  You'll also have to remove the -pairs option from any
pdf_parser definition you've added to your htdig.conf file, if you defined
this attribute.  I've made the change to the 3.2 development source,
but haven't had a chance to test it out yet.  (Getting the current 3.2
development source to compile and install successfully on my system
is proving to be a major ordeal.)  I'd appreciate any feedback as to
whether this works.
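
As an illustration of the htdig.conf change, assuming acroread lives in
/usr/local/bin (an example path only; adjust it for your system), a
pdf_parser definition for Acrobat 4 with the patch applied might look
like this:

```
# htdig.conf: pdf_parser for Acrobat 4, with -pairs removed
# (/usr/local/bin is only an example location for acroread)
pdf_parser: /usr/local/bin/acroread -toPostScript
```

If you never defined pdf_parser yourself, there is nothing to edit here;
the patch changes the compiled-in default.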

In the htdig-3.1.2 source directory, "patch -p0 < {this e-mail message}":
---------------------
--- htdig/PDF.cc.orig   Tue Mar 23 17:17:33 1999
+++ htdig/PDF.cc        Fri Aug 13 16:05:16 1999
@@ -104,13 +104,22 @@ PDF::parse(Retriever &retriever, URL &ur
         acroread = "acroread";
 
     // Check for existance of acroread program! (if not, return)
-    //struct stat stat_buf;
-    // Check that it exists, and is a regular file. 
-    //if ((stat(acroread, &stat_buf) == -1) || !S_ISREG(stat_buf.st_mode))
-    //  {
-    // printf("PDF::parse: cannot find acroread\n");
-    // return;
-    //  }
+    struct stat stat_buf;
+    static int notfound = 0;
+    if (notfound)      // we only need to complain once
+       return;
+    String arg0 = acroread;
+    char *endarg = strchr(arg0.get(), ' ');
+    if (endarg)
+       *endarg = '\0';
+    // If first arg is a path, check that it exists, and is a regular file. 
+    if (strchr(arg0.get(), '/') &&
+       ((stat(arg0.get(), &stat_buf) == -1) || !S_ISREG(stat_buf.st_mode)))
+    {
+       printf("PDF::parse: cannot find pdf parser %s\n", arg0.get());
+       notfound = 1;
+       return;
+    }
 
     // Write the pdf contents in a temp file to give it to acroread
 
@@ -140,9 +149,19 @@ PDF::parse(Retriever &retriever, URL &ur
 
 
     // Use acroread as a filter to convert to PostScript.
-    // Now generalized to allow xpdf as a parser (works with most recent xpdf)
+    // Now generalized to allow xpdf as a parser, or other compatible parsers
+    // (It was claimed it works with most recent xpdf, but it doesn't!)
     //    acroread << " -toPostScript " << pdfName << " " << tmpdir << " 2>&1";
-    acroread << " " << pdfName << " " << psName << " 2>&1";
+    String dest = psName;
+    if (strstr(acroread.get(), "acroread"))
+    {
+       // special-case tests only for acroread (what else you gonna use?)
+       if (!strstr(acroread.get(), "-toPostScript"))
+           acroread << " -toPostScript ";      // add missing option
+       if (!strstr(acroread.get(), "-pairs"))  // don't use -pairs with 4.0
+           dest = tmpdir;
+    }
+    acroread << " " << pdfName << " " << dest << " 2>&1";
 
     if (system(acroread))
     {
--- htcommon/defaults.cc.orig   Thu Mar 25 11:49:40 1999
+++ htcommon/defaults.cc        Fri Aug 13 16:05:16 1999
@@ -21,7 +21,7 @@ ConfigDefaults        defaults[] =
     {"database_dir",                   DATABASE_DIR},
     {"bin_dir",                                BIN_DIR},
     {"image_url_prefix",               IMAGE_URL_PREFIX},
-    {"pdf_parser",                      PDF_PARSER " -toPostScript -pairs"},
+    {"pdf_parser",                      PDF_PARSER " -toPostScript"},
     {"version",                                VERSION},
 
     //
--- htdoc/attrs.html.orig       Thu Mar 25 11:35:03 1999
+++ htdoc/attrs.html    Fri Aug 13 16:08:20 1999
@@ -4025,7 +4025,7 @@
                        <em>default:</em>
                  </dt>
                  <dd>
-                       acroread -toPostScript -pairs
+                       acroread -toPostScript
                  </dd>
                  <dt>
                        <em>description:</em>
@@ -4037,20 +4037,44 @@
                      <em>infile outfile</em>,<br>
                      where <em>infile</em> is a file to parse and
                      <em>outfile</em> is the PostScript output of the
-                     parser. The program is supposed to convert to a
+                     parser. In the case where acroread is the parser, and
+                     the -pairs option is not given, the second parameter
+                     will be the output directory rather than the output
+                     file name. The program is supposed to convert to a
                      variant of PostScript, which is then parsed
-                     internally. Currently, Adobe's <a
+                     internally. Currently, only Adobe's <a
                      href="http://www.adobe.com/prodindex/acrobat/readstep.html">
-                     acroread</a> program and the pdftops program
-                     that is part of the <a
+                     acroread</a> program has been tested as a pdf_parser.
+                     There is a bug in Acrobat 4's acroread command, which
+                     causes it to fail when -pairs is used, hence the special
+                     case above.<br>
+                      The pdftops program that is part of the <a
                      href="http://www.foolabs.com/xpdf/">xpdf</a>
-                     0.80 package have been tested as pdf_parsers.
+                     package is not suitable as a pdf_parser,
+                     because its variant of PostScript is slightly
+                     different.  However, an alternative is to
+                     use xpdf's pdftotext program as a component
+                     of an <a href="#external_parsers">external
+                     parser</a> with the xpdf 0.90 package installed
+                     on your system, as described in FAQ question <a
+                     href="FAQ.html#q4.9">4.9</a>.<br>
+                      In either case, to successfully index PDF files,
+                     be sure to set the <a
+                     href="#max_doc_size">max_doc_size</a> attribute
+                     to a value larger than the size of your largest
+                     PDF file. PDF documents can not be parsed if they
+                     are truncated.
+                       <p>
+                         The default value of this attribute is determined at
+                         compile time, to include the path to the acroread
+                         executable.
+                       </p>
                  </dd>
                  <dt>
                        <em>example:</em>
                  </dt>
                  <dd>
-                       pdf_parser: /usr/local/bin/acroread -toPostScript -pairs
+                       pdf_parser: /usr/local/Acrobat3/bin/acroread -toPostScript -pairs
                  </dd>
                </dl>
          </dd>
---------------------
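
For anyone reading the patch rather than applying it, the following is a
rough standalone sketch of the two pieces of logic it adds to PDF::parse():
the check on the parser path, and the choice of output destination.  It is
only an approximation under some assumptions: it uses std::string instead
of ht://Dig's own String class, and the function names parserLooksUsable
and buildParserCommand are made up for this illustration and do not exist
in the htdig source.

```cpp
#include <cassert>
#include <string>
#include <sys/stat.h>

// Hypothetical helper: mirrors the patched existence check.
// Only the first word of the pdf_parser value is examined, and only
// when it contains a '/', i.e. when it is given as a path.
bool parserLooksUsable(const std::string &parser)
{
    // substr(0, npos) yields the whole string when there is no space
    std::string arg0 = parser.substr(0, parser.find(' '));
    if (arg0.find('/') == std::string::npos)
        return true;    // bare command name: leave the lookup to PATH
    struct stat stat_buf;
    return stat(arg0.c_str(), &stat_buf) == 0 && S_ISREG(stat_buf.st_mode);
}

// Hypothetical helper: mirrors how the patched code builds the command.
// pdfName and psName are the temporary input and output file names,
// tmpdir is the directory that holds them.
std::string buildParserCommand(std::string parser,
                               const std::string &pdfName,
                               const std::string &psName,
                               const std::string &tmpdir)
{
    std::string dest = psName;
    if (parser.find("acroread") != std::string::npos)
    {
        // special cases apply to acroread only
        if (parser.find("-toPostScript") == std::string::npos)
            parser += " -toPostScript ";    // add missing option
        if (parser.find("-pairs") == std::string::npos)
            dest = tmpdir;  // without -pairs, acroread wants a directory
    }
    return parser + " " + pdfName + " " + dest + " 2>&1";
}
```

With "acroread -toPostScript" as the parser, the second argument passed to
acroread is the temporary directory; with -pairs present (Acrobat 3 style),
or with any parser whose name does not contain "acroread", it is the
PostScript output file name.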

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
