> There's still a bit more work to be done. Patrick mentioned that
> pdftotext changed hyphens to spaces. Not so, but parse_doc.pl does.
> In fact, it converts all punctuation to spaces, to separate out the words.
> The problem is right now, the word list is what it spits out for the
> "h" record as well. So there's no punctuation at all in the excerpts!

OK, here's take 2 of my parse_doc.pl patch to support pdftotext.
Apart from some cleanup and the same additions as my earlier (now
obsolete) patch, it builds a separate string for the head record and
processes it the same way htdig processes plain text files, so the
excerpts keep their punctuation.
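
Here is a minimal sketch of the head-record cleanup, pulled out into a standalone script for anyone who wants to try it without the rest of the parser. The helper name make_head and the sample input are made up for illustration. The entity escaping assumes the excerpt should come out HTML-safe, as it would for a plain text file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# make_head is a hypothetical helper name, not part of parse_doc.pl.
# It takes the raw lines from the converter and returns the excerpt text.
sub make_head {
    my @lines = @_;
    my $head = "";
    $head .= " " . $_ for @lines;   # keep punctuation; join lines with a space
    $head =~ s/^\s+//;              # trim leading whitespace
    $head =~ s/\s+$//;              # trim trailing whitespace
    $head =~ s/\s+/ /g;             # collapse whitespace runs, newlines included
    $head =~ s/&/&amp;/g;           # escape & first so later entities survive
    $head =~ s/</&lt;/g;
    $head =~ s/>/&gt;/g;
    return $head;
}

# Sample input only; in the parser the lines come from the converter's output.
print "h\t", make_head("  Smith & Jones:\n", "a <b> c -- well-known.\n"), "\n";
```

Unlike the word list, this string still has its hyphens and other punctuation, because it is built from each line before the punctuation substitutions run.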

It seems to work like a charm on my PDFs (with the patch to pdftotext
I posted earlier). I'd like a few other PDF users to try it out as an
external parser for application/pdf documents on their systems. Also,
if anyone with more Perl experience than me (going on a few hours now)
can critique the code - either my changes or the original code - I'd
appreciate the edification.

You can pick up the latest script from
http://www.scrc.umanitoba.ca/htdig/rpms/parse_doc.pl
or apply the patch below. This patch should be applied to the original
contrib/parse_doc.pl shipped with htdig-3.1.1.tar.gz:

--- contrib/parse_doc.pl.nopdf Tue Feb 16 23:03:39 1999
+++ contrib/parse_doc.pl Thu Feb 25 15:16:43 1999
@@ -10,9 +10,15 @@
# Changed: push line semi-colomn wrong. <[EMAIL PROTECTED]>
# Changed: matching works for end of lines now <[EMAIL PROTECTED]>
# Added: option to rigorously delete all punctuation <[EMAIL PROTECTED]>
+#
+# 1999/02/09
# Added: option to delete all hyphens <[EMAIL PROTECTED]>
-# Changed: uses ps2ascii to handle PS files <[EMAIL PROTECTED]>
+# Added: uses ps2ascii to handle PS files <[EMAIL PROTECTED]>
+# 1999/02/15
# Added: check for some file formats <[EMAIL PROTECTED]>
+# 1999/02/25
+# Added: uses pdftotext to handle PDF files <[EMAIL PROTECTED]>
+# Changed: generates a head record with punct. <[EMAIL PROTECTED]>
#########################################
#
# set this to your MS Word to text converter
@@ -34,8 +40,14 @@
# get it from the ghostscript 3.33 (or later) package
#
$CATPS = "/usr/bin/ps2ascii";
+#
+# set this to your PDF to text converter
+# get it from the xpdf 0.80 package at http://www.foolabs.com/xpdf/
+#
+$CATPDF = "/usr/bin/pdftotext";
# need some var's
+$head = "";
@allwords = ();
@temp = ();
$x = 0;
@@ -57,6 +69,10 @@
$parser = $CATPS; # gs 3.33 leaves _temp_.??? files in .
$parsecmd = "(cd /tmp; $parser; rm -f _temp_.???) < $ARGV[0] |";
$type = "PostScript";
+} elsif ($magic =~ /%PDF-/) { # it's PDF (Acrobat)
+ $parser = $CATPDF;
+ $parsecmd = "$parser $ARGV[0] - |";
+ $type = "PDF";
} elsif ($magic =~ /WPC/) { # it's WordPerfect
$parser = $CATWP;
$parsecmd = "$parser $ARGV[0] |";
@@ -77,6 +93,7 @@
# open it
open(CAT, "$parsecmd") || die "Hmmm. $parser doesn't want to be opened using pipe.\n";
while (<CAT>) {
+ $head .= " " . $_;
s/\s[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]\s|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]$/ /g; # replace reading-chars with space (only at end or begin of word)
# s/[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]/ /g; # rigorously replace all by <[EMAIL PROTECTED]>
s/-/ /g; # replace hyphens with space
@@ -101,15 +118,22 @@
#############################################
# print out the head
-$calc = @allwords;
-print "h\t";
-#if ($calc >100) { # but not more than 100 words
-# $calc = 100;
+$head =~ s/^\s+//g;
+$head =~ s/\s+$//g;
+$head =~ s/\s+/ /g;
+$head =~ s/&/\&amp\;/g;
+$head =~ s/</\&lt\;/g;
+$head =~ s/>/\&gt\;/g;
+print "h\t$head\n";
+#$calc = @allwords;
+#print "h\t";
+##if ($calc >100) { # but not more than 100 words
+## $calc = 100;
+##}
+#for ($x=0; $x<$calc; $x++) { # print out the words for the exerpt
+# print "$allwords[$x] ";
#}
-for ($x=0; $x<$calc; $x++) { # print out the words for the exerpt
- print "$allwords[$x] ";
-}
-print "\n";
+#print "\n";
#############################################
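
For anyone who wants to check which branch a given file will take before wiring the script into htdig, here is a small standalone sketch of the magic test. The sub names guess_type and read_magic and the 8-byte read are assumptions for illustration. Only the /%PDF-/ and /WPC/ patterns come from the patch context above, and the other formats parse_doc.pl handles are left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# guess_type is a hypothetical name. Only these two patterns are taken from
# the patch context; the script's other format checks are omitted here.
sub guess_type {
    my ($magic) = @_;
    return "PDF"         if $magic =~ /%PDF-/;   # it's PDF (Acrobat)
    return "WordPerfect" if $magic =~ /WPC/;     # it's WordPerfect
    return "unknown";
}

# read_magic is also hypothetical: it returns the first bytes of a file.
# The default of 8 bytes is an assumption, not what parse_doc.pl reads.
sub read_magic {
    my ($file, $len) = @_;
    open(my $fh, '<', $file) or die "Can't open $file: $!\n";
    binmode($fh);
    my $magic = "";
    read($fh, $magic, $len || 8);
    close($fh);
    return $magic;
}

# With a file name on the command line, report the guessed type.
if (@ARGV) {
    print "$ARGV[0]: ", guess_type(read_magic($ARGV[0])), "\n";
}
```

Run it with a file name as the only argument. A file reported as "PDF" here is one the patched script would hand to pdftotext.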
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930