> According to David Adams:
> > I have downloaded the parse_doc.pl script, and the xpdf and catdoc
> > utilities, and I am now using them to extend our search index to include
> > Word and PDF files. It all works well and with a bit of alteration to
> > the Perl script does exactly what I want. My thanks to the developers!
>
> I forgot to ask before, what were your alterations? Something very
> specific to your needs, or something worth sharing with others?
>
> --
> Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

Well, since you ask, I noticed two problems with PDF files on our site:

1. The titles were often meaningless, having no connection with
   the contents.
2. pdftotext outputs some spurious non-ASCII gibberish that is
   then indexed.

I modified the code which outputs the title to always include the
type, and to put any extracted title in double quotes or the filename
in square brackets:

    # if no title use filename from URL
    if (not length($title)) {
        $title = $ARGV[2];
        $title =~ s#^.*/##;
        $title = '[' . $title . ']';
    } else {
        $title = '"' . $title . '"';
    }
    print "t\t$title ($type Document)\n";

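For anyone who wants to try the title rule outside the script, here is a minimal standalone sketch. The helper name format_title and the example URL are hypothetical and are not part of parse_doc.pl. The sketch takes the URL as a parameter instead of reading $ARGV[2].

```perl
use strict;
use warnings;

# Hypothetical helper (not part of parse_doc.pl): build the title text
# from an extracted title, the document URL, and a type label.
sub format_title {
    my ($title, $url, $type) = @_;
    if (not length($title)) {
        # No usable title: fall back to the last path component of the URL
        ($title = $url) =~ s#^.*/##;
        $title = '[' . $title . ']';
    } else {
        # An extracted title is shown in double quotes
        $title = '"' . $title . '"';
    }
    return "$title ($type Document)";
}

# Example URL is made up for illustration.
print "t\t", format_title('', 'http://www.example.com/docs/report.pdf', 'PDF'), "\n";
print "t\t", format_title('Annual Report', 'http://www.example.com/docs/report.pdf', 'PDF'), "\n";
```

The brackets and quotes let a reader of the search results tell at a glance whether the title came from the document or was derived from the filename.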
To throw away the spurious "words" I simplified the code to replace
all non-alphanumerics with spaces. I appreciate that many people would
think that too drastic:

    while (<CAT>) {
        while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) {
            my $next = <CAT>;
            last unless defined $next;   # stop at end of input
            $_ .= $next;
            s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/;
        }
        $head .= " " . $_;
        # s/\s+[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\$
        # s/[\255]/-/g;                 # replace dashes with $
        s/\W/ /g;                       # replace non-alphanumeric characters with spaces
        s/\s+/ /g;                      # replace multiple spaces, etc. with a single space
        @fields = split;                # split up line
        next if (@fields == 0);         # skip if no fields (do$
        for ($x=0; $x<@fields; $x++) {  # check each field if s$
            if (length($fields[$x]) >= $minimum_word_length) {
                push @allwords, $fields[$x];   # add to list
            }
        }
    }

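A possible middle ground for those who find \W too drastic is sketched below. As far as I know, when neither a locale nor Unicode string semantics is in effect, \w covers only ASCII letters, digits and underscore for 8-bit text, so s/\W/ /g would also blank the accented letters that the dehyphenation test allows. The helper name extract_words and the sample line are hypothetical, and treating the whole \300-\377 range as letters is an assumption carried over from the existing regex.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line of extracted text into indexable words.
# The character class keeps ASCII letters, digits, and the Latin-1 range
# \300-\377 (assumed to be letters, as in the dehyphenation test).
sub extract_words {
    my ($line, $minimum_word_length) = @_;
    $line =~ s/[^A-Za-z0-9\300-\377]+/ /g;   # everything else becomes a space
    # split ' ' discards leading whitespace and splits on runs of whitespace
    return grep { length($_) >= $minimum_word_length } split ' ', $line;
}

# Sample line is made up: one accented word, punctuation, and some noise.
my @words = extract_words("The caf\351 (re-opened) at 9:30 ~\@#!\n", 3);
print join(',', @words), "\n";
```

With a minimum length of 3, the sample line yields the words "The", "caf\351" and "opened". The shorter fragments and the punctuation are dropped.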
The spurious output is no longer indexed, but it does remain in the head,
so there is further room for improvement.
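One way the head could be tidied is to filter each line before it is appended. This is only a sketch under stated assumptions. The helper name clean_for_head is hypothetical, and the choice of which bytes count as gibberish (anything outside printable ASCII and \300-\377) is a guess that would need checking against real pdftotext output.

```perl
use strict;
use warnings;

# Hypothetical helper: remove suspect bytes from text destined for the head,
# while leaving ordinary punctuation readable.
sub clean_for_head {
    my ($line) = @_;
    $line =~ s/[^\040-\176\300-\377]+/ /g;  # control chars and bytes \177-\277 become a space
    $line =~ s/\s+/ /g;                      # collapse runs of whitespace
    $line =~ s/^ | $//g;                     # trim a leading or trailing space
    return $line;
}

# Sample lines are made up: a form feed and two stray 8-bit bytes.
my $head = '';
for my $line ("Intro\014 text,\n", "with \201\202 stray bytes.\n") {
    $head .= ' ' . clean_for_head($line);
}
print "$head\n";
```

In the loop above this would mean appending the cleaned copy to $head, while the stricter substitution continues to run on $_ for indexing.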
--
David J Adams
<[EMAIL PROTECTED]>
Computing Services
University of Southampton