Here is a Perl script that uses the catdoc program (version 0.90).
Download catdoc from the URL given in the script below, unpack the
tar.gz, run ./configure etc. Set $CATDOC to the path of the catdoc binary.
Set external_parsers in your htdig configuration to something like:

external_parsers: application/msword /usr/local/htdig/external_parsers/bin/parse_word_doc.pl

And it should run. Note that catdoc is a beta release and sometimes fails
to parse a Word document. This Perl script also takes a long time on large
Word files.
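
Since catdoc can fail on some documents, it may be worth checking the exit status of the converter after reading its output. The following is only a minimal sketch: run_and_collect is a hypothetical helper, a Perl one-liner stands in for catdoc so the example runs anywhere, and it is an assumption that catdoc signals a failed parse with a non-zero exit status. The list form of open needs Perl 5.8 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run a command, collect its output lines and
# return (wait_status, \@lines). The command is passed as a list,
# so no shell is involved.
sub run_and_collect {
    my @cmd = @_;
    open(my $fh, "-|", @cmd) or return (-1, []);
    my @lines = <$fh>;
    close($fh);               # close on a pipe sets $? to the child's status
    my $status = $?;
    return ($status, \@lines);
}

# Stand-in for catdoc: prints one line, then exits with status 3.
my ($status, $lines) = run_and_collect($^X, "-e", 'print "some text\n"; exit 3');

if ($status != 0) {
    my $code = $status >> 8;  # exit code lives in the high byte
    print STDERR "converter failed with exit code $code\n";
}
```

In a real parser the stand-in command would be replaced by $CATDOC and its arguments, and the script could skip the document when the status is non-zero.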
--jesse
#!/usr/local/gnu/bin/perl
#########################################
#
# set this to your catdoc proggie
#
# get it from: http://www.fe.msk.ru/~vitus/catdoc/
#
$CATDOC = "/usr/local/htdig/external_parsers/catdoc/bin/catdoc";
# need some var's
#empty array
@allwords = ();
$x = 0;
$line = "";
@fields = ();
$calc = 0;
#
# okay. my programming style isn't that nice, but it works...
#for ($x=0; $x<@ARGV; $x++) {
# print STDERR "$ARGV[$x]\n";
#}
open(CAT, "$CATDOC -a -w $ARGV[0] |") || die "Hmmm. Something is wrong.\n";
while ($line = <CAT>) {
    @fields = split(/\s+/,$line);
    for ($x=0; $x<@fields; $x++) {
        if ($fields[$x] =~ /\w/) {
            # push appends in place instead of copying the whole array each time
            push(@allwords, $fields[$x]);
        }
    }
}
close CAT;
#############################################
# print out the title
print "t\tWord Document $ARGV[2]\n";
#############################################
# print out the head
$calc = @allwords;
print "h\t";
#if ($calc >100) { # but not more than 100 words
# $calc = 100;
#}
for ($x=0; $x<$calc; $x++) {
    print "$allwords[$x] ";
}
print "\n"; # end the head record before the word records start
#############################################
# now the words
for ($x=0; $x<@allwords; $x++) {
    $calc=int(1000*$x/@allwords); # calculate rel. position (0-1000)
    print "w\t$allwords[$x]\t$calc\t0\n"; # print out word, rel. pos. and text type (0)
}
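
The output records the script above emits (a "t" title line, an "h" head line, then one "w" line per word with a relative position and text type 0) can be sketched as a small standalone function. This is only an illustration of the record layout as the script uses it: build_records and the file name example.doc are made up for the example, and the head line is joined with single spaces rather than printed word by word.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the records from a title and a word list.
# Returns one "t" line, one "h" line, then one "w" line per word.
sub build_records {
    my ($title, @words) = @_;
    my @out;
    push @out, "t\t$title\n";
    push @out, "h\t" . join(" ", @words) . "\n";
    my $total = @words;
    for my $i (0 .. $#words) {
        # relative position of the word in the document, scaled to 0-999
        my $pos = int(1000 * $i / $total);
        push @out, "w\t$words[$i]\t$pos\t0\n";
    }
    return @out;
}

print build_records("Word Document example.doc", qw(alpha beta gamma delta));
```

With the four sample words, the positions come out as 0, 250, 500 and 750. An empty word list yields only the title and an empty head line, so there is no division by zero.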
BLKA.DEZ54 wrote:
>
> Can anyone give me an example of how to embed a Word viewer into htdig
> (htparsedoc or something else)?
>
> Thanks
----------------------------------------------------------------------
To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the body of the message.