Hi,

I've tried doc2html.pl, but I'm still having problems indexing Word documents.
I have the following line in the doc2html.pl script:

#version of catdoc for Word6, Word7 & Word97 files:
my $CATDOC = '/usr/local/bin';

And I have the catdoc package installed (with the catdoc binaries in
/usr/local/bin and /usr/local/lib).
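
One thing that may be worth ruling out: the $CATDOC setting above names a
directory, while the conv_doc.pl line quoted further down names the binary
itself. The sketch below tests whether a configured path is an executable
file rather than a directory. It is only an illustration in Python
(doc2html.pl itself is Perl); the function name is_executable_file is
hypothetical, and the location of the catdoc binary is assumed from the
install described above.

```python
import os


def is_executable_file(path):
    """Return True if path is a regular file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


if __name__ == "__main__":
    # First path: the value currently set in doc2html.pl (a directory).
    # Second path: where the catdoc binary is assumed to be installed.
    for candidate in ("/usr/local/bin", "/usr/local/bin/catdoc"):
        print(candidate, "->", is_executable_file(candidate))
```

A directory always fails this check, so a setting that stops at the
directory name would not be usable as a command path.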

After changing the line in htdig.conf to use doc2html.pl, I get the
following error message when rundig tries to index Word documents:

http://10.5.1.35/sme/micro/test.doc: !          UNABLE to convert  size =
11264

Any suggestions on what could be wrong?

Also, is there any benefit to using doc2html.pl over conv_doc.pl to index
PDF documents for htDig?

I have been using conv_doc.pl and it has been giving me very satisfactory
results. I did try doc2html.pl as well to see if there was any difference,
but with doc2html.pl I found that fewer PDF documents were indexed and
all the excerpts in the htDig search results were garbled.

(e.g. 0000000016 00000 n 0000001025 00000 n 0000001337 00000 n 0000001543
00000 n 0000001750 00000 n 0000001789 00000 n 0000002255 00000 n 0000002455
00000 n 0000002643 00000 n 0000003042 00000 n 0000003064 00000 n 00000).

Thanks for your help,

Shams

----- Original Message -----
From: "Gilles Detillieux" <[EMAIL PROTECTED]>
To: "shams khan" <[EMAIL PROTECTED]>
Cc: "ht://Dig" <[EMAIL PROTECTED]>
Sent: Monday, November 04, 2002 9:52 PM
Subject: Re: [htdig] using conv_doc.pl to index MS Word documents


> According to shams khan:
> > I've used conv_doc.pl (with XPDF) to index PDF documents.  I am now
> > trying to index MS Word documents, but am having problems.
> >
> > I've copied the conv_doc.pl script into /usr/local/bin, which contains
> > the line:
> >
> > $CATDOC = "/usr/local/bin/catdoc";
> >
> > I've installed the CATDOC package (which has placed the catdoc binary in
> > /usr/local/bin and /usr/local/lib)
> >
> > I've placed the following line within the htdig.conf file:
> >
> > application/msword->text/html /usr/local/bin/conv_doc.pl
> >
> > But when I try to re-index my website (this time, with the hope of
> > indexing Word documents too), I get the following error message, which
> > appears next to the Word documents:
> >
> > test.doc: can't determine type of file
> > /var/www/html/htdig/dv/htdex.8KvYOL; content-type: application/msword; URL:
> > http://10.5.1.35/sme/micro/management_self_assessment_guide/test/doc size =
> > 11264
>
> I suggest you try doc2html.pl instead of conv_doc.pl.  conv_doc allows
> only one "magic number" for recognizing Word documents, whereas I think
> doc2html allows a few different ones.  Not all Word documents have the
> same identifying byte sequence at the start.
>
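
The magic-number check described above can be sketched roughly as follows.
This is an illustration only, not the code from conv_doc.pl or doc2html.pl:
the names WORD_SIGNATURES, sniff_word_format and sniff_file are hypothetical,
and the signature list is a small assumed sample rather than the set either
script actually tests for.

```python
# Illustrative signatures only; not the list used by conv_doc.pl or doc2html.pl.
WORD_SIGNATURES = {
    # OLE2 compound file header, used by most binary .doc files.
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE2 compound file",
    # Assumed signature for older Word 2.0 for Windows files.
    b"\xdb\xa5\x2d\x00": "Word 2.0 (assumed signature)",
}


def sniff_word_format(header):
    """Return a description if header starts with a known signature, else None."""
    for signature, description in WORD_SIGNATURES.items():
        if header.startswith(signature):
            return description
    return None


def sniff_file(path):
    """Read the first 8 bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_word_format(handle.read(8))
```

A converter that knows only the first signature would reject a file that
begins with the second one, which matches the "can't determine type of
file" symptom when a document starts with an unexpected byte sequence.
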
> --
> Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
> Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
> Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
>
>
> -------------------------------------------------------
> This SF.net email is sponsored by: ApacheCon, November 18-21 in
> Las Vegas (supported by COMDEX), the only Apache event to be
> fully supported by the ASF. http://www.apachecon.com
> _______________________________________________
> htdig-general mailing list <[EMAIL PROTECTED]>
> To unsubscribe, send a message to
<[EMAIL PROTECTED]> with a subject of unsubscribe
> FAQ: http://htdig.sourceforge.net/FAQ.html
>

